Nginx Couldn't Handle 150K Daily Active Users: Two Days of Tuning Tripled the QPS

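The week our daily active users grew from 50,000 to 150,000, Nginx started to struggle. During peak hours 502s came in waves, users reported pages that would not load, and error.log was full of upstream timed out. After two days of troubleshooting and tuning, QPS went from about 500 to about 1,500 and the 502s mostly disappeared. These are my notes on the process.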

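## What went wrong

Symptoms:

- Frequent 502 Bad Gateway responses during peak hours
- Response time jumped from 200 ms to more than 3 seconds
- CPU stayed above 80% for long stretches
- error.log flooded with: upstream timed out (110: Connection timed out) while reading response header from upstream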

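The log says it plainly: the backend was not answering in time, so Nginx gave up waiting and closed the connection.

## Checking the system state first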

tail -f /var/log/nginx/error.log

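# CPU and memory
top

# Current number of established connections
netstat -an | grep ESTABLISHED | wc -l

# Rough count of open files (lsof also lists memory-mapped files and working directories, so this overstates descriptor usage)
lsof | wc -l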

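Several problems stood out: the Nginx worker processes were burning CPU, the connection count was close to the system limit, and file descriptors were nearly exhausted. There was no single culprit, so the settings had to be tuned one by one.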
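To put a number on "nearly exhausted", it helps to compare each process's open descriptors with the limit that actually applies to it. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names `count_fds` and `fd_soft_limit` are made up for illustration, and reading another user's /proc entries usually requires root.

```shell
# Count open file descriptors of one PID by listing /proc/<pid>/fd (Linux only).
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print the soft "Max open files" limit in effect for a running PID.
fd_soft_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Example loop over all nginx processes (run as root):
#   for pid in $(pgrep -x nginx); do
#       echo "$pid: $(count_fds "$pid") of $(fd_soft_limit "$pid")"
#   done
```

If the first number sits close to the second for the worker processes, the descriptor limit is the thing to raise.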

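## What I changed

### Worker processes

Nginx defaults to a single worker process, so on this 4-core server only one core was doing the work. Setting it to auto makes Nginx match the number of cores: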

# /etc/nginx/nginx.conf
worker_processes auto;
worker_cpu_affinity auto;

After the change, top showed 4 workers and the CPU load was spread much more evenly.

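### Connection limits

The default worker_connections value (512) is too small and fills up immediately under high concurrency: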

events {
    worker_connections 10240;
    use epoll;
    multi_accept on;
}

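The system limits have to be raised as well, otherwise the kernel remains the bottleneck. Note that limits.conf is applied by PAM to login sessions, so when Nginx is started by systemd these values may not reach it; setting worker_rlimit_nofile in nginx.conf (or LimitNOFILE in the unit file) is the more dependable way to raise the descriptor limit for the workers.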

# /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535

# /etc/sysctl.conf
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 8192

sysctl -p
systemctl restart nginx
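As a back-of-the-envelope check on these numbers: worker_connections counts every connection a worker holds, including the ones to the upstream, so a reverse proxy can serve roughly half that many clients at once. The sketch below only does that arithmetic; `max_proxied_clients` is a hypothetical helper, and the halving is an approximation that ignores upstream keepalive reuse.

```shell
# Rough ceiling on simultaneous proxied clients:
# workers * worker_connections, halved because each proxied request
# holds one client connection plus one upstream connection.
max_proxied_clients() {
    echo $(( $1 * $2 / 2 ))
}

max_proxied_clients 4 10240   # 4 workers x 10240 connections, prints 20480
```

Each worker also needs at least worker_connections file descriptors, which is why the descriptor limit has to go up together with this setting.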

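### Gzip compression

Before Gzip was enabled, a 50 KB HTML page and a 200 KB JS file were sent uncompressed. With Gzip on, the HTML dropped to 15 KB and the JS to 60 KB, which saved a lot of bandwidth: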

http {
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_comp_level 6;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
    gzip_disable "msie6";
}

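I used compression level 6. Going higher improves the ratio only slightly while roughly doubling the CPU cost, so it is not worth it.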
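The trade-off between levels is easy to measure on your own assets before picking one. This is a small sketch using the gzip command-line tool, which uses the same DEFLATE levels 1 to 9; `gzip_size` is a hypothetical helper and `./app.js` is a placeholder path.

```shell
# Print the gzip-compressed size in bytes of file $1 at compression level $2 (1-9).
gzip_size() {
    gzip -"$2" -c "$1" | wc -c
}

# Compare a few levels on a sample asset (the path is a placeholder):
#   for lvl in 1 6 9; do
#       echo "level $lvl: $(gzip_size ./app.js "$lvl") bytes"
#   done
```

Prefixing the call with `time` gives a rough idea of the CPU cost at each level.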

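### Static asset caching

Images, CSS, and JS rarely change, so browsers can cache them for 30 days to cut repeat requests. Keep in mind that immutable is only safe when file names change with their content (for example hashed file names), because browsers will not revalidate the file until the cache entry expires: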

location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
    expires 30d;
    add_header Cache-Control "public, immutable";
}

location ~* \.(html)$ {
    expires 1h;
    add_header Cache-Control "public";
}
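To confirm the caching rules are applied, inspect the response headers. The filter below is a sketch; `cache_headers` is a made-up name and the URL in the usage comment is a placeholder. With `expires 30d` plus the `add_header` line, the response will most likely carry two Cache-Control lines: `max-age=2592000` from `expires` and `public, immutable` from `add_header`.

```shell
# Read raw HTTP response headers on stdin and keep only the caching-related ones.
cache_headers() {
    tr -d '\r' | grep -i -E '^(cache-control|expires):'
}

# Usage (the URL is a placeholder):
#   curl -sI https://example.com/static/app.js | cache_headers
```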

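### Timeouts

The previous timeouts were too short, so Nginx cut the connection while the backend was still working. I set them to 60 seconds, which is also the Nginx default for the three proxy_*_timeout directives: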

http {
    proxy_connect_timeout 60s;
    proxy_send_timeout 60s;
    proxy_read_timeout 60s;
    
    keepalive_timeout 65;
    keepalive_requests 100;
}

### HTTP/2

HTTP/2 multiplexes many requests over a single TCP connection, so fewer handshakes are needed:

server {
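    # On Nginx 1.25.1 and later the http2 parameter of listen is deprecated; use a separate "http2 on;" line instead
    listen 443 ssl http2;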
    
    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;
    
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;
    ssl_protocols TLSv1.2 TLSv1.3;
}
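A quick way to confirm that HTTP/2 is actually negotiated is to ask curl which protocol version it ended up using. This is a sketch, assuming a curl build with HTTP/2 support; `negotiated_http_version` is a hypothetical wrapper and the URL in the usage comment is a placeholder.

```shell
# Print the HTTP version curl negotiated with the given URL ("2" for HTTP/2, "1.1" otherwise).
negotiated_http_version() {
    curl -s -o /dev/null --http2 -w '%{http_version}\n' "$1"
}

# Usage (the URL is a placeholder):
#   negotiated_http_version https://example.com/
```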

### Upstream connection pool

There are two backend application processes. A keepalive pool avoids opening a new TCP connection for every request:

upstream backend {
    server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8081 max_fails=3 fail_timeout=30s;
    
    keepalive 32;
}

location / {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}

### Rate limiting

While I was at it I added rate limiting, so nobody can take the server down by hammering an endpoint:

http {
    limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
    limit_conn_zone $binary_remote_addr zone=addr:10m;
}

server {
    location / {
        limit_req zone=one burst=20 nodelay;
        limit_conn addr 10;
    }
}

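That is 10 requests per second per IP with a burst allowance of 20, and at most 10 concurrent connections per IP. Normal users are unaffected, while most crawlers and attacks are blocked.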
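To get a feel for how rate and burst interact, here is a simplified leaky-bucket model of `limit_req` with `nodelay`. It is an illustration under stated assumptions, not Nginx's actual implementation (which works in integer milliseconds); `simulate_limit_req` is a made-up name. It reads request arrival times in milliseconds for a single client and reports how many pass.

```shell
# Simplified leaky-bucket model of limit_req with nodelay.
# $1 = rate in requests/second, $2 = burst; stdin = arrival times in ms, ascending.
simulate_limit_req() {
    awk -v rate="$1" -v burst="$2" '
        # The first request creates the bucket and is always accepted.
        NR == 1 { last = $1; excess = 0; ok++; next }
        {
            # Drain the bucket for the elapsed time, then add this request.
            e = excess - rate * ($1 - last) / 1000 + 1
            if (e < 0) e = 0
            # Over the burst allowance: reject without updating the bucket.
            if (e > burst) { rejected++; next }
            excess = e; last = $1; ok++
        }
        END { printf "%d accepted, %d rejected\n", ok, rejected }
    '
}

# 30 requests arriving at the same instant, rate=10r/s burst=20
for i in $(seq 1 30); do echo 0; done | simulate_limit_req 10 20
# prints: 21 accepted, 9 rejected
```

In this model a steady client sending one request every 100 ms is never rejected, while a sudden spike is cut off after the first request plus the burst of 20. Rejected requests receive a 503 from Nginx by default.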

## Results after tuning

I ran a load test with ab:

ab -n 1000 -c 100 https://www.surepi.cn/

# Before tuning
Requests per second: 487
Time per request: 205ms

# After tuning
Requests per second: 1523
Time per request: 65ms
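The two lines ab reports are tied together: the mean time per request is the concurrency divided by the requests per second. The sketch below redoes that arithmetic for the run above (`-c 100`); `mean_time_per_request_ms` is a hypothetical helper and it truncates to whole milliseconds.

```shell
# Mean time per request in ms = concurrency * 1000 / requests per second.
mean_time_per_request_ms() {
    awk -v c="$1" -v rps="$2" 'BEGIN { printf "%d\n", c * 1000 / rps }'
}

mean_time_per_request_ms 100 487    # prints 205
mean_time_per_request_ms 100 1523   # prints 65
```

Since this figure follows directly from throughput at a fixed concurrency, it is worth also looking at ab's percentile table, and 1000 requests is a small sample, so a longer run gives steadier numbers.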

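Day-to-day numbers changed as well: throughput went from 500 QPS to 1,500 QPS, response time from 800 ms to 150 ms, CPU usage from 80% to 35%, and bandwidth from 100 Mbps to 40 Mbps. The most visible change was the 502 error rate falling from 5% to 0.1%, and user complaints have essentially stopped.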

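## A few common pitfalls

Set worker_processes according to the number of CPU cores: 4 on a 4-core machine, or simply auto. Do not set it higher, since going beyond the core count only adds context-switching overhead.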

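After changing the configuration, do not restart right away. Run nginx -t to check the syntax first, and if it passes, use systemctl reload nginx for a graceful reload that does not drop existing connections.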
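The check-then-reload habit is easy to wrap so the reload can never run against a broken configuration. This is a minimal sketch, assuming a systemd-managed Nginx; `safe_reload` is a made-up name.

```shell
# Reload nginx only if the configuration passes the syntax check.
# The && ensures systemctl never runs when nginx -t fails.
safe_reload() {
    nginx -t && systemctl reload nginx
}

# Usage (as root):
#   safe_reload
```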

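If things are still slow after all of this, the problem is most likely in the backend application rather than in Nginx. Nginx is only a proxy and cannot do anything about a slow backend. Backend optimization is a separate topic for another post.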
