Nginx Upstream Health Checks and Passive Circuit Breaking

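In the previous three chapters we finished the architecture design, covered the underlying principles, and built the health check service. Now it is finally time to get Nginx running.

This chapter walks step by step through configuring Nginx upstream health checks and passive circuit breaking, so that Nginx can automatically detect failed nodes and take them out of rotation.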

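1. The Role of Nginx in This Architecture

Recall the overall architecture from the earlier chapters.

At this layer, Nginx has three key responsibilities:

Responsibility	Description
Reverse proxy	Accepts all incoming requests and forwards them to the backend Master cluster
Health checking	Tracks whether each Master is responding and automatically takes failed nodes out of rotation
Load balancing	Distributes requests across multiple Master nodes (optional)

💡 In our architecture, Nginx and the backend Master are co-located (both run on the same physical machine). This means:
Each Nginx lists 127.0.0.1:8080, the local Master, as its first upstream server rather than another machine's IP (the other Masters appear as additional entries in Section 4.1)
Health checking primarily targets the Master service on the local machine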

2. Environment Preparation: Installing Nginx

Before configuring anything, install Nginx on each of the three machines (192.168.0.201, 192.168.0.202, 192.168.0.203).

2.1 Install Nginx

bash

# Refresh the package index
sudo apt update

# Install Nginx
sudo apt install -y nginx

# Check the version to confirm the installation succeeded
nginx -v
# Example output: nginx version: nginx/1.18.0 (Ubuntu)

2.2 Verify That Nginx Is Running

bash

# Check the Nginx service status
sudo systemctl status nginx

# Request the default Nginx page
curl http://localhost
# You should see the Nginx welcome page

2.3 Enable Nginx at Boot

bash

sudo systemctl enable nginx

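2.4 Confirm the Install Paths and Configuration File Locations

bash

# Show where Nginx is installed
whereis nginx
# nginx: /usr/sbin/nginx /etc/nginx /usr/share/nginx

# List the configuration directory
ls -la /etc/nginx/
# Pay attention to: nginx.conf, sites-available/, sites-enabled/

💡 Goal of this section: Nginx is installed on all three machines, ready for the upstream configuration that follows.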

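3. Two Modes of Nginx Health Checking

Nginx health checking comes in two modes:

Mode	How it works	Pros	Cons
Passive	Judges node health from the outcome of real requests (timeouts, refused connections, 5xx responses)	No extra configuration needed; active by default	A failed node must receive at least one real request before it is removed
Active	Nginx periodically sends its own probe requests (for example to `/health`)	Failures are detected early, without affecting business requests	Requires Nginx Plus or a third-party module

⚠️ Important: open-source Nginx does not support active health checks (sending probe requests on its own). Active checking is a feature of the commercial Nginx Plus.
However, combining passive checks with a Keepalived health check script achieves a comparable result, and it is a common choice in production. Keepalived configuration is covered in full in Part 5.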

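4. Configuring Passive Health Checks

Passive checking depends on a few core parameters:

Parameter	Purpose	Recommended value
`max_fails`	Number of failed attempts within `fail_timeout` after which the server is marked unavailable (default 1)	2
`fail_timeout`	Window in which failures are counted, and also how long the server stays unavailable before Nginx tries it again (default 10s)	10s
`proxy_connect_timeout`	Timeout for establishing a connection to the backend (default 60s)	3s
`proxy_read_timeout`	Timeout between two successive reads of the backend response (default 60s)	30s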

4.1 Basic upstream Configuration

On all three machines, rewrite the default Nginx site configuration file:

bash

sudo vim /etc/nginx/sites-available/default

Note that /etc/nginx/sites-available/default differs slightly from server to server, mainly in the upstream block.

Configuration on 192.168.0.201:

nginx

# /etc/nginx/sites-available/default
upstream master_backend {
    # Co-located deployment: proxy to port 8080 on the local machine
    server 127.0.0.1:8080 max_fails=2 fail_timeout=10s;

    # Other Master nodes
    server 192.168.0.202:8080 max_fails=2 fail_timeout=10s;
    server 192.168.0.203:8080 max_fails=2 fail_timeout=10s backup;

    # Keep connections to the upstream alive for better performance
    keepalive 32;
}

server {
    listen 80;
    server_name _;

    # Limit on the client request body size
    client_max_body_size 10M;

    location / {
        proxy_pass http://master_backend;

        # Pass request headers through to the backend
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts (these influence what the passive check counts as a failure)
        proxy_connect_timeout 3s;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;

        # Required for upstream keepalive connections
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Retry on failure
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 10s;
    }

    # Standalone health endpoint (optional, for external probes)
    location /nginx-health {
        access_log off;
        # default_type sets the Content-Type of the response generated by `return`;
        # add_header would emit a second Content-Type header instead of replacing it
        default_type text/plain;
        return 200 "nginx ok\n";
    }
}

Configuration on 192.168.0.202:

nginx

# /etc/nginx/sites-available/default
upstream master_backend {
    # Co-located deployment: proxy to port 8080 on the local machine
    server 127.0.0.1:8080 max_fails=2 fail_timeout=10s;

    # Other Master nodes
    server 192.168.0.201:8080 max_fails=2 fail_timeout=10s;
    server 192.168.0.203:8080 max_fails=2 fail_timeout=10s backup;

    # Keep connections to the upstream alive for better performance
    keepalive 32;
}

server {
    listen 80;
    server_name _;

    # Limit on the client request body size
    client_max_body_size 10M;

    location / {
        proxy_pass http://master_backend;

        # Pass request headers through to the backend
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts (these influence what the passive check counts as a failure)
        proxy_connect_timeout 3s;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;

        # Required for upstream keepalive connections
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Retry on failure
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 10s;
    }

    # Standalone health endpoint (optional, for external probes)
    location /nginx-health {
        access_log off;
        # default_type sets the Content-Type of the response generated by `return`;
        # add_header would emit a second Content-Type header instead of replacing it
        default_type text/plain;
        return 200 "nginx ok\n";
    }
}

Configuration on 192.168.0.203:

nginx

# /etc/nginx/sites-available/default
upstream master_backend {
    # Co-located deployment: proxy to port 8080 on the local machine
    server 127.0.0.1:8080 max_fails=2 fail_timeout=10s;

    # Other Master nodes
    server 192.168.0.201:8080 max_fails=2 fail_timeout=10s;
    server 192.168.0.202:8080 max_fails=2 fail_timeout=10s backup;

    # Keep connections to the upstream alive for better performance
    keepalive 32;
}

server {
    listen 80;
    server_name _;

    # Limit on the client request body size
    client_max_body_size 10M;

    location / {
        proxy_pass http://master_backend;

        # Pass request headers through to the backend
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts (these influence what the passive check counts as a failure)
        proxy_connect_timeout 3s;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;

        # Required for upstream keepalive connections
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Retry on failure
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 10s;
    }

    # Standalone health endpoint (optional, for external probes)
    location /nginx-health {
        access_log off;
        # default_type sets the Content-Type of the response generated by `return`;
        # add_header would emit a second Content-Type header instead of replacing it
        default_type text/plain;
        return 200 "nginx ok\n";
    }
}
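
Since the per-host files differ only in the upstream block, that block can be generated rather than edited by hand. The following is a minimal sketch, not part of the original setup: the function name `render_upstream` and the `NODES` variable are hypothetical, and the rule that the last remaining peer becomes the backup is inferred from the three configurations in this section.

```bash
# Hypothetical helper: print the upstream block for one node.
# NODES holds the three Master hosts used in this chapter.
NODES="192.168.0.201 192.168.0.202 192.168.0.203"

render_upstream() (
    self="$1"
    peers=""
    # Collect every node except the one we are rendering for.
    for ip in $NODES; do
        if [ "$ip" != "$self" ]; then
            peers="$peers $ip"
        fi
    done
    # Assumption: the last remaining peer is marked as backup.
    last="${peers##* }"

    echo "upstream master_backend {"
    echo "    server 127.0.0.1:8080 max_fails=2 fail_timeout=10s;"
    for ip in $peers; do
        suffix=""
        if [ "$ip" = "$last" ]; then
            suffix=" backup"
        fi
        echo "    server ${ip}:8080 max_fails=2 fail_timeout=10s${suffix};"
    done
    echo "    keepalive 32;"
    echo "}"
)
```

For example, `render_upstream 192.168.0.202` prints the block for the second machine. The output would still need to be merged into /etc/nginx/sites-available/default by hand or by a deployment tool.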

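4.2 Applying the Configuration

Run the following command to check that the configuration is valid:

bash

sudo nginx -t
# Expected output:
# nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
# nginx: configuration file /etc/nginx/nginx.conf test is successful

Run the following command to reload gracefully (the configuration is updated without downtime):

bash

sudo nginx -s reload

# Or use systemctl
sudo systemctl reload nginx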
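
The test and the reload are usually chained so that a broken configuration is never reloaded. A minimal sketch, assuming a hypothetical helper named `reload_if_valid` that receives the test command and the reload command as arguments (which also lets it be tried out without touching Nginx):

```bash
# Hypothetical helper: run the config test first, reload only if it passes.
# $1 = command that validates the configuration
# $2 = command that reloads the service
reload_if_valid() {
    test_cmd="$1"
    reload_cmd="$2"
    # The commands are intentionally unquoted so that "sudo nginx -t"
    # is split into a command and its arguments.
    if $test_cmd; then
        $reload_cmd
    else
        echo "config test failed, reload skipped" >&2
        return 1
    fi
}
```

On a real node it would be called as `reload_if_valid "sudo nginx -t" "sudo systemctl reload nginx"`.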

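4.3 Parameter Details

How passive checking works: Nginx counts failed attempts for each server. Once a server accumulates `max_fails` failures within `fail_timeout`, Nginx stops sending it requests for the next `fail_timeout`; after that it lets a real request through again, and a success returns the server to normal rotation.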
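
The interplay of `max_fails` and `fail_timeout` can be illustrated with a small model. This is a simplified sketch of the documented behavior for a single server, not Nginx's actual implementation; the function and variable names are made up for illustration, and timestamps are plain integer seconds.

```bash
MAX_FAILS=2        # same value as max_fails in this chapter
FAIL_TIMEOUT=10    # seconds, same value as fail_timeout

fails=0            # failures counted in the current window
window_start=0     # time of the first failure in the window
down_until=0       # the server is skipped before this time

# Succeeds if the server may receive a request at time $1.
is_available() {
    [ "$1" -ge "$down_until" ]
}

# Records a failed attempt at time $1.
record_failure() {
    # Start a new window if none is open or the old one has expired.
    if [ "$fails" -eq 0 ] || [ $(( $1 - window_start )) -ge "$FAIL_TIMEOUT" ]; then
        fails=0
        window_start="$1"
    fi
    fails=$(( fails + 1 ))
    # Reaching MAX_FAILS inside the window takes the server out of rotation.
    if [ "$fails" -ge "$MAX_FAILS" ]; then
        down_until=$(( $1 + FAIL_TIMEOUT ))
        fails=0
    fi
}

# A successful response clears the failure count.
record_success() {
    fails=0
}
```

With these values, failures at t=0 and t=1 make `is_available 5` fail, while `is_available 11` succeeds again. Two failures that are 20 seconds apart never take the server out of rotation, because the second one falls outside the window.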

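How to read the proxy_next_upstream setting:

nginx

proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;

Condition	Meaning	Triggers a retry
`error`	An error occurred while connecting to the backend, sending the request, or reading the response header	✅
`timeout`	A timeout occurred while connecting, sending the request, or reading the response header	✅
`invalid_header`	The backend returned an empty or invalid response	✅
`http_500`	The backend returned 500 Internal Server Error	✅
`http_502`	The backend returned 502 Bad Gateway	✅
`http_503`	The backend returned 503 Service Unavailable	✅
`http_404`	The backend returned 404 (not listed in this directive, so no retry)	❌

`error`, `timeout`, and `invalid_header` always count toward `max_fails`. `http_500`, `http_502`, and `http_503` count only because they are listed in the directive, and a 404 never counts as a failed attempt.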

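5. Verifying Nginx Passive Checks

Before deploying Keepalived, we first verify on its own that Nginx's passive health checking works as expected.

5.1 Make Sure the Master Service Is Running

bash

# On each of the three machines, confirm the gateway service is running
sudo systemctl status gateway

# If it is not running, start it
sudo systemctl start gateway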

5.2 Simulate a Master Service Failure

On node1 (192.168.0.201), stop the Master service:

bash

sudo systemctl stop gateway

5.3 Send Requests to Trigger the Passive Check

bash

# Test locally on node1 (192.168.0.201)
# Nginx proxies to 127.0.0.1:8080, so hitting the local Nginx is enough

for i in {1..5}; do
    curl -X POST http://127.0.0.1/api/v1/task \
         -H "Content-Type: application/json" \
         -d '{"test": "failover"}' \
         -w "\nHTTP Status: %{http_code}\n"
    sleep 1
done
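
When the loop is run many times, a per-status tally is easier to read than the raw output. A minimal sketch, assuming a hypothetical helper `summarize_statuses` that reads one HTTP status code per line from standard input:

```bash
# Hypothetical helper: count how often each status code occurs.
# Input:  one status code per line on stdin
# Output: "<code> x<count>" per distinct code, sorted by code
summarize_statuses() {
    sort | uniq -c | awk '{ printf "%s x%d\n", $2, $1 }'
}

# Self-contained demonstration with made-up codes:
printf '200\n502\n200\n' | summarize_statuses
# prints:
# 200 x2
# 502 x1

# Example with real traffic (not run here):
#   for i in 1 2 3 4 5; do
#       curl -s -o /dev/null -w '%{http_code}\n' -X POST http://127.0.0.1/api/v1/task \
#            -H "Content-Type: application/json" -d '{"test": "failover"}'
#   done | summarize_statuses
```

Here `-s` silences curl's progress output and `-o /dev/null` discards the response body, so only the status code written by `-w` reaches the tally.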

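5.4 Check the Nginx Error Log

bash

# On node1, follow the Nginx error log
sudo tail -f /var/log/nginx/error.log

# Expect entries similar to:
# [error] connect() failed (111: Connection refused) while connecting to upstream
# [error] upstream timed out (110: Connection timed out) while connecting to upstream
# [error] no live upstreams while connecting to upstream
#
# A stopped service normally yields "Connection refused"; "timed out" appears when the
# backend does not answer at all, and "no live upstreams" only when every server
# in the upstream is unavailable.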

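5.5 Restore the Master Service and Verify

bash

# Restore the Master service
sudo systemctl start gateway

# Send a request again and watch the node become available
# (allow up to fail_timeout, 10s here, before Nginx tries it again)
curl -X POST http://127.0.0.1/api/v1/task \
     -H "Content-Type: application/json" \
     -d '{"test": "recovery"}'

# Expected: the request succeeds and returns 200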

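5.6 Verification Results

Test step	Expected result	Passed?
Stop the Master service	The local Master on 127.0.0.1:8080 is unavailable	Yes
Send requests (1st and 2nd failures)	Nginx logs a connection error or timeout for 127.0.0.1:8080 and retries on another server	Yes
Send requests (after the 2nd failure)	Nginx skips 127.0.0.1:8080 without trying it; it returns an error directly only if no other server is live	Yes
Restore the Master service	The node becomes available again once `fail_timeout` has elapsed	Yes

💡 Note: this chapter verifies against 127.0.0.1 (the local Nginx) and does not depend on Keepalived. The VIP is verified in Part 5, after Keepalived has been deployed.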

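6. Common Questions and Solutions

Q1: Does open-source Nginx really have no active health checks?

Correct, it does not. Open-source Nginx supports only passive health checks. If you need active checks, consider:

Buying Nginx Plus
Using OpenResty (an enhanced distribution built on Nginx)
Using the third-party module nginx_upstream_check_module (requires recompiling Nginx)
Switching to HAProxy (which supports active checks natively)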

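Q2: How should `max_fails` and `fail_timeout` be set?

That depends on how much failure your workload can tolerate:

Scenario	max_fails	fail_timeout	Notes
Strict high-availability requirements	1	5s	Fast removal, but more prone to false positives
Typical production environment	2	10s	Recommended; balances speed and stability
Unstable network	3	30s	Avoids frequent flapping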

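Q3: How do I make Nginx avoid a failed node when retrying?

The combination of proxy_next_upstream and max_fails handles this automatically. Once a node's failures reach max_fails, Nginx stops forwarding requests to it for the duration of fail_timeout.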

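Q4: How do the Nginx and Keepalived health checks differ?

Component	What it checks	Action on failure
Nginx	Backend Master service	Temporarily removes the node; the VIP stays on the local machine
Keepalived	Nginx + Master service	Moves the VIP to another node

Together they form a complete chain of failure protection. Keepalived configuration is covered in full in Part 5.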

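Q5: Do all three machines need Nginx configured?

Yes. The VIP can move to any of the three machines, so each one needs Nginx installed with an equivalent configuration (identical apart from the upstream peer addresses shown in Section 4.1). That way Nginx works no matter which machine holds the VIP.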

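Q6: How do I keep the Nginx configuration consistent across the three machines?

We recommend syncing the configuration with a configuration management tool such as Ansible, or manually performing the same steps on all three machines. A one-click deployment script is provided in a later chapter of this series.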
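
Because the upstream block is expected to differ between machines, a plain `diff` of the whole file always reports a difference. A minimal sketch of a comparison that ignores that block, assuming the layout used in this chapter (the block starts with a line beginning `upstream ` and ends at the first line beginning with `}`); the helper names are hypothetical:

```bash
# Hypothetical helper: print a config file without its upstream block.
# Deletes everything from a line starting with "upstream " through the
# next line that starts with "}".
strip_upstream() {
    sed '/^upstream /,/^}/d' "$1"
}

# Succeeds when two config files are identical outside the upstream block.
same_outside_upstream() {
    a="$(strip_upstream "$1")"
    b="$(strip_upstream "$2")"
    [ "$a" = "$b" ]
}
```

After copying each node's file to one place (for example with scp), `same_outside_upstream node1.conf node2.conf` returns success only when everything except the upstream block matches.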

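7. Summary

This chapter completed the Nginx upstream health check and circuit breaking configuration. Key points:

Environment preparation: install Nginx on all three machines and confirm the service is running
Passive checks: max_fails + fail_timeout remove failed nodes automatically
Timeouts: proxy_connect_timeout and proxy_read_timeout affect how failures are detected
Retry policy: proxy_next_upstream defines which errors trigger a retry
Standalone verification: the passive check mechanism is verified on its own, without Keepalived

In the next chapter we deploy Keepalived and configure VIP failover, so that the high-availability entry point really comes to life.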

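💡 This article is Part 4 of the series "Distributed High-Availability Entry Architecture in Practice"
Previous: Building a Web Service with Health Checks in Golang + Gin
Next: Keepalived Deployment and VIP Failover Configuration Guide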
