Gateway Burned Me! A 404 Problem That Dragged On for a Year?

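A colleague recently asked me to help track down a "weird" bug that had been bothering their team for more than a year without a fix. After taking it over, I spent some time locating the root cause, and today I'd like to share the troubleshooting process and the solution.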

Problem Description

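My colleague's setup was Spring Cloud Gateway 3.0.1 + JDK 8, integrated with Nacos for dynamic route configuration. The problem: every time the route configuration in Nacos was modified, API requests through the gateway started returning 404 errors, and restarting the gateway brought everything back to normal.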

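My first reaction on hearing this was that the gateway's cached data probably wasn't being refreshed after the Nacos configuration changed. With that hypothesis in mind, I started digging.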

Environment Setup

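First I prepared three backend service instances on ports 8103, 12040, and 12041, and configured the corresponding gateway routes in Nacos: xiaofu-8103, xiaofu-12040, and xiaofu-12041. All three were placed in the same weight group, xiaofu-group, to get weight-based load balancing.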

    - id: xiaofu-8103
      uri: http://127.0.0.1:8103/
      predicates:
        - Weight=xiaofu-group, 2
        - Path=/test/version1/**
      filters:
        - RewritePath=/test/version1/(?<segment>.*),/$\{segment}
    - id: xiaofu-12040
      uri: http://127.0.0.1:12040/
      predicates:
        - Weight=xiaofu-group, 1
        - Path=/test/version1/**
      filters:
        - RewritePath=/test/version1/(?<segment>.*),/$\{segment}
    - id: xiaofu-12041
      uri: http://127.0.0.1:12041/
      predicates:
        - Weight=xiaofu-group, 2
        - Path=/test/version1/**
      filters:
        - RewritePath=/test/version1/(?<segment>.*),/$\{segment}
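
To make the expected traffic split concrete, here is a minimal sketch of weight-based selection in plain Java. It is an illustration only, not the gateway's actual code: the class WeightRangeSketch and its methods are hypothetical names, and it assumes the weights are normalized into cumulative ranges and that a random number in [0, 1) picks the route. With weights 2, 1, and 2, the three routes should receive roughly 40%, 20%, and 40% of the requests.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration; not part of Spring Cloud Gateway.
class WeightRangeSketch {

    /** Cumulative range boundaries, e.g. weights 2, 1, 2 give [0.0, 0.4, 0.6, 1.0]. */
    static List<Double> ranges(Map<String, Integer> weights) {
        int total = 0;
        for (int w : weights.values()) {
            total += w;
        }
        List<Double> bounds = new ArrayList<>();
        bounds.add(0.0);
        int cumulative = 0;
        for (int w : weights.values()) {
            cumulative += w;
            bounds.add((double) cumulative / total);
        }
        return bounds;
    }

    /** Returns the route whose range [lower, upper) contains r, where 0 <= r < 1. */
    static String choose(Map<String, Integer> weights, double r) {
        List<Double> bounds = ranges(weights);
        int i = 0;
        for (String routeId : weights.keySet()) {
            if (r >= bounds.get(i) && r < bounds.get(i + 1)) {
                return routeId;
            }
            i++;
        }
        throw new IllegalArgumentException("r must be in [0, 1): " + r);
    }

    public static void main(String[] args) {
        // Same weights as the three routes in xiaofu-group.
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("xiaofu-8103", 2);
        weights.put("xiaofu-12040", 1);
        weights.put("xiaofu-12041", 2);

        // Count how often each route is chosen over many random draws.
        Map<String, Integer> hits = new LinkedHashMap<>();
        Random random = new Random();
        for (int n = 0; n < 10_000; n++) {
            hits.merge(choose(weights, random.nextDouble()), 1, Integer::sum);
        }
        System.out.println("ranges = " + ranges(weights));
        System.out.println("hits   = " + hits);
    }
}
```

The point to take away is that every route ID present in the weight table owns a slice of [0, 1), so a route that stays in the table keeps receiving its share of traffic.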

I used JMeter to send requests continuously, and added a random number to each request's parameters to make the logs easier to trace.
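
For readers without JMeter at hand, a rough stand-in for this request loop might look like the sketch below. Everything specific in it is an assumption for illustration: the class name GatewayLoadLoop, the /test/version1/hello endpoint, the nonce parameter name, and the gateway base URL passed on the command line do not come from the original setup.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical stand-in for the JMeter test plan.
class GatewayLoadLoop {

    /** Builds a request URL with a random "nonce" parameter so each call is easy to find in the logs. */
    static String requestUrl(String gatewayBase, long nonce) {
        // "/test/version1/hello" is an assumed endpoint matching the Path predicate.
        return gatewayBase + "/test/version1/hello?nonce=" + nonce;
    }

    /** Sends one GET and returns the HTTP status code, or -1 if the request could not be made. */
    static int send(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(1000);
            conn.setReadTimeout(2000);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length == 0) {
            System.out.println("usage: java GatewayLoadLoop.java <gateway-base-url>");
            return;
        }
        // Tally responses by status code so a burst of 404s stands out.
        Map<Integer, Integer> statusCounts = new TreeMap<>();
        for (int i = 0; i < 100; i++) {
            long nonce = ThreadLocalRandom.current().nextLong(1_000_000);
            statusCounts.merge(send(requestUrl(args[0], nonce)), 1, Integer::sum);
            Thread.sleep(100);
        }
        System.out.println("status counts: " + statusCounts);
    }
}
```

Tallying by status code makes it easy to see the moment 404 responses start appearing while the route configuration is being edited.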

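With everything in place, I started the JMeter loop and saw log output from all three instances, which showed that the gateway's load balancing was working.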

Troubleshooting

To get more detailed logs, I set the gateway's log level to TRACE.

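With JMeter running, I made arbitrary changes to the route attributes of the three instances (uri, port, predicates, filters). No requests failed, and the gateway console showed the updated route attributes, so the Nacos configuration changes were reaching the gateway.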

Next I tried removing one instance, xiaofu-12041. JMeter requests immediately started returning 404 errors. Problem reproduced!

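Looking at the gateway console logs, I was surprised to find that the route configuration for the deleted instance xiaofu-12041 was still there, and was even being chosen to handle requests.

That was the root of the problem: although the instance's route configuration had been deleted in Nacos, the gateway was still using the old route data when it actually load-balanced requests.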

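Digging further, I found that the route weight information (Weights attr) also still contained the old route data.

At this point the problem was basically confirmed: the gateway was using stale cached data when calculating instance weights and load-balancing.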

Source Code Analysis

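Reading the source, I found a filter dedicated to weight calculation, WeightCalculatorWebFilter. Internally it maintains a groupWeights variable that stores the route weight information.

When a configuration change event occurs, it calls addWeightConfig(WeightConfig weightConfig) to add the weight configuration.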

    @Override
    public void onApplicationEvent(ApplicationEvent event) {
        if (event instanceof PredicateArgsEvent) {
            handle((PredicateArgsEvent) event);
        }
        else if (event instanceof WeightDefinedEvent) {
            addWeightConfig(((WeightDefinedEvent) event).getWeightConfig());
        }
        else if (event instanceof RefreshRoutesEvent && routeLocator != null) {
            if (routeLocatorInitialized.compareAndSet(false, true)) {
                routeLocator.ifAvailable(locator -> locator.getRoutes().blockLast());
            }
            else {
                routeLocator.ifAvailable(locator -> locator.getRoutes().subscribe());
            }
        }
    }

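The comment on addWeightConfig says it outright: the method only creates a new GroupWeightConfig rather than modifying the existing one. Note, though, that when the group already exists, the new GroupWeightConfig is built as a copy of the old one.

This means it can only add or overwrite route weights; it has no way to clean up the weight information of routes that have been deleted.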

    void addWeightConfig(WeightConfig weightConfig) {
        String group = weightConfig.getGroup();
        GroupWeightConfig config;
        // only create new GroupWeightConfig rather than modify
        // and put at end of calculations. This avoids concurency problems
        // later during filter execution.
        if (groupWeights.containsKey(group)) {
            config = new GroupWeightConfig(groupWeights.get(group));
        }
        else {
            config = new GroupWeightConfig(group);
        }

        final AtomicInteger index = new AtomicInteger(0);
        // ... omitted ...

        if (log.isTraceEnabled()) {
            log.trace("Recalculated group weight config " + config);
        }
        // only update after all calculations
        groupWeights.put(group, config);
    }
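
The effect of this add-only pattern can be reproduced with a much simplified model. The sketch below is a hypothetical stand-in, not the real filter: AddOnlyWeightCacheSketch and weightsOf are made-up names, and a plain map of route ID to weight replaces GroupWeightConfig. It assumes, as in the scenario above, that after a refresh only the routes still present in the configuration are announced again.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical model of the add-only cache; not the real filter.
class AddOnlyWeightCacheSketch {

    // group -> (routeId -> weight), standing in for groupWeights
    private final Map<String, Map<String, Integer>> groupWeights = new ConcurrentHashMap<>();

    /** Copies the previous group config, adds or overwrites one route, then swaps the copy in. */
    void addWeightConfig(String group, String routeId, int weight) {
        Map<String, Integer> previous = groupWeights.get(group);
        Map<String, Integer> config = (previous != null)
                ? new LinkedHashMap<>(previous)
                : new LinkedHashMap<>();
        config.put(routeId, weight);
        groupWeights.put(group, config);
    }

    Map<String, Integer> weightsOf(String group) {
        return groupWeights.getOrDefault(group, new LinkedHashMap<>());
    }

    public static void main(String[] args) {
        AddOnlyWeightCacheSketch cache = new AddOnlyWeightCacheSketch();

        // Initial load: three routes in the same weight group.
        cache.addWeightConfig("xiaofu-group", "xiaofu-8103", 2);
        cache.addWeightConfig("xiaofu-group", "xiaofu-12040", 1);
        cache.addWeightConfig("xiaofu-group", "xiaofu-12041", 2);

        // Refresh after xiaofu-12041 was deleted: only the two remaining routes are re-added.
        cache.addWeightConfig("xiaofu-group", "xiaofu-8103", 2);
        cache.addWeightConfig("xiaofu-group", "xiaofu-12040", 1);

        // Prints {xiaofu-8103=2, xiaofu-12040=1, xiaofu-12041=2}: the deleted route is still there.
        System.out.println(cache.weightsOf("xiaofu-group"));
    }
}
```

Because nothing in this flow ever removes a key, the deleted route keeps its weight and can still be selected.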

Solution

With the root cause found, the solution became clear.

At first I suspected a Spring Cloud Gateway version issue and upgraded to 4.1.0, but the problem was still there.

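So it looked like the only option was to update the cache manually: listen for the Nacos route configuration change event, fetch the latest route configuration, and update the weight data in groupWeights.

Here is the code for the solution I implemented: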

    import java.lang.reflect.Field;
    import java.lang.reflect.Method;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.stream.Collectors;

    import lombok.extern.slf4j.Slf4j;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.cloud.gateway.event.RefreshRoutesEvent;
    import org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter;
    import org.springframework.cloud.gateway.route.RouteDefinitionLocator;
    import org.springframework.cloud.gateway.support.WeightConfig;
    import org.springframework.context.ApplicationEventPublisher;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.context.event.EventListener;

    @Slf4j
    @Configuration
    public class WeightCacheRefresher {

        @Autowired
        private WeightCalculatorWebFilter weightCalculatorWebFilter;

        @Autowired
        private RouteDefinitionLocator routeDefinitionLocator;

        @Autowired
        private ApplicationEventPublisher publisher;

        /**
         * Listens for route refresh events and syncs the weight cache.
         */
        @EventListener(RefreshRoutesEvent.class)
        public void onRefreshRoutes() {
            log.info("Route refresh event detected, preparing to sync the weight cache");
            syncWeightCache();
        }

        /**
         * Syncs the weight cache with the current route configuration.
         */
        public void syncWeightCache() {
            try {
                // Get the private groupWeights field via reflection
                Field groupWeightsField = WeightCalculatorWebFilter.class.getDeclaredField("groupWeights");
                groupWeightsField.setAccessible(true);

                // Read the current groupWeights value
                @SuppressWarnings("unchecked")
                Map<String, Object> groupWeights =
                        (Map<String, Object>) groupWeightsField.get(weightCalculatorWebFilter);

                if (groupWeights == null) {
                    log.warn("groupWeights cache not found");
                    return;
                }

                log.info("Current groupWeights cache: {}", groupWeights.keySet());

                // Collect the weight groups and route IDs of all current routes
                final Set<String> currentRouteIds = new HashSet<>();
                final Map<String, Map<String, Integer>> currentGroupRouteWeights = new HashMap<>();

                routeDefinitionLocator.getRouteDefinitions()
                        .collectList()
                        .subscribe(definitions -> {
                            definitions.forEach(def -> {
                                currentRouteIds.add(def.getId());

                                def.getPredicates().stream()
                                        .filter(predicate -> predicate.getName().equals("Weight"))
                                        .forEach(predicate -> {
                                            Map<String, String> args = predicate.getArgs();
                                            String group = args.getOrDefault("_genkey_0", "unknown");
                                            int weight = Integer.parseInt(args.getOrDefault("_genkey_1", "0").trim());

                                            // Record each route currently in the group and its weight
                                            currentGroupRouteWeights
                                                    .computeIfAbsent(group, k -> new HashMap<>())
                                                    .put(def.getId(), weight);
                                        });
                            });

                            log.info("Route IDs in the current route configuration: {}", currentRouteIds);
                            log.info("Weight groups in the current route configuration: {}", currentGroupRouteWeights);

                            // Check every weight group: remove routes that no longer exist
                            // and mark groups whose weights changed
                            Set<String> groupsToRemove = new HashSet<>();
                            Set<String> groupsToUpdate = new HashSet<>();

                            for (String group : groupWeights.keySet()) {
                                if (!currentGroupRouteWeights.containsKey(group)) {
                                    // The whole weight group no longer exists
                                    groupsToRemove.add(group);
                                    log.info("Weight group [{}] no longer exists in the route configuration and will be removed", group);
                                    continue;
                                }

                                // Route IDs and weights currently configured for this group
                                Map<String, Integer> configuredRouteWeights = currentGroupRouteWeights.get(group);

                                // Cached weight configuration for this group
                                Object groupWeightConfig = groupWeights.get(group);

                                try {
                                    // Get the weights field
                                    Field weightsField = groupWeightConfig.getClass().getDeclaredField("weights");
                                    weightsField.setAccessible(true);

                                    @SuppressWarnings("unchecked")
                                    LinkedHashMap<String, Integer> weights =
                                            (LinkedHashMap<String, Integer>) weightsField.get(groupWeightConfig);

                                    // Route IDs that need to be removed
                                    Set<String> routesToRemove = weights.keySet().stream()
                                            .filter(routeId -> !configuredRouteWeights.containsKey(routeId))
                                            .collect(Collectors.toSet());

                                    // Route IDs whose weight has changed
                                    Set<String> routesWithWeightChange = new HashSet<>();
                                    for (Map.Entry<String, Integer> entry : weights.entrySet()) {
                                        String routeId = entry.getKey();
                                        Integer cachedWeight = entry.getValue();

                                        if (configuredRouteWeights.containsKey(routeId)) {
                                            Integer configuredWeight = configuredRouteWeights.get(routeId);
                                            if (!cachedWeight.equals(configuredWeight)) {
                                                routesWithWeightChange.add(routeId);
                                                log.info("Weight of route [{}] changed from {} to {}",
                                                        routeId, cachedWeight, configuredWeight);
                                            }
                                        }
                                    }

                                    // Newly added route IDs
                                    Set<String> newRoutes = configuredRouteWeights.keySet().stream()
                                            .filter(routeId -> !weights.containsKey(routeId))
                                            .collect(Collectors.toSet());

                                    if (!routesToRemove.isEmpty() || !routesWithWeightChange.isEmpty() || !newRoutes.isEmpty()) {
                                        log.info("Weight group [{}] has changes: removed {}, weight changed {}, added {}",
                                                group, routesToRemove, routesWithWeightChange, newRoutes);

                                        // On any change, recalculate the weights of the whole group
                                        groupsToUpdate.add(group);
                                    }

                                    // First, remove the routes that need to be deleted
                                    for (String routeId : routesToRemove) {
                                        weights.remove(routeId);
                                    }

                                    // If no routes remain in the group, remove the whole group
                                    if (weights.isEmpty()) {
                                        groupsToRemove.add(group);
                                        log.info("No routes left in weight group [{}], the whole group will be removed", group);
                                    }
                                }
                                catch (Exception e) {
                                    log.error("Error while processing weight group [{}]", group, e);
                                }
                            }

                            // Remove weight groups that are no longer needed
                            for (String group : groupsToRemove) {
                                groupWeights.remove(group);
                                log.info("Removed weight group: {}", group);
                            }

                            // Rebuild the weight groups that need recalculation
                            for (String group : groupsToUpdate) {
                                try {
                                    // Route IDs and weights currently configured for this group
                                    Map<String, Integer> configuredRouteWeights = currentGroupRouteWeights.get(group);

                                    // Remove the old weight group configuration
                                    groupWeights.remove(group);
                                    log.info("Removed weight group [{}] so it can be recalculated", group);

                                    // Create a WeightConfig for each route and call addWeightConfig
                                    Method addWeightConfigMethod = WeightCalculatorWebFilter.class
                                            .getDeclaredMethod("addWeightConfig", WeightConfig.class);
                                    addWeightConfigMethod.setAccessible(true);

                                    for (Map.Entry<String, Integer> entry : configuredRouteWeights.entrySet()) {
                                        String routeId = entry.getKey();
                                        Integer weight = entry.getValue();

                                        WeightConfig weightConfig = new WeightConfig(routeId);
                                        weightConfig.setGroup(group);
                                        weightConfig.setWeight(weight);

                                        addWeightConfigMethod.invoke(weightCalculatorWebFilter, weightConfig);
                                        log.info("Added weight config for route [{}]: group [{}], weight {}",
                                                routeId, group, weight);
                                    }
                                }
                                catch (Exception e) {
                                    log.error("Error while recalculating weight group [{}]", group, e);
                                }
                            }

                            log.info("Weight cache sync complete, cached weight groups: {}", groupWeights.keySet());
                        });
            }
            catch (Exception e) {
                log.error("Failed to sync the weight cache", e);
            }
        }
    }

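With this in place, every update to the Nacos route configuration triggers the change event, and the local route weight data is then updated from the latest instance data.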
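
Stripped of the reflection and Spring wiring, the core of this fix is a reconciliation step: make the cache match the current configuration instead of only adding to it. The sketch below shows that idea in plain Java under simplifying assumptions. WeightCacheReconcileSketch and reconcile are hypothetical names, and plain maps of route ID to weight stand in for the gateway's internal types.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified illustration of the reconciliation idea.
class WeightCacheReconcileSketch {

    /**
     * Makes the cache match the current configuration: groups that no longer
     * exist are dropped, and every remaining group is rebuilt from the current
     * routes so that deleted routes cannot linger.
     */
    static void reconcile(Map<String, Map<String, Integer>> cache,
                          Map<String, Map<String, Integer>> current) {
        cache.keySet().retainAll(current.keySet());
        current.forEach((group, weights) -> cache.put(group, new LinkedHashMap<>(weights)));
    }

    public static void main(String[] args) {
        // Stale cache: still holds the deleted route xiaofu-12041.
        Map<String, Map<String, Integer>> cache = new ConcurrentHashMap<>();
        Map<String, Integer> stale = new LinkedHashMap<>();
        stale.put("xiaofu-8103", 2);
        stale.put("xiaofu-12040", 1);
        stale.put("xiaofu-12041", 2);
        cache.put("xiaofu-group", stale);

        // Current configuration after the deletion.
        Map<String, Map<String, Integer>> current = new LinkedHashMap<>();
        Map<String, Integer> live = new LinkedHashMap<>();
        live.put("xiaofu-8103", 2);
        live.put("xiaofu-12040", 1);
        current.put("xiaofu-group", live);

        reconcile(cache, current);

        // Prints {xiaofu-group={xiaofu-8103=2, xiaofu-12040=1}}
        System.out.println(cache);
    }
}
```

Rebuilding each group wholesale is simpler than diffing individual routes, at the cost of recomputing weights for groups that did not change.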

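I searched around online and couldn't find any official fix or guidance. Maybe we're just using it the wrong way; otherwise such an obvious bug would surely have been fixed by someone long ago!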

THE END