场景描述:

今天收到线上一个服务的instance deregistered的告警邮件,在springboot admin上确实那个instance不在线了。于是去服务器上jps看了一下,进程正常的,但是/health端点信息显示eureka为down。查看error.log,并没有错误日志,查看warn.log,发现里面有些许的警告日志如下:

[2018-09-20T19:12:24.159+08:00] WARN: [TID: N/A] [Apollo-Config-1] DiscoveryClient.java:1291 - Saw local status change event StatusChangeEvent [timestamp=1537441944159, current=DOWN, previous=UP]
[2018-09-20T19:12:24.160+08:00] WARN: [TID: N/A] [Apollo-Config-1] DiscoveryClient.java:1291 - Saw local status change event StatusChangeEvent [timestamp=1537441944160, current=UP, previous=DOWN]
[2018-09-20T19:12:24.160+08:00] WARN: [TID: N/A] [DiscoveryClient-InstanceInfoReplicator-0] DiscoveryClient.java:1291 - Saw local status change event StatusChangeEvent [timestamp=1537441944160, current=DOWN, previous=UP]
[2018-09-20T19:12:24.160+08:00] WARN: [TID: N/A] [DiscoveryClient-InstanceInfoReplicator-0] InstanceInfoReplicator.java:98 - Ignoring onDemand update due to rate limiter

 

打开InstanceInfoReplicator.java查看了一下源代码,发现是由于单位分钟内向eureka server同步次数超过限制,导致有一个StatusChangeEvent被忽略掉了。

从上面的警告日志也可以看出一些端倪,就是在很短的时间内,收到了2个StatusChangeEvent事件,一次是从UP-》DOWN,一次是从DOWN->UP。apollo自身也使用了eureka,所以apollo也处理了这个事件两次(第一条和第二条日志,在Apollo-Config线程组中).然后springcloud内部也处理了这个事件(在DiscoveryClient-InstanceInfoReplicator线程组中),但是只处理了一个从UP->DOWN的事件,另外一个从DOWN->UP的事件由于rate limiter限制导致被忽略了,最终导致这个instance的health永远是down状态。

以下是处理statusChangeEvent的代码片段,来自com.netflix.discovery.DiscoveryClient#initScheduledTasks方法:

 private void initScheduledTasks() {
        if (clientConfig.shouldFetchRegistry()) {
            // registry cache refresh timer
            int registryFetchIntervalSeconds = clientConfig.getRegistryFetchIntervalSeconds();
            int expBackOffBound = clientConfig.getCacheRefreshExecutorExponentialBackOffBound();
            scheduler.schedule(
                    new TimedSupervisorTask(
                            "cacheRefresh",
                            scheduler,
                            cacheRefreshExecutor,
                            registryFetchIntervalSeconds,
                            TimeUnit.SECONDS,
                            expBackOffBound,
                            new CacheRefreshThread()
                    ),
                    registryFetchIntervalSeconds, TimeUnit.SECONDS);
        }

        if (clientConfig.shouldRegisterWithEureka()) {
            int renewalIntervalInSecs = instanceInfo.getLeaseInfo().getRenewalIntervalInSecs();
            int expBackOffBound = clientConfig.getHeartbeatExecutorExponentialBackOffBound();
            logger.info("Starting heartbeat executor: " + "renew interval is: " + renewalIntervalInSecs);

            // Heartbeat timer
            scheduler.schedule(
                    new TimedSupervisorTask(
                            "heartbeat",
                            scheduler,
                            heartbeatExecutor,
                            renewalIntervalInSecs,
                            TimeUnit.SECONDS,
                            expBackOffBound,
                            new HeartbeatThread()
                    ),
                    renewalIntervalInSecs, TimeUnit.SECONDS);

            // InstanceInfo replicator
            instanceInfoReplicator = new InstanceInfoReplicator(
                    this,
                    instanceInfo,
                    clientConfig.getInstanceInfoReplicationIntervalSeconds(),
                    2); // burstSize

            statusChangeListener = new ApplicationInfoManager.StatusChangeListener() {
                @Override
                public String getId() {
                    return "statusChangeListener";
                }

                @Override
                public void notify(StatusChangeEvent statusChangeEvent) {
                    if (InstanceStatus.DOWN == statusChangeEvent.getStatus() ||
                            InstanceStatus.DOWN == statusChangeEvent.getPreviousStatus()) {
                        // log at warn level if DOWN was involved
                        logger.warn("Saw local status change event {}", statusChangeEvent);
                    } else {
                        logger.info("Saw local status change event {}", statusChangeEvent);
                    }
                    instanceInfoReplicator.onDemandUpdate();
                }
            };

            if (clientConfig.shouldOnDemandUpdateStatusChange()) {
                applicationInfoManager.registerStatusChangeListener(statusChangeListener);
            }

            instanceInfoReplicator.start(clientConfig.getInitialInstanceInfoReplicationIntervalSeconds());
        } else {
            logger.info("Not registering with Eureka server per configuration");
        }
    }

中的StatusChangeListener的notify方法。里面调用了instanceInfoReplicator.onDemandUpdate()方法,尝试通过令牌桶算法来限流:

 public boolean onDemandUpdate() {
        if (rateLimiter.acquire(burstSize, allowedRatePerMinute)) {
            if (!scheduler.isShutdown()) {
                scheduler.submit(new Runnable() {
                    @Override
                    public void run() {
                        logger.debug("Executing on-demand update of local InstanceInfo");
    
                        Future latestPeriodic = scheduledPeriodicRef.get();
                        if (latestPeriodic != null && !latestPeriodic.isDone()) {
                            logger.debug("Canceling the latest scheduled update, it will be rescheduled at the end of on demand update");
                            latestPeriodic.cancel(false);
                        }
    
                        InstanceInfoReplicator.this.run();
                    }
                });
                return true;
            } else {
                logger.warn("Ignoring onDemand update due to stopped scheduler");
                return false;
            }
        } else {
            logger.warn("Ignoring onDemand update due to rate limiter");
            return false;
        }
    }

警告日志中的最后一行日志就是来自这个方法。

源码里面burstSize写死了等于2,没法改变,allowedRatePerMinute可以通过:

eureka.client.instance-info-replication-interval-seconds

参数来改变,默认是30秒同步一次。

每分钟限制同步多少次是在InstanceInfoReplicator类的构造函数中计算的:

    InstanceInfoReplicator(DiscoveryClient discoveryClient, InstanceInfo instanceInfo, int replicationIntervalSeconds, int burstSize) {
        this.discoveryClient = discoveryClient;
        this.instanceInfo = instanceInfo;
        this.scheduler = Executors.newScheduledThreadPool(1,
                new ThreadFactoryBuilder()
                        .setNameFormat("DiscoveryClient-InstanceInfoReplicator-%d")
                        .setDaemon(true)
                        .build());

        this.scheduledPeriodicRef = new AtomicReference<Future>();

        this.started = new AtomicBoolean(false);
        this.rateLimiter = new RateLimiter(TimeUnit.MINUTES);
        this.replicationIntervalSeconds = replicationIntervalSeconds;
        this.burstSize = burstSize;

        this.allowedRatePerMinute = 60 * this.burstSize / this.replicationIntervalSeconds;
        logger.info("InstanceInfoReplicator onDemand update allowed rate per min is {}", allowedRatePerMinute);
    }

this.allowedRatePerMinute = 60* this.burstSize / this.replicationIntervalSeconds ,默认为: 60*2/30=4。所以导致了第四个StatusChangedEvent被丢弃了。

那我建议在和apollo一起使用时,将eureka.client.instance-info-replication-interval-seconds改成24,即每分钟最多可以执行60*2/24=5次复制instance信息的请求,这样就可能避免该问题的发生。

 

Logo

权威|前沿|技术|干货|国内首个API全生命周期开发者社区

更多推荐