内存泄漏Bug现场

一个做BI数据展示的服务在一个晚上重启了5次,由于是通过k8s容器编排,服务挂了以后会自动重启,所以服务还能继续提供服务。

第一时间先上日志系统查看错误日志,发现如下报错:

java.lang.OutOfMemoryError    ERROR    java.lang.OutOfMemoryError: unable to create new native thread 
at java.lang.Thread.start0(Native Method) 
at java.lang.Thread.start(Thread.java:717) 
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.start(CloseableHttpAsyncClientBase.java:83) 
at org.elasticsearch.client.RestClientBuilder.build(RestClientBuilder.java:190) 
at com.xiaohongshu.fls.jbds.handler.JbdsImpl.get_realtime_trend_data_from_es(JbdsImpl.java:637) 
at com.xiaohongshu.fls.jbds.handler.JbdsImpl.get_seller_trend(JbdsImpl.java:316) 
at com.xiaohongshu.fls.jbds.handler.JbdsImpl$$FastClassBySpringCGLIB$$bd2466f7.invoke(<generated>) 
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) 
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:738) 
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157) 
at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:85) 
at com.xiaohongshu.fls.jbds.aspect.RpcMethodExceptionAspect.requestControllerLog(RpcMethodExceptionAspect.java:33) 
at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
at java.lang.reflect.Method.invoke(Method.java:498) 
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:629) 
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:618) 
at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70) 
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) 
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92) 
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) 
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:673) 
at com.xiaohongshu.fls.jbds.handler.JbdsImpl$$EnhancerBySpringCGLIB$$d374b954.get_seller_trend(<generated>) 
at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
at java.lang.reflect.Method.invoke(Method.java:498) 
at com.xiaohongshu.infra.rpc.core.ThriftServerBaseProxy.call(ThriftServerBaseProxy.java:119) 
at com.xiaohongshu.infra.rpc.core.ThriftServerCGlibProxy.intercept(ThriftServerCGlibProxy.java:27) 
at com.xiaohongshu.fls.jbds.handler.JbdsImpl$$EnhancerByCGLIB$$19473b8f.get_seller_trend(<generated>) 
at com.xiaohongshu.fls.rpc.jbds.JBusinessDataService$Processor$get_seller_trend.getResult(JBusinessDataService.java:1450) 
at com.xiaohongshu.fls.rpc.jbds.JBusinessDataService$Processor$get_seller_trend.getResult(JBusinessDataService.java:1435) 
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) 
at org.apache.thrift.server.Invocation.run(Invocation.java:18) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) 

发现是OOM的错误,并且有unable to create new native thread的错误信息,笔者的第一直觉是创建了大量线程从而导致堆外内存空间不足。

随即去CAT监控系统上去查看线程的活跃情况,如下图:

alt text

发现服务的活跃线程数达到了1W左右,这显然有问题。服务本身是一个Thrift Server,开起的worker线程数为200,再加上一些其他IO线程,总的线程数不会超过300(服务内自己没有显示创建线程或者使用线程池)。

查找线索

笔者登录到线上的Docker实例上,通过jmap -histo:live pid命令,查看JVM中存活的对象,输出如下内容:

num     #instances         #bytes  class name
----------------------------------------------
   1:         19303      656995672  [B
   2:        116670       13942144  [C
   3:         15645        5298648  [I
   4:         10098        3796848  java.lang.Thread
   5:         75817        3639216  java.util.HashMap
   6:        113889        2733336  java.lang.String
   7:         27521        2421848  java.lang.reflect.Method
   8:         68662        2197184  java.util.concurrent.ConcurrentHashMap$Node
   9:         10041        1767032  [J
  10:           372        1525568  [Ljava.nio.ByteBuffer;
  11:         36807        1472280  java.util.LinkedHashMap$Entry
  12:         16504        1352488  [Ljava.util.HashMap$Node;
  13:         40687        1301984  java.lang.ThreadLocal$ThreadLocalMap$Entry
  14:         11495        1277344  java.lang.Class
  15:         29903        1196120  java.util.WeakHashMap$Entry
  16:         70474        1127584  java.lang.Object
  17:         15346         918592  [Ljava.lang.Object;
  18:         51556         824896  java.util.HashSet
  19:         25528         816896  java.util.HashMap$Node
  20:          9989         812176  [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
  21:          9661         792032  [Ljava.util.WeakHashMap$Entry;
  22:          9504         760320  org.apache.http.impl.nio.reactor.BaseIOReactor
  23:         30551         733224  java.util.concurrent.ConcurrentLinkedQueue$Node
  24:          9914         713808  sun.nio.ch.EPollArrayWrapper
  25:          9914         713808  sun.nio.ch.EPollSelectorImpl
  26:         29682         712368  java.util.concurrent.ConcurrentLinkedQueue
  27:          3525         694672  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  28:         15664         501248  java.lang.ref.WeakReference
  29:          8364         468384  java.util.LinkedHashMap
  30:          9656         463488  java.util.WeakHashMap
  31:         20064         436536  [Ljava.lang.Class;
  32:          9267         370680  java.security.AccessControlContext
  33:          4856         349632  java.lang.reflect.Field
  34:         10278         328896  java.lang.ref.ReferenceQueue
  35:          9914         317248  sun.nio.ch.AllocatedNativeObject
  36:          4955         317120  java.util.concurrent.ConcurrentHashMap
  37:          7597         303880  java.lang.ref.SoftReference
  38:          9245         293568  [Ljava.security.ProtectionDomain;
  39:         11484         275616  java.util.ArrayList
  40:          3330         239760  org.springframework.core.annotation.AnnotationAttributes
  41:          9989         239736  java.lang.ThreadLocal$ThreadLocalMap
  42:          9951         238824  java.util.Collections$SynchronizedSet
  43:          9922         238128  java.util.BitSet
  44:          9504         228096  org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker
  45:          5460         207944  [Ljava.lang.String;
  46:          2459         196720  java.lang.reflect.Constructor
  47:         12103         193648  java.util.HashMap$KeySet
  48:          8003         192072  java.beans.MethodRef
  49:          3657         175536  org.aspectj.weaver.reflect.ShadowMatchImpl
  50:         10339         165424  java.util.concurrent.atomic.AtomicBoolean
  51:         10281         164496  java.lang.ref.ReferenceQueue$Lock
  52:         10203         163248  java.util.Collections$UnmodifiableSet
  53:          2913         163128  java.beans.MethodDescriptor
  54:          2547         161536  [Ljava.lang.reflect.Method;
  55:          9914         158624  java.nio.channels.spi.AbstractSelector$1
  56:          9913         158608  sun.nio.ch.Util$3
  57:          5481         131544  org.springframework.core.MethodClassKey
  58:          3707         118624  java.util.LinkedList
  59:          3637         116384  org.aspectj.weaver.patterns.ExposedState
  60:          2047         114632  java.lang.Class$ReflectionData
  61:          1155         110880  org.springframework.beans.GenericTypeAwarePropertyDescriptor
  62:           399         109984  [Ljava.lang.Thread;
  63:          4387         105288  sun.reflect.generics.tree.SimpleClassTypeSignature
  64:          1388          99936  java.beans.PropertyDescriptor
  65:          4087          90560  [Ljava.lang.reflect.Type;
  66:          5361          85776  org.springframework.core.annotation.AnnotationUtils$DefaultValueHolder
  67:          1296          82944  io.netty.buffer.PoolSubpage
  68:          4387          82520  [Lsun.reflect.generics.tree.TypeArgument;
  69:           674          75488  org.springframework.boot.loader.jar.JarEntry
  70:          1848          73920  java.util.TreeMap$Entry
  71:          3061          73464  sun.reflect.annotation.AnnotationInvocationHandler
  72:          2737          65688  java.util.LinkedList$Node
  73:          3989          63824  sun.reflect.generics.tree.ClassTypeSignature
  74:           858          61776  org.apache.ibatis.mapping.ResultMapping
  75:           832          53248  org.springframework.core.MethodParameter
  76:            78          51168  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
  77:          2120          50880  java.util.Collections$UnmodifiableRandomAccessList
  78:          3165          50640  java.util.LinkedHashMap$LinkedKeySet
  79:          2419          49728  [Lsun.reflect.generics.tree.FieldTypeSignature;

alt text

标红的这个类org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker非常值得怀疑,有9504个实例,从类名看是一个实现了Reactor模式的http客户端,了解Reactor模式的同学都知道,其包含了一个accept线程处理连接事件,NIO bounding的线程处理readwrite事件,和Mworker线程处理业务逻辑。

进一步查找

接下来需要查找哪里使用了这个http客户端,在build.gradle的文件里看到依赖了EShttp client

group: 'org.elasticsearch.client', name: 'elasticsearch-rest-client', version: '6.3.2'

通过全局搜索,发现了如下的代码:

private String get_realtime_data_from_es (String seller_id) throws IOException{
    Date current_date = new Date( );
    SimpleDateFormat sdf = new SimpleDateFormat ("yyyyMMdd");
    String date=sdf.format(current_date);
    RestClient restClient = RestClient.builder(
            new HttpHost("xxxxxxxxx", 9200, "http"),
            new HttpHost("xxxxxxxxx", 9201, "http")).build();
    org.elasticsearch.client.Response response = restClient.performRequest("GET", "/seller/"+date+"/"+seller_id);
    restClient.close();
    return EntityUtils.toString(response.getEntity());
}

笔者发现这里调用ESRestClient的非常可疑,居然是每次调用创建一个对象,而不是使用单例模式。

使用过RestTemplate或者HttpClient的同学都知道,Http客户端一般都是通过单例的方式注入到Spring容器中,客户端都是线程安全的,多线程情况下不会出现串包的情况。(具体线程安全的实现机制,可参考笔者之前的博客:《Http请求连接池-HttpClient的AbstractConnPool源码分析》,地址:https://segmentfault.com/a/11...

源码探究

直接进到RestClientBuilder类的build()方法中一探究竟,代码如下:

public RestClient build() {
    if (failureListener == null) {
        failureListener = new RestClient.FailureListener();
    }
    CloseableHttpAsyncClient httpClient = AccessController.doPrivileged(new PrivilegedAction<CloseableHttpAsyncClient>() {
        @Override
        public CloseableHttpAsyncClient run() {
            return createHttpClient();
        }
    });
    RestClient restClient = new RestClient(httpClient, maxRetryTimeout, defaultHeaders, hosts, pathPrefix, failureListener);
    httpClient.start();
    return restClient;
}

build方法是实际创建RestClient的地方,设置了超时时间、host等参数。再点进start方法,他是抽象类CloseableHttpAsyncClient的一个抽象方法,具体实现在实现类CloseableHttpAsyncClientBase中,如下:

@Override
public void start() {
    if (this.status.compareAndSet(Status.INACTIVE, Status.ACTIVE)) {
        if (this.reactorThread != null) {
            this.reactorThread.start();
        }
    }
}

其调用了成员变量reactorThreadstart方法,而成员变量reactorThread 是一个Thread类,它在构造方法中初始化,如下:

public CloseableHttpAsyncClientBase(
        final NHttpClientConnectionManager connmgr,
        final ThreadFactory threadFactory,
        final NHttpClientEventHandler handler) {
    super();
    this.connmgr = connmgr;
    if (threadFactory != null && handler != null) {
        this.reactorThread = threadFactory.newThread(new Runnable() {

            @Override
            public void run() {
                try {
                    final IOEventDispatch ioEventDispatch = new InternalIODispatch(handler);
                    connmgr.execute(ioEventDispatch);
                } catch (final Exception ex) {
                    log.error("I/O reactor terminated abnormally", ex);
                } finally {
                    status.set(Status.STOPPED);
                }
            }

        });
    } else {
        this.reactorThread = null;
    }
    this.status = new AtomicReference<Status>(Status.INACTIVE);
} 

可见,每创建一个CloseableHttpAsyncClient对象,就会创建一个reactorThread线程,而connmgr.execute(ioEventDispatch)是一个永久for循环执行的方法,所以线程的run方法不会主动退出,即,reactorThread线程不会销毁。

问题解决

找到bug所在后,结局方案很简单,只需将ESRestClient通过单例的方式注入到服务中,即可解决堆外内存泄漏的问题。

总结

查找bug时,需要从多个方面去定位,通过尽可能多的现场信息去定量分析,监控、线上机器、源码的查看都需要,深入下去,会有不少收获。

原文链接

https://segmentfault.com/a/11...

Logo

K8S/Kubernetes社区为您提供最前沿的新闻资讯和知识内容

更多推荐