一个奇怪的 Bug 追踪:当异常失去声音时#
最近,我们的团队一直在使用第三方 SDK 进行一些开发任务。一切运行顺利,直到生产环境突然开始抛出错误,我们遇到了一些真正奇怪的事情。我的一个队友来找我说,“代码只是执行到一半就停止了,跳过了整个部分!”
有问题的代码出人意料地简单:
try {
log.info("initiate client with conf: {}", conf);
SDKClient client = new SDKClient(conf);
client.init();
log.info("client initiated");
} catch (Exception e) {
log.error("initiate client failed", e);
}
log.info("start to manipulate...");
我们发现客户端实际上没有成功初始化,导致所有后续业务逻辑失败。当我们检查日志时,这是我们发现的:
initiate client with conf: xxxxx
start to manipulate...
这正是我队友所说的"代码跳来跳去"的意思。我们在日志中从未看到 client initiated 或 initiate client failed - 它直接跳到了 start to manipulate...
使用 Arthas 进行侦探工作#
对于熟悉我们设置的人来说,我们在 k8s + Docker 上运行,每个镜像都内置了 Arthas,使用 Java 16 并启用 JFR。我们的日志包括跟踪信息,并通过 ELK Agent 拉取到集中式日志服务器。
由于 SDK 需要 IP 白名单才能远程访问,我们无法在本地直接测试他们的生产环境。我们对他们的测试环境的本地测试工作得很好。所以我们不得不依赖 Arthas 进行调查。
首先,让我们使用 jad 命令验证生产代码是否与我们本地看到的匹配:
jad fully.qualified.class.name
反编译后,我们确认代码与我们的源代码相同。
接下来,让我们观察实际的执行流程:
trace fully.qualified.class.name method
重新执行方法并检查跟踪后,我们发现初始化期间确实抛出了异常:
# 省略无关细节
+---[min=0.010174ms,max=0.01184ms,total=0.022014ms,count=2] org.apache.logging.log4j.Logger:info() #130
+---[min=599.388978ms,max=630.23967ms,total=1229.628648ms,count=2] com.dasha13.sdk.SDKClient:<init>() #131
+---[min=203.617545ms,max=221.785512ms,total=425.403057ms,count=2] com.dasha13.sdk.SDKClient:init() #132 [throws Exception,2]
+---[min=0.034798ms,max=0.084505ms,total=0.119303ms,count=2] org.apache.logging.log4j.Logger:error() #136
+---[min=0.010174ms,max=0.01184ms,total=0.022014ms,count=2] org.apache.logging.log4j.Logger:info() #138
但这里有个谜:为什么这个异常没有被记录? 让我们使用深度为 2 的 watch 命令进行更深入的研究,以捕获堆栈跟踪和消息:
watch com.dasha13.sdk.SDKClient init {throwExp} -x 2
然而,我们只得到了看起来像消息的输出:
method=com.dasha13.sdk.SDKClient init location=AtExceptionExit
ts=2021-08-10 02:58:15; [cost=131.20209ms] result=ERROR DATA!!! object class: class java.util.ArrayList, exception class: class com.google.common.util.concurrent.UncheckedExecutionException, exception message: java.lang.IllegalArgumentException
这很奇怪。通常,深度设置为 2 时,如果抛出异常,输出应该包括异常消息和堆栈跟踪。这里发生了什么?让我们尝试分别获取堆栈跟踪和消息。
首先,堆栈跟踪:
watch com.dasha13.sdk.SDKClient init {throwExp.getStackTrace()} -x 2
重新执行有问题的方法,堆栈跟踪正常输出。查看它,问题似乎与 Google 的依赖注入框架 Guice(类似于 Spring)加载 bean 失败有关:
ts=2021-08-10 03:03:37; [cost=146.644563ms] result=@ArrayList[
@StackTraceElement[][
@StackTraceElement[com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1025)],
@StackTraceElement[com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1051)],
@StackTraceElement[com.dasha13.sdk.SDKClient.init(SDKClient.java:482)],
# 其他堆栈跟踪省略
现在让我们检查异常消息:
watch com.dasha13.sdk.SDKClient init {throwExp.getMessage()} -x 2
重新执行方法,我们发现 watch 命令失败了:
watch failed, condition is: null, express is: {throwExp.getMessage()}, com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException, visit /app/arthas/arthas.log for more details.
按照建议,我们检查了 arthas 日志,发现了这个异常堆栈跟踪:
2021-08-10 03:07:11 [XNIO-2 task-3] ERROR c.t.a.c.command.express.OgnlExpress -Error during evaluating the expression:
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203)
at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4830)
at com.google.inject.internal.util.StackTraceElements.forMember(StackTraceElements.java:66)
at com.google.inject.internal.Errors.formatSource(Errors.java:806)
at com.google.inject.internal.Errors.formatSource(Errors.java:785)
at com.google.inject.internal.Errors.formatInjectionPoint(Errors.java:839)
at com.google.inject.internal.Errors.formatSource(Errors.java:800)
at com.google.inject.internal.Errors.formatSource(Errors.java:785)
at com.google.inject.internal.Errors.format(Errors.java:584)
at com.google.inject.ProvisionException.getMessage(ProvisionException.java:60)
cause by: MethodNotFoundException: Method not found: class com.google.common.xxxxxxxxx
啊哈!我们发现 ProvisionException 的 getMessage() 方法正在抛出异常 - 异常的 getMessage() 导致了异常! 查看根本原因,我们确定了 Guava 和 Guice 之间的版本不兼容。根本问题是第三方接口超时,导致初始化失败。这些异常被包装在 ProvisionException 中,但由于 ProvisionException 的 getMessage 依赖于 Guava Cache 来缓存异常信息,而我们项目的 Guava 版本与 Guice 版本不兼容,某些方法不存在。这导致 ProvisionException 的 getMessage 也抛出异常。以前这工作正常,因为第三方接口在初始化期间没有遇到超时。
为什么异常信息没有被记录#
我们使用 log4j2 的异步日志配置,将异常作为最后一个参数传递给日志方法。在正常情况下,这将输出异常的消息和堆栈跟踪。但正如我们上面发现的,获取消息本身正在抛出异常。Log4j 使用生产者-消费者架构处理日志事件。在这里,消费者尝试获取异常的消息和堆栈跟踪,但在获取消息时遇到异常。对于 Log4j2 异步日志,当发生异常时,原始日志事件被完全丢弃,异常被输出到 StatusLogger(这基本上进入标准错误输出)。这对应于 log4j 源代码:
private void tryCallAppender(final LogEvent event) {
try {
//调用 appender 输出日志
appender.append(event);
} catch (final RuntimeException error) {
//处理 RuntimeException
handleAppenderError(event, error);
} catch (final Exception error) {
//处理其他 Exceptions
handleAppenderError(event, new AppenderLoggingException(error));
}
}
private void handleAppenderError(final LogEvent event, final RuntimeException ex) {
appender.getHandler().error(createErrorMsg("An exception occurred processing Appender "), event, ex);
if (!appender.ignoreExceptions()) {
throw ex;
}
}
ErrorHandler 通常是默认实现 DefaultErrorHandler,它输出到 StatusLogger:
private static final Logger LOGGER = StatusLogger.getLogger();
public void error(final String msg, final LogEvent event, final Throwable t) {
final long current = System.nanoTime();
if (current - lastException > EXCEPTION_INTERVAL || exceptionCount++ < MAX_EXCEPTIONS) {
LOGGER.error(msg, t);
}
lastException = current;
if (!appender.ignoreExceptions() && t != null && !(t instanceof AppenderLoggingException)) {
throw new AppenderLoggingException(msg, t);
}
}
StatusLogger 基本上输出到标准错误 System.err:
this.logger = new SimpleLogger("StatusLogger", Level.ERROR, false, true, showDateTime, false,
dateFormat, messageFactory, PROPS,
//标准错误输出
System.err);
在我们的部署架构中,标准错误输出到一个相当隐蔽的位置,基本上没有人检查,所以我们错过了它。当我们最终查看标准错误输出时,我们确实找到了异常:
2021-08-10 03:30:29,810 Log4j2-TF-10-AsyncLoggerConfig-3 ERROR An exception occurred processing Appender file com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203)
at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4830)
at com.google.inject.internal.util.StackTraceElements.forMember(StackTraceElements.java:66)
at com.google.inject.internal.Errors.formatSource(Errors.java:806)
at com.google.inject.internal.Errors.formatSource(Errors.java:785)
at com.google.inject.internal.Errors.formatInjectionPoint(Errors.java:839)
at com.google.inject.internal.Errors.formatSource(Errors.java:800)
at com.google.inject.internal.Errors.formatSource(Errors.java:785)
at com.google.inject.internal.Errors.format(Errors.java:584)
at com.google.inject.ProvisionException.getMessage(ProvisionException.java:60)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:105)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:93)
at org.apache.logging.log4j.core.impl.Log4jLogEvent.getThrownProxy(Log4jLogEvent.java:629)
at org.apache.logging.log4j.core.pattern.ExtendedThrowablePatternConverter.format(ExtendedThrowablePatternConverter.java:63)
at org.springframework.boot.logging.log4j2.ExtendedWhitespaceThrowablePatternConverter.format(ExtendedWhitespaceThrowablePatternConverter.java:50)
at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38)
at org.apache.logging.log4j.core.layout.PatternLayout$PatternSerializer.toSerializable(PatternLayout.java:345)
at org.apache.logging.log4j.core.layout.PatternLayout.toText(PatternLayout.java:244)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:229)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:59)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:197)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:190)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:181)
at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:312)
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:543)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.callAppenders(AsyncLoggerConfig.java:127)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:502)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:485)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:121)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncLoggerConfigsOnCurrentThread(AsyncLoggerConfig.java:169)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:111)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:97)
at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalArgumentException
at com.google.inject.internal.asm.$ClassReader.<init>(Unknown Source)
at com.google.inject.internal.asm.$ClassReader.<init>(Unknown Source)
at com.google.inject.internal.asm.$ClassReader.<init>(Unknown Source)
at com.google.inject.internal.util.LineNumbers.<init>(LineNumbers.java:65)
at com.google.inject.internal.util.StackTraceElements$1.load(StackTraceElements.java:43)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
... 41 more
此外,基于 Appender 的 ignoreExceptions 配置(默认为 true),它确定调用位置是否应该抛出异常。然而,这是针对同步日志。异步日志将异常抛出到 Disruptor 的异常处理器,Log4j2 的 Disruptor 异常处理也将异常输出到 System.err(标准错误输出)。默认情况下,不抛出异常,因为对于同步日志,没有人希望业务逻辑仅仅因为日志问题而失败。对于异步日志,由于之前的处理已经输出到标准错误,不需要冗余。

