Recursive logging issue when using Opencensus with FastAPI

I have an issue with Opencensus, logging in Python, and FastAPI. I want to log incoming requests to Application Insights in Azure, so I added a FastAPI middleware to my code following the Microsoft docs and this GitHub post:

import os

from fastapi import FastAPI, Request
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace.attributes_helper import COMMON_ATTRIBUTES
from opencensus.trace.propagation.trace_context_http_header_format import TraceContextPropagator
from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.span import SpanKind
from opencensus.trace.tracer import Tracer

app = FastAPI()

# Attribute keys used on the request span
HTTP_HOST = COMMON_ATTRIBUTES['HTTP_HOST']
HTTP_METHOD = COMMON_ATTRIBUTES['HTTP_METHOD']
HTTP_PATH = COMMON_ATTRIBUTES['HTTP_PATH']
HTTP_ROUTE = COMMON_ATTRIBUTES['HTTP_ROUTE']
HTTP_URL = COMMON_ATTRIBUTES['HTTP_URL']
HTTP_STATUS_CODE = COMMON_ATTRIBUTES['HTTP_STATUS_CODE']

propagator = TraceContextPropagator()

@app.middleware('http')
async def middleware_opencensus(request: Request, call_next):
    tracer = Tracer(
        span_context=propagator.from_headers(request.headers),
        exporter=AzureExporter(connection_string=os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']),
        sampler=AlwaysOnSampler(),
        propagator=propagator)

    with tracer.span('main') as span:
        span.span_kind = SpanKind.SERVER
        tracer.add_attribute_to_current_span(HTTP_HOST, request.url.hostname)
        tracer.add_attribute_to_current_span(HTTP_METHOD, request.method)
        tracer.add_attribute_to_current_span(HTTP_PATH, request.url.path)
        tracer.add_attribute_to_current_span(HTTP_ROUTE, request.url.path)
        tracer.add_attribute_to_current_span(HTTP_URL, str(request.url))

        response = await call_next(request)
        tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)

    return response

This works great when running locally, and all incoming requests to the API are logged to Application Insights. However, since implementing Opencensus, an issue appears after a couple of days (~3 days) when deployed in a Container Instance on Azure: some kind of recursive logging issue seems to occur (+30,000 logs per second!), stating, among other things, "Queue is full. Dropping telemetry", until it eventually crashes after a couple of hours of furious logging.

Our logger.py file, in which the log handlers are defined, is as follows:

import logging.config
import os
import tqdm
from pathlib import Path
from opencensus.ext.azure.log_exporter import AzureLogHandler


class TqdmLoggingHandler(logging.Handler):
    """
        Class for enabling logging during a process with a tqdm progress bar.
        Using this handler, logs will be put above the progress bar, pushing the
        progress bar down instead of replacing it.
    """
    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.formatter = logging.Formatter(fmt='%(asctime)s <%(name)s> %(levelname)s: %(message)s',
                                           datefmt='%d-%m-%Y %H:%M:%S')

    def emit(self, record):
        try:
            msg = self.format(record)
            tqdm.tqdm.write(msg)
            self.flush()
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            self.handleError(record)


logging_conf_path = Path(__file__).parent
logging.config.fileConfig(logging_conf_path / 'logging.conf')

logger = logging.getLogger(__name__)
logger.addHandler(TqdmLoggingHandler(logging.DEBUG))  # Add tqdm handler to root logger to replace the stream handler
if os.getenv('APPLICATION_INSIGHTS_CONNECTION_STRING'):
    logger.addHandler(AzureLogHandler(connection_string=os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']))

warning_level_loggers = ['urllib3', 'requests']
for lgr in warning_level_loggers:
    logging.getLogger(lgr).setLevel(logging.WARNING)

Does anyone have an idea what the possible cause of this could be, or has anyone experienced something similar? Because of how fast the logging goes, I don't know what the 'first' error log is.

Please let me know if more information is required.

Thanks in advance!

We decided to revisit this issue and found two helpful threads describing similar, if not identical, behaviour to what we were seeing:

As described in the second thread, Opencensus apparently tries to send traces to Application Insights and, if that fails, the failed logs are batched and sent again after 15 seconds (the default). This keeps going indefinitely until it succeeds, which can result in a huge and seemingly recursive spam of failure logs.

The solution to this issue, introduced and proposed by Izchen in this comment, is to set enable_local_storage=False.
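
A minimal sketch of what that could look like for the AzureExporter used in the middleware above (this assumes the installed opencensus-ext-azure version supports the enable_local_storage option):

# Sketch: disable Opencensus' local retry storage so failed batches are not
# persisted and re-sent indefinitely (assumption: the installed
# opencensus-ext-azure version supports enable_local_storage).
exporter = AzureExporter(
    connection_string=os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING'],
    enable_local_storage=False)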

Another solution would be to migrate to OpenTelemetry, which should not contain this potential issue and is the solution we are currently running. Keep in mind that Opencensus is still the application-monitoring solution officially supported by Microsoft and that OpenTelemetry is still quite young. OpenTelemetry does seem to have a lot of support behind it, though, and is growing in popularity.

As for the OpenTelemetry implementation, we did the following to trace our requests:

if os.getenv('APPLICATION_INSIGHTS_CONNECTION_STRING'):
    from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter
    from opentelemetry import trace
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.propagate import extract
    from opentelemetry.sdk.resources import SERVICE_NAME, SERVICE_NAMESPACE, SERVICE_INSTANCE_ID, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider()

    processor = BatchSpanProcessor(AzureMonitorTraceExporter.from_connection_string(
        os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']))
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    FastAPIInstrumentor.instrument_app(app)
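
The Resource-related imports above (SERVICE_NAME, SERVICE_NAMESPACE, SERVICE_INSTANCE_ID) can be used to label the exported telemetry. A minimal sketch of building the provider with a resource instead of the bare TracerProvider() above, using placeholder values:

# Sketch: attach resource attributes to the tracer provider so the exported
# traces carry a service name/namespace/instance (values are placeholders).
resource = Resource.create({
    SERVICE_NAME: 'my-fastapi-service',
    SERVICE_NAMESPACE: 'my-namespace',
    SERVICE_INSTANCE_ID: 'instance-001',
})
provider = TracerProvider(resource=resource)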

OpenTelemetry supports many custom Instrumentors that can be used to create spans for e.g. Requests, PyMongo, Elastic, Redis, and so on => https://opentelemetry.io/registry/.
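
For example, a minimal sketch of instrumenting outgoing calls made with the requests library, assuming the opentelemetry-instrumentation-requests package is installed:

from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Sketch: creates client spans for every outgoing `requests` call, exported
# through the tracer provider configured above.
RequestsInstrumentor().instrument()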

If you would rather write your own custom tracers/spans like in the OpenCensus example above, you could try something like this:

# These attribute keys still come from Opencensus for convenience
from opencensus.trace.attributes_helper import COMMON_ATTRIBUTES
HTTP_HOST = COMMON_ATTRIBUTES['HTTP_HOST']
HTTP_METHOD = COMMON_ATTRIBUTES['HTTP_METHOD']
HTTP_PATH = COMMON_ATTRIBUTES['HTTP_PATH']
HTTP_ROUTE = COMMON_ATTRIBUTES['HTTP_ROUTE']
HTTP_URL = COMMON_ATTRIBUTES['HTTP_URL']
HTTP_STATUS_CODE = COMMON_ATTRIBUTES['HTTP_STATUS_CODE']

provider = TracerProvider()

processor = BatchSpanProcessor(AzureMonitorTraceExporter.from_connection_string(
        os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

@app.middleware('http')
async def middleware_opentelemetry(request: Request, call_next):
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span('main',
                                      context=extract(request.headers),
                                      kind=trace.SpanKind.SERVER) as span:
        span.set_attributes({
            HTTP_HOST: request.url.hostname,
            HTTP_METHOD: request.method,
            HTTP_PATH: request.url.path,
            HTTP_ROUTE: request.url.path,
            HTTP_URL: str(request.url)
        })

        response = await call_next(request)
        span.set_attribute(HTTP_STATUS_CODE, response.status_code)

    return response

This solution no longer requires the AzureLogHandler in our logger.py configuration, so it was removed there.

Some other sources that could be helpful: