Cosmos Java SDK throwing exceptions with x-ms-substatus=10002

We are using Cosmos DB with Spring Data:

        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-spring-data-cosmos</artifactId>
            <version>3.1.0</version>
        </dependency>

        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-cosmos</artifactId>
            <version>4.8.0</version>
        </dependency>

For some requests (intermittently), we see errors such as:

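1. Failed to update item; nested exception is CosmosException

2. Failed to find item; nested exception is CosmosException: the queried item should be available.

3. DocumentProducer - Unexpected failure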

{userAgent=azsdk-java-cosmos/4.8.0 Linux/5.3.0-1035-azure JRE/1.8.0_291, error=null, resourceAddress='null', requestUri='null', statusCode=0, message=null, {"userAgent":"azsdk-java-cosmos/4.8.0 Linux/5.3.0-1035-azure JRE/1.8.0_291","requestLatencyInMs":60073,"requestStartTimeUTC":"2021-06-25T05:31:17.150Z","requestEndTimeUTC":"2021-06-25T05:32:17.223Z",
"connectionMode":"GATEWAY","responseStatisticsList":[],"supplementalResponseStatisticsList":[],"addressResolutionStatistics":{},"regionsContacted":[],"retryContext":{"retryCount":0,"statusAndSubStatusCodes":null,"retryLatency":0},"metadataDiagnosticsContext":{"metadataDiagnosticList":null},
"serializationDiagnosticsContext":{"serializationDiagnosticsList":[{"serializationType":"ITEM_SERIALIZATION","startTimeUTC":"2021-06-25T05:31:17.150Z","endTimeUTC":"2021-06-25T05:31:17.150Z","durationInMicroSec":0},{"serializationType":"PARTITION_KEY_FETCH_SERIALIZATION","startTimeUTC":"2021-06-25T05:31:17.150Z","endTimeUTC":"2021-06-25T05:31:17.150Z","durationInMicroSec":0}]},
"gatewayStatistics":{"sessionToken":null,"operationType":"Upsert","statusCode":0,"subStatusCode":10002,"requestCharge":null,"requestTimeline":null},"systemInformation":{"usedMemory":"264619 KB","availableMemory":"3789909 KB","systemCpuLoad":"(2021-06-25T05:31:50.445Z 2.0%), (2021-06-25T05:31:55.445Z 2.0%), 
(2021-06-25T05:32:00.445Z 2.0%), (2021-06-25T05:32:05.445Z 2.0%), (2021-06-25T05:32:10.445Z 2.0%), (2021-06-25T05:32:15.445Z 2.0%)"},"clientCfgs":{"id":0,"numberOfClients":2,"connCfg":{"rntbd":null,"gw":"(cps:1000, rto:PT5S, icto:PT1M, p:false)","other":"(ed: true, cs: false)"},
"consistencyCfg":"(consistency: null, mm: true, prgns: [southcentralus,westus])"}}, 

causeInfo=[class: class io.netty.handler.timeout.ReadTimeoutException, message: null], responseHeaders={x-ms-substatus=10002}, requestHeaders=[Accept=application/json, x-ms-date=Fri, 25 Jun 2021 05:31:17 GMT, x-ms-documentdb-partitionkey=["xxxxx"], x-ms-documentdb-is-upsert=true, Content-Type=application/json]}

The stack trace does not show much; I can attach it here if needed. In all of these cases, the only response header we see in the logs is:

responseHeaders={x-ms-substatus=10002}

We have the following questions about this:

Note: We are not using the Gremlin API, we connect in Gateway mode, and we see these errors very rarely, probably in fewer than 0.05% of our total Cosmos requests.

Substatus 10002 is related to HTTP timeouts. Since you are in Gateway mode, it makes sense that your operations go over HTTP.

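When designing or writing a distributed application, you always need to account for timeouts. The SDK does retry on timeouts (see https://docs.microsoft.com/azure/cosmos-db/troubleshoot-java-sdk-v4-sql#retry-logic-), but the application should always have some retry of its own for timeouts, because they can happen for many reasons (network glitches, sporadic resource contention).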
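An application-level retry along those lines could be sketched as follows. This is only a minimal illustration using the plain JDK, not anything provided by the Cosmos SDK or Spring Data: `TimeoutRetry`, `call` and `hasCauseNamed` are hypothetical names, and matching on a cause named `ReadTimeoutException` is an assumption based on the `causeInfo` field in your log.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper, not part of azure-cosmos or azure-spring-data-cosmos.
public final class TimeoutRetry {
    private TimeoutRetry() {}

    /**
     * Runs the operation, making at most maxAttempts attempts in total.
     * A failed attempt is retried only if isRetryable accepts the exception.
     * Waits initialBackoff before the first retry and doubles the wait
     * after every further failure.
     */
    public static <T> T call(Callable<T> operation,
                             Predicate<Throwable> isRetryable,
                             int maxAttempts,
                             Duration initialBackoff) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMillis = initialBackoff.toMillis();
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Give up on the last attempt or on a non-retryable failure.
                if (attempt >= maxAttempts || !isRetryable.test(e)) {
                    throw e;
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2;
            }
        }
    }

    /**
     * Walks the cause chain and reports whether any exception in it has the
     * given simple class name. Useful when a framework wraps the original
     * failure in its own exception ("nested exception is ...").
     */
    public static boolean hasCauseNamed(Throwable t, String simpleName) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c.getClass().getSimpleName().equals(simpleName)) {
                return true;
            }
        }
        return false;
    }
}
```

A call site might look like `TimeoutRetry.call(() -> repository.save(item), e -> TimeoutRetry.hasCauseNamed(e, "ReadTimeoutException"), 3, Duration.ofMillis(200))`, where `repository` and `item` are placeholders for your own code. Keep in mind that a timed-out write may still have been applied on the server, so blind retries are only safe for idempotent operations such as the upsert in your log.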

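You mention seeing this on less than 0.05% of total requests, which is not bad, but it is worth going through the areas covered in the Java request timeout guide: https://docs.microsoft.com/azure/cosmos-db/troubleshoot-request-timeout-java-sdk-v4-sql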

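Server-side latency spikes can also cause timeouts (unlikely, but possible), and that is one area where the support team you mentioned might be able to help. From the diagnostics, your request timeout appears to be 5 seconds, so any Gateway request that takes longer than 5 seconds would produce a timeout.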
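For reference, the 5 seconds comes from the `gw` entry under `clientCfgs` in your diagnostics, whose values are ISO-8601 durations. The sketch below just parses that string with the JDK. `GatewayCfg` is a hypothetical helper, and reading `rto` as the request timeout and `icto` as the idle connection timeout is my interpretation of the abbreviations, not something the log itself states.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for reading the "gw" string from the diagnostics.
public final class GatewayCfg {
    private GatewayCfg() {}

    /** Splits a string of the form "(k1:v1, k2:v2)" into an ordered map. */
    public static Map<String, String> parse(String cfg) {
        String body = cfg.trim();
        if (body.startsWith("(") && body.endsWith(")")) {
            body = body.substring(1, body.length() - 1);
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : body.split(",")) {
            // Split on the first ':' only, so values keep any later colons.
            String[] kv = pair.trim().split(":", 2);
            if (kv.length == 2) {
                out.put(kv[0].trim(), kv[1].trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Copied verbatim from the diagnostics in the question.
        Map<String, String> gw = parse("(cps:1000, rto:PT5S, icto:PT1M, p:false)");

        // Assumed meaning: rto = request timeout, icto = idle connection timeout.
        Duration requestTimeout = Duration.parse(gw.get("rto"));
        Duration idleTimeout = Duration.parse(gw.get("icto"));

        System.out.println(requestTimeout.getSeconds()); // 5
        System.out.println(idleTimeout.getSeconds());    // 60
    }
}
```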