Google Cloud Natural Language API: Does analyzeSyntax API have a param for tokens to not include "*_UNKNOWN" value attributes in partOfSpeech result

Question

I'd like to know whether there is any way to have the analyzeSyntax API endpoint omit sub-attributes of the partOfSpeech dictionaries from the response JSON when their value is *_UNKNOWN. Looking at the details of the document input, I can't find any way to limit the contents of partOfSpeech in the response.

Is this something that can only be handled post-response, when cleaning the data?

Example query, per the API docs, in a file named request.json:

{
  "encodingType": "UTF8",
  "document": {
    "type": "PLAIN_TEXT",
    "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.  Sundar Pichai said in his keynote that users love their new Android phones."
  }
}

Command executed:

curl "https://language.googleapis.com/v1/documents:analyzeSyntax?key=${API_KEY}" \
  -s \
  -X POST \
  -H "Content-Type: application/json" \
  --data-binary @request.json > response.json

Example response:

{
  "sentences": [
    {
      "text": {
        "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
        "beginOffset": 0
      }
    },
    {
      "text": {
        "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
        "beginOffset": 105
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Google",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 7,
        "label": "NSUBJ"
      },
      "lemma": "Google"
    },
    {
      "text": {
        "content": ",",
        "beginOffset": 6
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "NUMBER_UNKNOWN",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 0,
        "label": "P"
      },
      "lemma": ","
    },
...
...

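This response JSON is 819 lines long, of which 314 lines (nearly 40% of the response!) are *_UNKNOWN values of partOfSpeech attributes. They are completely useless to me, yet they significantly increase the amount of data in the response.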
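A rough way to reproduce that kind of count on a pretty-printed response is sketched below. The helper name `unknown_line_share` and the one-token sample are illustrative assumptions, not part of the API; the check is approximate, since it simply counts lines that contain a string ending in `_UNKNOWN`.

```python
import json
from typing import Tuple


def unknown_line_share(pretty_json: str) -> Tuple[int, int]:
    """Return (lines holding a *_UNKNOWN string value, total lines).

    Approximate: any line containing a string that ends in _UNKNOWN is counted.
    """
    lines = pretty_json.splitlines()
    unknown = sum(1 for line in lines if '_UNKNOWN"' in line)
    return unknown, len(lines)


# Hypothetical fragment shaped like one partOfSpeech entry from the response.
sample = json.dumps(
    {"partOfSpeech": {"tag": "NOUN", "aspect": "ASPECT_UNKNOWN", "number": "SINGULAR"}},
    indent=2,
)
unknown, total = unknown_line_share(sample)
print(f"{unknown} of {total} lines ({unknown / total:.0%}) are *_UNKNOWN values")
```

To run it against the real output, pass the contents of response.json (read as text) instead of `sample`.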

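The documentation doesn't seem to offer a parameter that helps with this. Am I missing something, or does the API not support an argument for dropping these keys when they are *_UNKNOWN? Is this something that can only be managed post-response with data cleaning?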

Answer 1

If we look at the API specification, we find that the part-of-speech attributes are actually enumerations. For example, we find that Gender can be:

GENDER_UNKNOWN
FEMININE
MASCULINE
NEUTER

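REST API calls send and receive JSON payloads, and JSON represents an enumeration value as its expanded string name. However, REST with JSON is not the only protocol for making requests to GCP services. You can also make gRPC calls. With gRPC, the messages are transmitted as protocol buffers. The language bindings from Google let you make service calls over gRPC without being distracted by having to learn the technology. The value of gRPC is that messages are smaller and faster.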

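I don't see a mechanism at the API level for this kind of transport compression (for example, asking that fields not be included in the JSON response when using REST).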
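If the cleanup does have to happen post-response, a minimal sketch could look like the following. The function name `strip_unknown_pos` is hypothetical, and the sketch assumes the response has the shape shown in the question (a `tokens` list whose entries may carry a `partOfSpeech` dictionary of string values). It only touches `partOfSpeech`, so token text that happens to end in `_UNKNOWN` is left alone.

```python
from typing import Any, Dict


def strip_unknown_pos(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an analyzeSyntax response without *_UNKNOWN partOfSpeech entries.

    The input is not modified. A bare "UNKNOWN" tag is kept, because only
    values ending in "_UNKNOWN" are dropped.
    """
    cleaned = dict(response)
    cleaned_tokens = []
    for token in response.get("tokens", []):
        token = dict(token)  # shallow copy so the caller's data stays intact
        if "partOfSpeech" in token:
            token["partOfSpeech"] = {
                key: value
                for key, value in token["partOfSpeech"].items()
                if not (isinstance(value, str) and value.endswith("_UNKNOWN"))
            }
        cleaned_tokens.append(token)
    if "tokens" in response:
        cleaned["tokens"] = cleaned_tokens
    return cleaned
```

For the response in the question, this could be applied by loading response.json with `json.load`, passing the result through `strip_unknown_pos`, and writing it back with `json.dump`. This shrinks only what you store or process afterwards; the full payload still travels over the wire.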

See also:

Comparing sizes of protobuf vs json

google-api

google-cloud-platform

google-natural-language