Google Cloud Natural Language API: Does the analyzeSyntax API have a parameter to exclude "*_UNKNOWN" attribute values from each token's partOfSpeech result?

I'd like to know whether the API endpoint offers any way for the analyzeSyntax response JSON to omit sub-attributes of the partOfSpeech dictionaries when their value is *_UNKNOWN. Looking at the details of the document input, I can't find any way to limit the contents of partOfSpeech in the response.

Is this something that can only be handled post-response, when cleaning the data?

Example query, per the API docs, in a file named request.json:

{
  "encodingType": "UTF8",
  "document": {
    "type": "PLAIN_TEXT",
    "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.  Sundar Pichai said in his keynote that users love their new Android phones."
  }
}

Command executed:

curl "https://language.googleapis.com/v1/documents:analyzeSyntax?key=${API_KEY}" \
  -s \
  -X POST \
  -H "Content-Type: application/json" \
  --data-binary @request.json > response.json

Example response:

{
  "sentences": [
    {
      "text": {
        "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
        "beginOffset": 0
      }
    },
    {
      "text": {
        "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
        "beginOffset": 105
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Google",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 7,
        "label": "NSUBJ"
      },
      "lemma": "Google"
    },
    {
      "text": {
        "content": ",",
        "beginOffset": 6
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "NUMBER_UNKNOWN",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 0,
        "label": "P"
      },
      "lemma": ","
    },
...
...

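This response JSON is 819 lines long, of which 314 lines (nearly 40% of the response!) are partOfSpeech attributes with a *_UNKNOWN value. They carry no information, yet they significantly increase the amount of data in the response.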
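As a rough check of that figure, the share of *_UNKNOWN lines in a pretty-printed response can be counted directly. This is only a sketch: the helper name unknown_line_share is made up for illustration, and it assumes one attribute per line, as in the response shown above.

```python
def unknown_line_share(pretty_json: str) -> float:
    """Fraction of lines whose value is a string ending in _UNKNOWN.

    Assumes pretty-printed JSON with one key/value pair per line.
    """
    lines = pretty_json.splitlines()
    if not lines:
        return 0.0
    # Drop trailing whitespace and the optional trailing comma before matching.
    unknown = sum(
        1 for line in lines if line.rstrip().rstrip(",").endswith('_UNKNOWN"')
    )
    return unknown / len(lines)


# The numbers quoted above: 314 of 819 lines.
print(f"{314 / 819:.1%}")  # 38.3%
```

Calling unknown_line_share(open("response.json").read()) on the saved response would give the same kind of ratio for any other input text.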

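The docs don't seem to offer a parameter that helps with this. Am I missing something, or does this API simply not support an argument for dropping these keys when they are *_UNKNOWN? Is this something that can only be managed post-response, with data cleaning?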

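If we look at the API specification, we eventually find that the part-of-speech attributes are actually enumerations. For example, we find that Gender can be:

  • GENDER_UNKNOWN
  • FEMININE
  • MASCULINE
  • NEUTER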

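When you make a REST API call, you send and receive JSON payloads, and JSON represents enumerations by spelling out each value as a full string. However, REST with JSON is not the only protocol for making requests to GCP services. You can also make gRPC calls. With gRPC, the data on the wire is encoded as protocol buffers. The language bindings from Google let you make service calls over gRPC without being distracted by having to learn that technology. The value of gRPC is that messages are smaller and faster to transmit.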

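I don't see a mechanism at the API level for reducing what is transmitted (for example, asking for fields to be left out of the JSON response when using REST).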
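If the cleanup does have to happen after the response arrives, it can be a small filter over the parsed JSON. The following is a minimal sketch, not part of the API or its client libraries: the function name strip_unknown_pos is hypothetical, and it assumes the response has the shape shown in the question (a "tokens" list whose items each carry a "partOfSpeech" dictionary of string values).

```python
import json


def strip_unknown_pos(response: dict) -> dict:
    """Return a copy of an analyzeSyntax response in which every
    partOfSpeech attribute whose value ends in _UNKNOWN is dropped.

    Hypothetical helper; the input dictionary is left unmodified.
    """
    cleaned = dict(response)
    cleaned["tokens"] = [
        {
            **token,
            "partOfSpeech": {
                key: value
                for key, value in token.get("partOfSpeech", {}).items()
                if not str(value).endswith("_UNKNOWN")
            },
        }
        for token in response.get("tokens", [])
    ]
    return cleaned


# Abbreviated version of the first token from the question.
sample = {
    "tokens": [
        {
            "text": {"content": "Google", "beginOffset": 0},
            "partOfSpeech": {
                "tag": "NOUN",
                "aspect": "ASPECT_UNKNOWN",
                "number": "SINGULAR",
                "proper": "PROPER",
            },
            "lemma": "Google",
        }
    ]
}

cleaned = strip_unknown_pos(sample)
print(json.dumps(cleaned["tokens"][0]["partOfSpeech"]))
# {"tag": "NOUN", "number": "SINGULAR", "proper": "PROPER"}
```

For the saved file from the question, the same function could be applied to json.load(open("response.json")). Consumers of the cleaned data then need to treat a missing key as meaning "unknown".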

See also: