How is beginOffset calculated for the Google Natural Language entities API response?

Question

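I am using Google's Natural Language analyzeEntities API. In the response there is a nested EntityMention.TextSpan object with two fields: content and beginOffset. I want to use beginOffset for further analysis, so I tried mapping the word indices in the original text and comparing them with beginOffset, but I noticed that the indices differ.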

I am building this index with a rather naive approach:

const msg = "it will cost you $350 - $600,. test. Alexander. How are you?"
let index = 0
msg.split(" ").forEach(part => {
  console.log(part + ":" + index)
  index = index + part.length + 1 // + 1 for the space removed by split
})

The result is:

it:0
will:3
cost:8
you:13
$350:17
-:22
$600,.:24
test.:31
Alexander.:37
How:48
are:52
you?:56

The result I get from the analyzeEntities API is:

gcloud ml language analyze-entities --content="it will cost you $350 - $600,. test. Alexander. How are you?"
{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 23,
            "content": "test"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "test",
      "salience": 0.7828024,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 29,
            "content": "Alexander"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Alexander",
      "salience": 0.2171976,
      "type": "PERSON"
    }
  ],
  "language": "en"
}

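I understand that non-alphanumeric characters may have special meaning and handling, but I expected the offset to represent the real index.

Since it does not, what are the rules for parsing the query text, and how is beginOffset calculated?

Thanks!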

Answer 1

You can control the encoding that is used to calculate offsets by setting it in the request (encodingType: https://cloud.google.com/natural-language/docs/analyzing-entities#language-entities-string-protocol). For Python, you need to set it to UTF32 (https://cloud.google.com/natural-language/docs/reference/rest/v1/EncodingType). gcloud uses UTF-8 encoding, which essentially gives you byte-level offsets.
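
To make the effect of the encoding concrete, here is a minimal sketch in JavaScript (run under Node.js, since it uses Buffer). The helper name offsetsFor is hypothetical and not part of any Google library. It only shows what one string position looks like when counted in bytes (UTF8), UTF-16 code units (UTF16), and code points (UTF32), which is how I understand the three EncodingType values to count.

```javascript
// Hypothetical helper: convert a JavaScript string index (UTF-16 code units)
// into the offset expected under each EncodingType value.
function offsetsFor(text, jsIndex) {
  const prefix = text.slice(0, jsIndex);
  return {
    UTF8: Buffer.byteLength(prefix, "utf8"), // number of bytes
    UTF16: prefix.length,                    // number of UTF-16 code units
    UTF32: Array.from(prefix).length,        // number of Unicode code points
  };
}

// Pure ASCII: all three encodings agree.
const ascii = "it will cost you $350 - $600,. test.";
console.log(offsetsFor(ascii, ascii.indexOf("test")));
// { UTF8: 31, UTF16: 31, UTF32: 31 }

// Non-ASCII text: "café", an emoji, then "test".
const mixed = "caf\u00e9 \u{1F600} test";
console.log(offsetsFor(mixed, mixed.indexOf("test")));
// { UTF8: 11, UTF16: 8, UTF32: 7 }
```

For pure ASCII input all three values are the same, so encoding alone would not explain a mismatch in text like the one in the question. JavaScript string indices count UTF-16 code units, so requesting UTF16 should line up with indexOf and slice.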

Answer 2

It looks like the $ sign is the problem here.

gcloud ml language analyze-entities --content="it will cost you $350 - $600,. test. Alexander. How are you?" 
{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 31,
            "content": "test"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "test",
      "salience": 0.7828024,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 37,
            "content": "Alexander"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Alexander",
      "salience": 0.2171976,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 17,
            "content": "$350"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "currency": "USD",
        "value": "350.000000"
      },
      "name": "$350",
      "salience": 0.0,
      "type": "PRICE"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 24,
            "content": "$600"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "currency": "USD",
        "value": "600.000000"
      },
      "name": "$600",
      "salience": 0.0,
      "type": "PRICE"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 18,
            "content": "350"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "value": "350"
      },
      "name": "350",
      "salience": 0.0,
      "type": "NUMBER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 25,
            "content": "600"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "value": "600"
      },
      "name": "600",
      "salience": 0.0,
      "type": "NUMBER"
    }
  ],
  "language": "en"
}

If you change the $ sign to #, it seems to work as expected.

gcloud ml language analyze-entities --content="it will cost you #350 - #600,. test. Alexander. How are you?" 
{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 31,
            "content": "test"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "test",
      "salience": 0.9085014,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 37,
            "content": "Alexander"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Alexander",
      "salience": 0.09149864,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 18,
            "content": "350"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "value": "350"
      },
      "name": "350",
      "salience": 0.0,
      "type": "NUMBER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 25,
            "content": "600"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "value": "600"
      },
      "name": "600",
      "salience": 0.0,
      "type": "NUMBER"
    }
  ],
  "language": "en"
}
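
One possible explanation for the offsets 23 and 29 reported in the question is that the text never reached the API intact. Inside double quotes a shell still performs parameter expansion, so $350 and $600 may have been replaced before gcloud ran. The sketch below assumes a shell that treats both as unset parameters and substitutes empty strings. That assumption, and the function name simulateUnsetExpansion, are mine and are not taken from the API documentation.

```javascript
// Text as typed inside the double quotes on the command line.
const typed = "it will cost you $350 - $600,. test. Alexander. How are you?";

// Assumption: the shell replaces every $<digits> token with an empty string
// because no such parameter is set. Other shells may expand it differently.
function simulateUnsetExpansion(text) {
  return text.replace(/\$\d+/g, "");
}

const received = simulateUnsetExpansion(typed);

for (const word of ["test", "Alexander"]) {
  console.log(word, typed.indexOf(word), received.indexOf(word));
}
// test 31 23
// Alexander 37 29
```

Under that assumption the indices match the API response in the question. Wrapping the --content value in single quotes, or escaping each $ with a backslash, stops a POSIX-style shell from expanding it.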


google-natural-language