How is beginOffset calculated for the Google Natural Language entities API response?
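I am using Google's Natural Language analyzeEntities API, and the response contains a nested EntityMention.TextSpan object with 2 fields: content and beginOffset.
I want to use beginOffset for further analysis. So I tried to map the word indexes in the original text and compare them with beginOffset, but I noticed that the indexes differ.
I am using a rather naive approach to build this index: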
const msg = "it will cost you $350 - $600,. test. Alexander. How are you?"
let index = 0
// Walk the space-separated parts and print each one with its starting index
msg.split(" ").forEach(part => {
  console.log(part + ":" + index)
  index = index + part.length + 1 // + 1 for the split on space
})
The result is:
it:0
will:3
cost:8
you:13
$350:17
-:22
$600,.:24
test.:31
Alexander.:37
How:48
are:52
you?:56
The result I get from the analyzeEntities API is:
gcloud ml language analyze-entities --content="it will cost you $350 - $600,. test. Alexander. How are you?"
{
"entities": [
{
"mentions": [
{
"text": {
"beginOffset": 23,
"content": "test"
},
"type": "COMMON"
}
],
"metadata": {},
"name": "test",
"salience": 0.7828024,
"type": "OTHER"
},
{
"mentions": [
{
"text": {
"beginOffset": 29,
"content": "Alexander"
},
"type": "PROPER"
}
],
"metadata": {},
"name": "Alexander",
"salience": 0.2171976,
"type": "PERSON"
}
],
"language": "en"
}
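I understand that non-alphanumeric characters have special meaning and handling, but I expected the offset to represent the real index in the text.
Since it does not, what are the rules for parsing the query text, and how is beginOffset calculated?
Thanks!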
You can control the encoding that is used to calculate the offsets in the request. (Encoding type: https://cloud.google.com/natural-language/docs/analyzing-entities#language-entities-string-protocol).
For Python, you need to set it to UTF32 (https://cloud.google.com/natural-language/docs/reference/rest/v1/EncodingType). gcloud uses UTF-8 encoding, which basically gives you byte-level offsets.
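To make the effect of the encoding concrete, here is a minimal JavaScript sketch (JavaScript because the question uses it) that computes where a substring starts when counted in UTF-8 bytes, UTF-16 code units, and Unicode code points. The helper name beginOffsets and the second sample string are made up for illustration; the sketch only mimics how the three encoding types count, it does not call the API. Since JavaScript strings are indexed in UTF-16 code units, UTF16 is the encoding type that should line up with indexOf.

```javascript
// Hypothetical helper (not part of any Google library): where `needle` starts
// in `text` when the offset is counted the way each EncodingType counts.
function beginOffsets(text, needle) {
  const utf16 = text.indexOf(needle); // JS string indexes are UTF-16 code units
  if (utf16 === -1) return null;
  const prefix = text.slice(0, utf16);
  return {
    UTF8: new TextEncoder().encode(prefix).length, // number of bytes before the match
    UTF16: utf16,                                  // number of UTF-16 code units
    UTF32: Array.from(prefix).length,              // number of Unicode code points
  };
}

// Pure ASCII text: all three offsets agree.
const ascii = "it will cost you $350 - $600,. test. Alexander. How are you?";
console.log(beginOffsets(ascii, "test")); // { UTF8: 31, UTF16: 31, UTF32: 31 }

// Made-up sample with an emoji and an accented letter: the offsets diverge.
const mixed = "\u{1F600} caf\u00e9 test";
console.log(beginOffsets(mixed, "test")); // { UTF8: 11, UTF16: 8, UTF32: 7 }
```

For the all-ASCII sentence in the question the three encodings give the same number, so the encoding alone would not explain an offset of 23 instead of 31.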
It looks like the $ sign is the issue here.
gcloud ml language analyze-entities --content="it will cost you $350 - $600,. test. Alexander. How are you?"
{
"entities": [
{
"mentions": [
{
"text": {
"beginOffset": 31,
"content": "test"
},
"type": "COMMON"
}
],
"metadata": {},
"name": "test",
"salience": 0.7828024,
"type": "OTHER"
},
{
"mentions": [
{
"text": {
"beginOffset": 37,
"content": "Alexander"
},
"type": "PROPER"
}
],
"metadata": {},
"name": "Alexander",
"salience": 0.2171976,
"type": "PERSON"
},
{
"mentions": [
{
"text": {
"beginOffset": 17,
"content": "$350"
},
"type": "TYPE_UNKNOWN"
}
],
"metadata": {
"currency": "USD",
"value": "350.000000"
},
"name": "$350",
"salience": 0.0,
"type": "PRICE"
},
{
"mentions": [
{
"text": {
"beginOffset": 24,
"content": "$600"
},
"type": "TYPE_UNKNOWN"
}
],
"metadata": {
"currency": "USD",
"value": "600.000000"
},
"name": "$600",
"salience": 0.0,
"type": "PRICE"
},
{
"mentions": [
{
"text": {
"beginOffset": 18,
"content": "350"
},
"type": "TYPE_UNKNOWN"
}
],
"metadata": {
"value": "350"
},
"name": "350",
"salience": 0.0,
"type": "NUMBER"
},
{
"mentions": [
{
"text": {
"beginOffset": 25,
"content": "600"
},
"type": "TYPE_UNKNOWN"
}
],
"metadata": {
"value": "600"
},
"name": "600",
"salience": 0.0,
"type": "NUMBER"
}
],
"language": "en"
}
If you change the $ sign to #, it appears to work as expected.
gcloud ml language analyze-entities --content="it will cost you #350 - #600,. test. Alexander. How are you?"
{
"entities": [
{
"mentions": [
{
"text": {
"beginOffset": 31,
"content": "test"
},
"type": "COMMON"
}
],
"metadata": {},
"name": "test",
"salience": 0.9085014,
"type": "OTHER"
},
{
"mentions": [
{
"text": {
"beginOffset": 37,
"content": "Alexander"
},
"type": "PROPER"
}
],
"metadata": {},
"name": "Alexander",
"salience": 0.09149864,
"type": "PERSON"
},
{
"mentions": [
{
"text": {
"beginOffset": 18,
"content": "350"
},
"type": "TYPE_UNKNOWN"
}
],
"metadata": {
"value": "350"
},
"name": "350",
"salience": 0.0,
"type": "NUMBER"
},
{
"mentions": [
{
"text": {
"beginOffset": 25,
"content": "600"
},
"type": "TYPE_UNKNOWN"
}
],
"metadata": {
"value": "600"
},
"name": "600",
"salience": 0.0,
"type": "NUMBER"
}
],
"language": "en"
}
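One way to check whether the offsets in a response line up with a given text is to test that each mention's content really starts at its beginOffset. The sketch below does that for a trimmed copy of the first response in the question (offsets 23 and 29). The function name misalignedMentions is made up for illustration. The second half rests on an assumption the thread does not confirm: that the shell expanded $350 and $600 inside the double-quoted --content argument to empty strings before gcloud received the text.

```javascript
// Hypothetical helper: returns the mentions whose beginOffset does not point
// at their content in `text`. Assumes offsets are counted in UTF-16 code units
// (equivalent to bytes here, because the text is pure ASCII).
function misalignedMentions(text, response) {
  const bad = [];
  for (const entity of response.entities) {
    for (const mention of entity.mentions) {
      const { beginOffset, content } = mention.text;
      if (!text.startsWith(content, beginOffset)) bad.push(mention);
    }
  }
  return bad;
}

const intended = "it will cost you $350 - $600,. test. Alexander. How are you?";

// Trimmed copy of the first response in the question.
const response = {
  entities: [
    { name: "test", mentions: [{ text: { beginOffset: 23, content: "test" } }] },
    { name: "Alexander", mentions: [{ text: { beginOffset: 29, content: "Alexander" } }] },
  ],
};

console.log(misalignedMentions(intended, response).length); // 2: neither offset matches

// Assumption: the shell replaced "$350" and "$600" with empty strings.
const received = intended.replace(/\$\d+/g, "");
console.log(misalignedMentions(received, response).length); // 0: both offsets match
```

If that assumption holds, the offsets are correct for the text the API actually received, and single-quoting the --content value or escaping each $ should keep the amounts intact.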