Google Cloud NL entity recognizer grouping words together

When trying to find entities in a long text input, Google Cloud's Natural Language API groups words together and then returns incorrect entities. This is my program:

import os
import sys

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import six


def entity_recognizer(nouns):
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/superaitor/Downloads/link"
    text = ""
    for words in nouns:
        text += words + " "
    client = language.LanguageServiceClient()

    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')

    document = types.Document(
        content=text.encode('utf-8'),
        type=enums.Document.Type.PLAIN_TEXT)

    encoding = enums.EncodingType.UTF32
    if sys.maxunicode == 65535:
        encoding = enums.EncodingType.UTF16

    entities = client.analyze_entities(document, encoding).entities
    entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
                   'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')

    for entity in entities:
        # if entity_type[entity.type] == "PERSON":
        print(entity_type[entity.type])
        print(entity.name)

Here nouns is a list of words. I then turn it into a string (I tried several ways, all giving the same result), but the program outputs the following:

PERSON
liberty secularism etching domain professor lecturer tutor royalty 
government adviser commissioner
OTHER
business view society economy
OTHER
business
OTHER
verge industrialization market system custom shift rationality
OTHER
family kingdom life drunkenness college student appearance income family 
brink poverty life writer variety attitude capitalism age process 
production factory system
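For illustration (with a shortened, made-up noun list), the text I pass in is just a flat space-separated string with no punctuation or sentence structure:

```python
# Hypothetical noun list standing in for the real input.
nouns = ["liberty", "secularism", "etching", "domain"]

# Same concatenation as in entity_recognizer(): one flat string,
# with nothing separating the nouns but spaces.
text = ""
for word in nouns:
    text += word + " "

print(repr(text))  # 'liberty secularism etching domain '
```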

Any thoughts on how to fix this?

Rather than classifying by entities, I would use Google's default categories directly, changing

entity = client.analyze_entities(document, encoding).entities

categories = client.classify_text(document).categories

and updating the code accordingly. I wrote the sample code below based on this tutorial, further developed in github.

def run_quickstart():
    # [START language_quickstart]
    # Imports the Google Cloud client library
    # [START migration_import]
    from google.cloud import language
    from google.cloud.language import enums
    from google.cloud.language import types
    # [END migration_import]

    # Instantiates a client
    # [START migration_client]
    client = language.LanguageServiceClient()
    # [END migration_client]

    # The text to analyze
    text = u'For its part, India has said it will raise taxes on 29 products imported from the US - including some agricultural goods, steel and iron products - in retaliation for the wide-ranging US tariffs.'
    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)

    # Detects the sentiment of the text
    sentiment = client.analyze_sentiment(document=document).document_sentiment

    # Classify content categories
    categories = client.classify_text(document).categories

    # User category feedback
    for category in categories:
        print(u'=' * 20) 
        print(u'{:<16}: {}'.format('name', category.name))
        print(u'{:<16}: {}'.format('confidence', category.confidence))

    # User sentiment feedback
    print('Text: {}'.format(text))
    print('Sentiment: {}, {}'.format(sentiment.score, sentiment.magnitude))
    # [END language_quickstart]


if __name__ == '__main__':
    run_quickstart()
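One caveat worth noting: classify_text rejects documents that are too short, so if you feed it arbitrary text it helps to guard on length first. A minimal sketch; the token threshold below is an assumption to check against the documentation, not an official constant:

```python
def long_enough(text, min_tokens=20):
    """Rough pre-check before calling classify_text: the service
    rejects documents with too few tokens. min_tokens is an
    assumed threshold, not an official constant."""
    return len(text.split()) >= min_tokens

print(long_enough("too short"))   # False
print(long_enough("word " * 25))  # True
```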

Does this solution work for you? If not, why?

To analyze entities in a text, you can use the sample from the documentation, as follows:

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import six

def entities_text(text):
    """Detects entities in the text."""
    client = language.LanguageServiceClient()

    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')

    # Instantiates a plain text document.
    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)

    # Detects entities in the document. You can also analyze HTML with:
    #   document.type == enums.Document.Type.HTML
    entities = client.analyze_entities(document).entities

    # entity types from enums.Entity.Type
    entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
                   'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')

    for entity in entities:
        print('=' * 20)
        print(u'{:<16}: {}'.format('name', entity.name))
        print(u'{:<16}: {}'.format('type', entity_type[entity.type]))
        print(u'{:<16}: {}'.format('metadata', entity.metadata))
        print(u'{:<16}: {}'.format('salience', entity.salience))
        print(u'{:<16}: {}'.format('wikipedia_url',
              entity.metadata.get('wikipedia_url', '-')))

entities_text("Donald Trump is president of United States of America")

The output of this sample is:

====================
name            : Donald Trump
type            : PERSON
metadata        : <google.protobuf.pyext._message.ScalarMapContainer object at 0x7fd9d0125170>
salience        : 0.9564903974533081
wikipedia_url   : https://en.wikipedia.org/wiki/Donald_Trump
====================
name            : United States of America
type            : LOCATION
metadata        : <google.protobuf.pyext._message.ScalarMapContainer object at 0x7fd9d01252b0>
salience        : 0.04350961744785309
wikipedia_url   : https://en.wikipedia.org/wiki/United_States

As you can see in this example, entity analysis inspects the given text for known entities (proper nouns, such as public figures, landmarks, and so on). It does not return an entity for every word in the text.
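If you then want only one entity type (as in the commented-out PERSON check in the question), the filtering itself is plain Python. A minimal sketch using the same tuple-index mapping as above, run on made-up (name, numeric type) pairs instead of real API results:

```python
# Mirrors the ordering of enums.Entity.Type used in the samples above.
entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
               'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')

# Stand-ins for entity objects: (name, numeric type) pairs.
fake_entities = [("Donald Trump", 1),
                 ("United States of America", 2),
                 ("business", 7)]

# Keep only PERSON entities; note == rather than `is` for
# comparing strings.
people = [name for name, t in fake_entities
          if entity_type[t] == 'PERSON']
print(people)  # ['Donald Trump']
```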