GCP Sentiment Analysis returns same score for 17 different documents, what am I doing wrong?

I am running Google Cloud Platform Sentiment Analysis on 17 different documents, but it gives me the same score with a different magnitude for each one. This is my first time using this package, but as far as I can tell, it shouldn't be possible for all of them to have the exact same score.

The documents are PDF files of different sizes, but between 15 and 20 pages each; I exclude three of the pages because they are not relevant.

I have tried the code with other documents, and it gives different scores for shorter documents, so I suspect there is a maximum document length it can handle, but I couldn't find anything about that in the documentation or elsewhere.
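
To test that hypothesis, one quick diagnostic is to log an approximate token count per document before sending anything to the API. A minimal sketch, using the extraction function shown below; pdf_paths is a placeholder list of file names, and splitting on whitespace is only a rough stand-in for the API's own tokenizer:

def approx_token_count(text):
    # Rough proxy only: the API tokenizes on approximate word
    # boundaries, so a whitespace split gives a ballpark figure.
    return len(text.split())

for path in pdf_paths:  # placeholder list of the 17 PDF paths
    text = extract_text_from_pdf_pages(path)
    print(path, approx_token_count(text))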

# Imports for the pre-2.0 google-cloud-language client used here
from google.cloud import language
from google.cloud.language import enums, types

def analyze(text):
    # creds holds service-account credentials created elsewhere
    client = language.LanguageServiceClient(credentials=creds)

    document = types.Document(content=text,
        type=enums.Document.Type.PLAIN_TEXT)

    sentiment = client.analyze_sentiment(document=document).document_sentiment
    entities = client.analyze_entities(document=document).entities

    return sentiment, entities


import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def extract_text_from_pdf_pages(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        # Index of the last page, so it can be skipped below
        last_page = sum(1 for _ in PDFPage.get_pages(fh, caching=True, check_extractable=True)) - 1

        # Skip the first two pages and the last page, extract the rest
        for pgNum, page in enumerate(PDFPage.get_pages(fh,
                                                       caching=True,
                                                       check_extractable=True)):
            if pgNum not in [0, 1, last_page]:
                page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text
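
For context, this is roughly how the two functions are wired together to produce the per-document results below (a minimal sketch; pdf_paths is again a placeholder list of the 17 file names):

for i, path in enumerate(pdf_paths, start=1):
    text = extract_text_from_pdf_pages(path)
    sentiment, entities = analyze(text)
    print('doc%d %s - %s' % (i, sentiment.score, sentiment.magnitude))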

Results (score, magnitude):

doc1  0.10000000149011612 - 147.5
doc2  0.10000000149011612 - 118.30000305175781
doc3  0.10000000149011612 - 144.0
doc4  0.10000000149011612 - 147.10000610351562
doc5  0.10000000149011612 - 131.39999389648438
doc6  0.10000000149011612 - 116.19999694824219
doc7  0.10000000149011612 - 121.0999984741211
doc8  0.10000000149011612 - 131.60000610351562
doc9  0.10000000149011612 - 97.69999694824219
doc10 0.10000000149011612 - 174.89999389648438
doc11 0.10000000149011612 - 138.8000030517578
doc12 0.10000000149011612 - 141.10000610351562
doc13 0.10000000149011612 - 118.5999984741211
doc14 0.10000000149011612 - 135.60000610351562
doc15 0.10000000149011612 - 127.0
doc16 0.10000000149011612 - 97.0999984741211
doc17 0.10000000149011612 - 183.5


I would expect different results for each document, at least small variations. (I also think these magnitude scores are too high compared to the values I have found in the documentation and elsewhere.)

Yes, there are some quotas in the usage of the Natural Language API:

The Natural Language API processes text into a series of tokens, which roughly correspond to word boundaries. Attempting to process tokens beyond the token quota (100,000 tokens per query by default) does not produce an error, but any tokens over that quota are ignored.
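
So if a PDF comes out above that limit, everything past the first 100,000 tokens is silently dropped. One workaround is to split the text and analyze each chunk separately. A minimal sketch, reusing the analyze() function from the question; the whitespace-based splitting only approximates the API's tokenizer, so the chunk size stays well under the quota:

def analyze_in_chunks(text, max_tokens=50000):
    # Stay well below the 100,000-token quota, since a whitespace
    # split only approximates the API's own tokenization.
    words = text.split()
    chunks = [' '.join(words[i:i + max_tokens])
              for i in range(0, len(words), max_tokens)]

    results = []
    for chunk in chunks:
        sentiment, _ = analyze(chunk)
        results.append((sentiment.score, sentiment.magnitude))
    return results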

For the second issue, it is hard for me to evaluate the Natural Language API results without access to the documents. Perhaps they are very neutral, and that is why you are getting such similar results. I have run some tests with large neutral texts and got similar results.
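
One way to check this on your side is to look at the per-sentence sentiment that analyze_sentiment already returns alongside the document-level score. A minimal sketch with the same pre-2.0 client as in the question; if positive and negative sentences roughly cancel out, the document score stays near neutral while the magnitude keeps growing:

def inspect_sentences(text):
    client = language.LanguageServiceClient(credentials=creds)
    document = types.Document(content=text,
        type=enums.Document.Type.PLAIN_TEXT)

    response = client.analyze_sentiment(document=document)
    for sentence in response.sentences:
        # Each sentence carries its own score and magnitude
        print(sentence.sentiment.score,
              sentence.sentiment.magnitude,
              sentence.text.content[:60])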

To clarify, as stated in the Natural Language API documentation:

  • documentSentiment contains the overall sentiment of the document, which consists of the following fields:
    • score of the sentiment ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the overall emotional leaning of the text.
    • magnitude indicates the overall strength of emotion (both positive and negative) within the given text, between 0.0 and +inf. Unlike score, magnitude is not normalized; each expression of emotion within the text (both positive and negative) contributes to the text's magnitude (so longer text blocks may have greater magnitudes).
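
Taken together, the two fields explain the numbers above: a score pinned near 0.1 with magnitudes around 100-180 is the typical signature of long, mixed-or-neutral text rather than a bug. As a rough interpretation aid (a sketch only; the thresholds below are illustrative and not from the documentation):

def describe_sentiment(score, magnitude):
    # Illustrative thresholds: a near-zero score with a high
    # magnitude means strong positive and negative passages
    # cancel out, while low magnitude means truly neutral text.
    if abs(score) >= 0.25:
        return 'positive' if score > 0 else 'negative'
    return 'mixed' if magnitude >= 2.0 else 'neutral'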