GCP Sentiment Analysis returns same score for 17 different documents, what am I doing wrong?
I am running a sentiment analysis on Google Cloud Platform over 17 different documents, but it gives me the same score for all of them, with a different magnitude for each.
This is my first time using this package, but as far as I can tell it should not be possible for all of them to have exactly the same score.
The documents are PDF files of varying size, all between 15 and 20 pages, of which I exclude 3 of the pages because they are not relevant.
I have already tried the code with other documents and it gives different scores for shorter ones, so I suspect there is a maximum length of document it can handle, but I could not find anything about it in the documentation or elsewhere.
import io

# Pre-2.0 google-cloud-language client, as used in the question
from google.cloud import language
from google.cloud.language import enums, types

# pdfminer (pdfminer.six) classes for the PDF text extraction below
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def analyze(text):
    # creds is a service-account credentials object defined elsewhere
    client = language.LanguageServiceClient(credentials=creds)
    document = types.Document(content=text,
                              type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    entities = client.analyze_entities(document=document).entities
    return sentiment, entities


def extract_text_from_pdf_pages(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        # Skip the first two pages and the last page; process the rest
        last_page = len(list(PDFPage.get_pages(fh, caching=True,
                                               check_extractable=True))) - 1
        for pgNum, page in enumerate(PDFPage.get_pages(fh,
                                                       caching=True,
                                                       check_extractable=True)):
            if pgNum not in [0, 1, last_page]:
                page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text
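
For context, the two functions above would be combined with a small driver loop that is not shown in the question; an assumed sketch (file names are placeholders) of how output like the results below is produced:

# Assumed driver (not in the original question): run the two functions above
# over a list of PDFs and print each document's score and magnitude.
pdf_paths = ['doc1.pdf', 'doc2.pdf']  # placeholder file names

for i, pdf_path in enumerate(pdf_paths, start=1):
    text = extract_text_from_pdf_pages(pdf_path)
    sentiment, entities = analyze(text)
    print('doc%d %s - %s' % (i, sentiment.score, sentiment.magnitude))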
Results (score, magnitude):

doc1  0.10000000149011612 - 147.5
doc2  0.10000000149011612 - 118.30000305175781
doc3  0.10000000149011612 - 144.0
doc4  0.10000000149011612 - 147.10000610351562
doc5  0.10000000149011612 - 131.39999389648438
doc6  0.10000000149011612 - 116.19999694824219
doc7  0.10000000149011612 - 121.0999984741211
doc8  0.10000000149011612 - 131.60000610351562
doc9  0.10000000149011612 - 97.69999694824219
doc10 0.10000000149011612 - 174.89999389648438
doc11 0.10000000149011612 - 138.8000030517578
doc12 0.10000000149011612 - 141.10000610351562
doc13 0.10000000149011612 - 118.5999984741211
doc14 0.10000000149011612 - 135.60000610351562
doc15 0.10000000149011612 - 127.0
doc16 0.10000000149011612 - 97.0999984741211
doc17 0.10000000149011612 - 183.5
I expected different results for all the documents, with at least small variations.
(I also think these magnitude scores are way too high compared to what I have found in the documentation and elsewhere.)
Yes, there are some quotas in the usage of the Natural Language API.
The Natural Language API processes text into a series of tokens, which roughly correspond to word boundaries. Attempting to process more than the token quota (100,000 tokens per query by default) does not produce an error, but any tokens beyond that quota are simply ignored.
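
One way to avoid silently losing content to that quota is to split a long document into chunks that stay well below the limit and aggregate the per-chunk results yourself. A minimal sketch, assuming the same pre-2.0 google-cloud-language client and creds object as in the question, and using a rough whitespace word count as a stand-in for the API's token count:

# Rough sketch: split long text into word-based chunks (an approximation of
# the API's tokens), analyze each chunk, then combine the results.
def analyze_in_chunks(text, max_words=50000):
    client = language.LanguageServiceClient(credentials=creds)
    words = text.split()
    chunks = [' '.join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]

    scores, magnitudes = [], []
    for chunk in chunks:
        document = types.Document(content=chunk,
                                  type=enums.Document.Type.PLAIN_TEXT)
        sentiment = client.analyze_sentiment(document=document).document_sentiment
        scores.append(sentiment.score)
        magnitudes.append(sentiment.magnitude)

    # Average the scores; sum the magnitudes, since magnitude is additive
    avg_score = sum(scores) / len(scores) if scores else 0.0
    return avg_score, sum(magnitudes)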
Regarding the second issue, it is difficult for me to assess the results of the Natural Language API without access to the documents. Perhaps you are getting very similar results because they are all quite neutral; I have run some tests with large neutral texts and got similar results.
To clarify, as stated in the Natural Language API documentation (see the sketch after this list for one way to compare magnitudes across documents of different lengths):
- documentSentiment contains the overall sentiment of the document, which consists of the following fields:
- score of the sentiment ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the overall emotional leaning of the text.
- magnitude indicates the overall strength of emotion (both positive and negative) within the given text, between 0.0 and +inf. Unlike score, magnitude is not normalized; each expression of emotion within the text (both positive and negative) contributes to the text's magnitude (so longer text blocks may have greater magnitudes).
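
Because magnitude grows with the length of the text, documents of different sizes are easier to compare at the sentence level, or with the magnitude normalized by the number of sentences. A sketch under the same assumptions as above (the sentences field of the analyze_sentiment response carries per-sentence sentiment):

# Sketch: use the sentence-level sentiment returned by analyze_sentiment to
# get a length-independent summary of a document.
def summarize_sentiment(text):
    client = language.LanguageServiceClient(credentials=creds)
    document = types.Document(content=text,
                              type=enums.Document.Type.PLAIN_TEXT)
    response = client.analyze_sentiment(document=document)

    doc = response.document_sentiment
    n_sentences = len(response.sentences) or 1

    return {
        'score': doc.score,
        'magnitude': doc.magnitude,
        # Magnitude per sentence is roughly comparable across documents of
        # different lengths, unlike the raw magnitude.
        'magnitude_per_sentence': doc.magnitude / n_sentences,
    }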