"United States" 不是 ["United","States"]

"United States" not ["United","States"]

I have a text field in Elasticsearch, and I want to visualize a word cloud from it in Kibana...

The first step is to tokenize the text, and I used the "standard tokenizer"... With this approach, the resulting word cloud looks like the image below:

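But what I need is for proper nouns such as "United States", "United Nations", "Security Council" and ... not to be split apart. I want a word cloud like this: * Proper nouns or phrases are roughly 2-5 words long (like "the People's Republic of China").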

What should I do? Is this related to N-grams?

Sample text:

The United States of America is a charter member of the United Nations and one of five permanent members of the UN Security Council.

The United States is host to the headquarters of the United Nations, which includes the usual meeting place of the General Assembly in New York City, the seat of the Security Council and several bodies of the United Nations. The United States is the largest provider of financial contributions to the United Nations, providing 22 percent of the entire UN budget in 2017 (in comparison the next biggest contributor is Japan with almost 10 percent, while EU countries pay a total of above 30 percent). From July 2016 to June 2017, 28.6 percent of the budget used for peacekeeping operations was provided by the United States. The United States had a pivotal role in establishing the UN.

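This is an NER (named entity recognition) task, not a standard tokenization task. There are some plugins that do this inside Elasticsearch, but none of them looks very promising.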

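To get this working, you need to preprocess your data on the application side. Use an NLP parser (Stanford CoreNLP, spaCy...) and extract the named entities. In your mapping, create a keyword field (call it entities, for example) where you store the entities extracted from each document as an array. You can then use that field to generate your word cloud.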
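
As a rough illustration of that preprocessing step, here is a minimal Python sketch. It assumes spaCy with the en_core_web_sm model is installed. The index layout, the "content" field name, the set of entity labels to keep, and all helper names are my own assumptions for the example; only the "entities" keyword field comes from the suggestion above. It only builds the request bodies, so you can send them with whatever Elasticsearch client you use.

```python
# Entity labels to keep for the word cloud. These are labels used by
# spaCy's English models; which ones you keep is a judgment call.
ENTITY_LABELS = {"GPE", "ORG", "NORP", "LOC", "PERSON", "FAC", "EVENT"}


def load_spacy():
    """Load a spaCy pipeline (assumes `pip install spacy` and the
    en_core_web_sm model have been installed beforehand)."""
    import spacy
    return spacy.load("en_core_web_sm")


def extract_entities(text, nlp):
    """Return the named entities found in `text`, de-duplicated and in
    order of first appearance. `nlp` is any callable that returns an
    object with an `.ents` iterable, such as a spaCy pipeline."""
    entities = []
    for ent in nlp(text).ents:
        if ent.label_ in ENTITY_LABELS and ent.text not in entities:
            entities.append(ent.text)
    return entities


def build_mapping():
    """Index body: full text for search, plus a keyword field that
    keeps each entity as one unsplit term."""
    return {
        "mappings": {
            "properties": {
                "content": {"type": "text"},       # hypothetical field name
                "entities": {"type": "keyword"},   # holds an array of entities
            }
        }
    }


def build_document(text, nlp):
    """Document body to index: the original text and its entities."""
    return {"content": text, "entities": extract_entities(text, nlp)}


def build_tag_cloud_query(size=50):
    """A terms aggregation on the keyword field, useful for checking
    outside Kibana which entities are most frequent."""
    return {
        "size": 0,
        "aggs": {"top_entities": {"terms": {"field": "entities", "size": size}}},
    }
```

You would call build_document(text, load_spacy()) for each document before indexing it, then point the Kibana tag cloud at the entities field. Because the field is a keyword, "United States" is stored and counted as a single term. Expect to tune the label set and clean up the extracted entities, since NER output is not perfect.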

Good luck.