How can I run this Polyglot token/tag extractor in PyCharm?
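I am evaluating various named entity recognition (NER) libraries and am trying out Polyglot.
Everything seems to be going well, but the instructions tell me to run this line at the command prompt: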
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en ner | tail -n 20
...which should give (in the example) this output:
, O
which O
was O
equalled O
five O
days O
ago O
by O
South I-LOC
Africa I-LOC
in O
their O
victory O
over O
West I-ORG
Indies I-ORG
in O
Sydney I-LOC
. O
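That is exactly the kind of output my project needs, and it works just the way I need it to. However, I need to run it from within PyCharm rather than from the command line, and store the results in a pandas DataFrame. How do I translate that command into Python?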
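This assumes Polyglot is installed correctly and the appropriate environment is selected in PyCharm. If it is not, install Polyglot with its required dependencies in a new conda environment, then create a new project in PyCharm and select that existing conda environment. If the language embeddings and NER models have not been downloaded yet, download them first.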
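The download step can also be done from Python instead of the command line. This is only a sketch: `model_packages` and `download_models` are hypothetical helper names, and it assumes the package identifiers follow the `embeddings2.<lang>` / `ner2.<lang>` naming used in the Polyglot documentation.

```python
def model_packages(lang):
    """Package identifiers (assumed naming) for one language's embeddings and NER model."""
    return ["embeddings2." + lang, "ner2." + lang]


def download_models(lang="en"):
    """Download the embeddings and NER model for `lang` through Polyglot's downloader."""
    # Imported lazily so that model_packages() works even before Polyglot is installed.
    from polyglot.downloader import downloader

    for package in model_packages(lang):
        downloader.download(package)


# Show which packages would be requested for English.
print(model_packages("en"))
```

Calling `download_models("en")` once inside the PyCharm project's interpreter should be enough; the models are stored on disk and reused by later runs.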
Code:
from polyglot.text import Text

blob = """, which was equalled five days ago by South Africa in the victory over West Indies in Sydney."""
text = Text(blob)
# Set the language explicitly so Polyglot does not have to guess it.
text.language = "en"

# All detected entities as a list
print("As list all detected entities")
print(text.entities)
print()

# Each detected entity on its own line
print("Separately shown detected entities")
for entity in text.entities:
    print(entity.tag, entity)
print()

# Tokenized words of the sentence
print("Tokenized words of sentence")
print(text.words)
print()

# Run named entity recognition on each token separately.
# Not very reliable: Polyglot may detect a single word as non-English and try another language.
# If the embeddings for that language are not installed, or text.language = "en" is commented out, it may raise an error.
print("For each token try named entity recognition")
for word in text.words:
    # Use a separate name so the sentence-level `text` object is not overwritten.
    word_text = Text(word)
    word_text.language = "en"
    for entity in word_text.entities:
        print(entity.tag, entity)
Output:
As list all detected entities
[I-LOC(['South', 'Africa']), I-ORG(['West', 'Indies']), I-LOC(['Sydney'])]
Separately shown detected entities
I-LOC ['South', 'Africa']
I-ORG ['West', 'Indies']
I-LOC ['Sydney']
Tokenized words of sentence
[',', 'which', 'was', 'equalled', 'five', 'days', 'ago', 'by', 'South', 'Africa', 'in', 'the', 'victory', 'over', 'West', 'Indies', 'in', 'Sydney', '.']
For each token try named entity recognition
I-LOC ['Africa']
I-PER ['Sydney']
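To get the one-token-per-row `token tag` layout of the command-line output, and from there a pandas DataFrame, the entity spans can be expanded back over the token list rather than tagging each word in isolation. The following is a minimal sketch, not Polyglot's own method: `tag_tokens` and `polyglot_token_frame` are hypothetical helper names, and it assumes each entity exposes `tag`, `start` and `end` attributes that index into `text.words` (with `end` exclusive). The demo at the bottom uses the tokens and entities from the output above, so it runs without Polyglot or pandas installed.

```python
def tag_tokens(words, spans, outside="O"):
    """Return (token, tag) pairs; `spans` holds (tag, start, end) with end exclusive."""
    tags = [outside] * len(words)
    for tag, start, end in spans:
        for i in range(start, end):
            tags[i] = tag
    return list(zip(words, tags))


def polyglot_token_frame(blob, lang="en"):
    """Build a token/tag DataFrame from raw text (needs Polyglot and pandas)."""
    # Imported lazily so tag_tokens() can be used and tested on its own.
    import pandas as pd
    from polyglot.text import Text

    text = Text(blob)
    text.language = lang
    # Assumption: entity.start / entity.end are indices into text.words.
    spans = [(entity.tag, entity.start, entity.end) for entity in text.entities]
    words = [str(word) for word in text.words]
    return pd.DataFrame(tag_tokens(words, spans), columns=["token", "tag"])


# Demo with the tokens and entities shown in the output above.
words = [",", "which", "was", "equalled", "five", "days", "ago", "by",
         "South", "Africa", "in", "the", "victory", "over",
         "West", "Indies", "in", "Sydney", "."]
spans = [("I-LOC", 8, 10), ("I-ORG", 14, 16), ("I-LOC", 17, 18)]
for token, tag in tag_tokens(words, spans):
    print(token, tag)
```

In the PyCharm project, `df = polyglot_token_frame(blob)` would then give a two-column frame with one row per token. Because the tags come from the sentence-level entities, this avoids the single-word mis-tagging seen in the last loop above.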