How to find which strings (from a big list of strings) appear in a text in Python?
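I am trying to find out which names from a list appear in the text of a news article.
I have a large text file (about 100 MB) containing many place names. Each name is on its own line.
Part of the file: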
Brasiel
Brasier Gap
Brasier Tank
Brasiilia
Brasil
Brasil Colonial
The news text looks like this:
"It's thought the couple may have contracted the Covid-19 virus in the US or while travelling to Australia, according to Queensland Health officials.
Hanks is not the only celebrity to have tested positive for the virus. British actor Idris Elba also revealed last week he had tested positive."
For example, for this text the strings Australia and Queensland should be found.
I am using the NLTK library to create n-grams from the news text.
This is what I am doing:
from nltk.util import ngrams

# Read the place-name file (one name per line)
file = open("top-ord.txt", "r")
values = file.readlines()

news = "It's thought the couple may have contracted the Covid-19 virus in the US or while travelling to Australia, according to Queensland Health officials."

# ngrams_list is all ngrams from the news
for item in ngrams_list:
    # values is a list, so every lookup scans the whole list
    if item in values:
        print(item)
This is too slow. How can I improve it?
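Convert values to a set, like this:
value_set = {country for country in values}
This should speed things up significantly, because a set lookup runs in constant time on average, whereas a list lookup takes linear time.
Also, readlines() keeps the trailing "\n" on every line, so make sure to strip it when parsing the file; otherwise the n-grams will not match any entry.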
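Putting both suggestions together, a minimal sketch might look like the following. It is not the asker's exact pipeline: the helper names (load_place_names, word_ngrams, find_places) are hypothetical, the n-grams are built with a plain regular expression instead of nltk.util.ngrams so the example has no dependencies, and a short in-memory list stands in for top-ord.txt.

```python
import re


def load_place_names(lines):
    """Build a set of names, dropping trailing newlines and blank lines."""
    return {line.strip() for line in lines if line.strip()}


def word_ngrams(text, max_n):
    """Yield every run of 1..max_n consecutive words as one string."""
    # Assumption: a "word" is letters/digits plus apostrophes and hyphens,
    # so surrounding punctuation such as the comma in "Australia," is dropped.
    words = re.findall(r"[\w'-]+", text)
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def find_places(text, place_names, max_n=3):
    """Return the sorted n-grams of text that are present in place_names."""
    # Each membership test against a set is O(1) on average.
    return sorted({gram for gram in word_ngrams(text, max_n) if gram in place_names})


# Stand-in for the real file; with the real data one could pass the open
# file object instead, e.g. load_place_names(open("top-ord.txt")).
sample_lines = ["Brasil\n", "Brasier Gap\n", "Australia\n", "Queensland\n"]
place_names = load_place_names(sample_lines)

news = ("It's thought the couple may have contracted the Covid-19 virus in the US "
        "or while travelling to Australia, according to Queensland Health officials.")

print(find_places(news, place_names))  # ['Australia', 'Queensland']
```

The max_n value of 3 is an assumption; it should be at least the number of words in the longest place name you want to match, since a name like Brasier Gap is only found as a 2-gram.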