Find common words between two big unstructured text files
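I have two large unstructured text files that do not fit into memory. I want to find the words they have in common.
What would be the most efficient way to do this, in both time and space?
Thanks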
I used these two files:
pi_poem
Now I will a rhyme construct
By chosen words the young instruct
I do not like green eggs and ham
I do not like them Sam I am
pi_prose
The thing I like best about pi is the magic it does with circles.
Even young kids can have fun with the simple integer approximations.
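The code is simple. The first loop reads the first file line by line and adds each word to a lexicon set. The second loop reads the second file; every word that it finds in the first file's lexicon goes into a set of common words.
Does this do what you need? You will have to adapt it for punctuation, and you will probably want to remove the extra print statements once you have made your changes.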
# Build the lexicon: every distinct word in the first file.
lexicon = set()
with open("pi_poem", 'r') as text:
    # Iterate over the file object so only one line is in memory at a time.
    for line in text:
        for word in line.split():
            lexicon.add(word)
print(lexicon)

# Collect every word of the second file that is also in the lexicon.
common = set()
with open("pi_prose", 'r') as text:
    for line in text:
        for word in line.split():
            if word in lexicon:
                common.add(word)
print(common)
Output (Python 2 formatting; element order is arbitrary):
set(['and', 'am', 'instruct', 'ham', 'chosen', 'young', 'construct', 'Now', 'By', 'do', 'them', 'I', 'eggs', 'rhyme', 'words', 'not', 'a', 'like', 'Sam', 'will', 'green', 'the'])
set(['I', 'the', 'like', 'young'])
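The output above is case-sensitive and keeps punctuation attached to words, so "The" and "circles." are treated as distinct tokens. One possible way to adapt the approach is sketched below. The names `WORD_RE`, `iter_words`, and `common_words` are hypothetical helpers made up for this illustration, and the rule "a word is a run of letters, digits, or apostrophes, compared in lowercase" is an assumption you may want to change for your data.

```python
import re

# Assumed tokenization rule: a word is a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def iter_words(path):
    """Yield lowercased words from a file, reading one line at a time."""
    with open(path, "r") as text:
        for line in text:
            for word in WORD_RE.findall(line):
                yield word.lower()


def common_words(path_a, path_b):
    """Return the set of words that occur in both files."""
    # Only the distinct words of the first file are held in memory.
    lexicon = set(iter_words(path_a))
    # Stream the second file and keep the words already seen in the first.
    return {word for word in iter_words(path_b) if word in lexicon}
```

With the two sample files, `common_words("pi_poem", "pi_prose")` should return the same four words as above, in lowercase: `i`, `like`, `the`, and `young`. Memory use grows with the number of distinct words rather than with file size, so passing the file with the smaller vocabulary as `path_a` is likely the cheaper choice.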