Find repeated sentences within text
I would like to know how to find repetition within the same sentence.
I have a list of sentences like this:
my_list=["do you want pizza for dinner? Do you want pizza for dinner?", "I like pizza", "I have no money I have no money"]
I would like to create a pandas DataFrame where I assign 1 if the text is repeated within the same sentence, and 0 otherwise.
Like this:
Text                                                          Repeated?
do you want pizza for dinner? Do you want pizza for dinner?   1
I like pizza                                                  0
I have no money I have no money                               1
I was thinking of something like this:
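from collections import Counter

# A list has no .split(), so count the words of each sentence separately
for sentence in my_list:
    counts = Counter(sentence.lower().split())
    for word in sorted(counts):
        print('"' + word + '" is repeated ' + str(counts[word]) + ' time.')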
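and then counting how many words the sentence has in total and how many of them are unique. But I don't think this is a good way to code it.
Is there another way to get the expected result?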
You can use a regular expression for the task (regex101):
import re
import pandas as pd
my_list=["do you want pizza for dinner? Do you want pizza for dinner?", "I like pizza", "I have no money I have no money"]
df = pd.DataFrame({'Text': my_list})
# \1 is a backreference: the captured text must occur again, so only repeated strings match
r = re.compile(r'(.+)\s*\1\s*$', flags=re.I)
df['Repeated'] = df['Text'].apply(lambda x: bool(r.match(x))).astype(int)
print(df)
Prints:
                                                Text  Repeated
0  do you want pizza for dinner? Do you want pizz...         1
1                                       I like pizza         0
2                    I have no money I have no money         1
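To see what the regex approach is actually checking, here is a minimal sketch that applies a backreference pattern to plain strings, without pandas. The helper name `is_repeated`, the constant `REPEAT_RE`, and the fourth sample sentence are made up for illustration, and the pattern is assumed to be `(.+)\s*\1\s*$`.

```python
import re

# Assumed pattern, piece by piece:
# (.+)  captures a leading chunk of text,
# \s*   allows optional whitespace after it,
# \1    requires the captured chunk to appear again (case-insensitive because of re.I),
# \s*$  allows only trailing whitespace before the end of the string.
REPEAT_RE = re.compile(r'(.+)\s*\1\s*$', flags=re.I)


def is_repeated(text: str) -> int:
    """Return 1 if the whole string is a chunk followed by a copy of itself, else 0."""
    # match() anchors at the start, and $ anchors at the end,
    # so the two copies must together cover the entire string.
    return int(bool(REPEAT_RE.match(text)))


samples = [
    "do you want pizza for dinner? Do you want pizza for dinner?",
    "I like pizza",
    "I have no money I have no money",
    "I like pizza, and I like pizza a lot",  # hypothetical extra case
]
results = [is_repeated(s) for s in samples]
print(results)
```

Note that this only flags strings that consist of one chunk written twice in full. The last sample contains a repeated phrase but has extra words around it, so it is not flagged. Whether that is the desired behaviour depends on what "repeated" should mean for your data.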