Finding similar phrases

How can I find similar phrases within a large collection of phrases (e.g. tweets or movie reviews)?

For example, 'I like chocolate' is similar to 'I like chocolate bar' and 'I like mango', and likewise 'I ate apple' is similar to 'I ate apples'.

Sample data:
import pandas as pd

data = {'Text':  ['I like chocolate',
                  'I like chocolate bar',
                  'I ate apple',
                  'I ate apples',
                  'I like mango',
                  'I can swim']  
        }

df = pd.DataFrame(data, columns=['Text'])

I tried soundex from jellyfish, expecting similar phrases to produce the same output:

import jellyfish
df.Text.map(jellyfish.soundex)
0    I422
1    I422
2    I314
3    I314
4    I425
5    I252
Name: Text, dtype: object

To find the similarity between sentences/phrases, a simple technique is to use a bag-of-words model and apply a similarity measure such as cosine similarity. This can be improved further by using word vectors.
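As a rough illustration of the bag-of-words idea, here is a minimal sketch using only the standard library. The helper names `bag_of_words` and `cosine_similarity` are made up for this example, and splitting on whitespace is an assumed, deliberately naive tokenizer; a real pipeline would more likely use a vectorizer from an NLP library.

```python
import math
from collections import Counter

texts = ['I like chocolate',
         'I like chocolate bar',
         'I ate apple',
         'I ate apples',
         'I like mango',
         'I can swim']

def bag_of_words(text):
    # Hypothetical helper: lowercase, split on whitespace, count tokens.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Hypothetical helper: cosine of the angle between two count vectors.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vectors = [bag_of_words(t) for t in texts]

# Pairwise similarity matrix; entry [i][j] compares texts[i] with texts[j].
similarity = [[cosine_similarity(u, v) for v in vectors] for u in vectors]

for row in similarity:
    print(" ".join(f"{score:.2f}" for score in row))
```

Because tokens are compared as whole words, 'apple' and 'apples' count as different words here, so 'I ate apple' and 'I ate apples' only share 'i' and 'ate'. Stemming the tokens, or switching to word vectors as suggested above, would likely close that gap.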

From the fuzzywuzzy package, use extractWithoutOrder, the unsorted version of extract, to compute the similarity between strings:

# pip install fuzzywuzzy
# conda install -c conda-forge fuzzywuzzy 
from fuzzywuzzy.process import extractWithoutOrder as extract
from operator import itemgetter

ratio = df["Text"].apply(lambda s: list(map(itemgetter(1), extract(s, df["Text"]))))
out = pd.DataFrame(ratio.tolist(), index=df.index, columns=df.index)
>>> out
     0    1    2    3    4    5
0  100   95   44   43   64   86
1   95  100   86   86   86   86
2   44   86  100   96   49   38
3   43   86   96  100   48   45
4   64   86   49   48  100   36
5   86   86   38   45   36  100
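To go from the score matrix to actual groups of similar phrases, one option is to keep only the pairs whose score clears a cutoff. The sketch below hard-codes the scores from the `out` matrix above so that it runs without fuzzywuzzy; the cutoff of 90 is an assumption chosen for this sample, not a recommended value, and would need tuning on real data.

```python
texts = ['I like chocolate',
         'I like chocolate bar',
         'I ate apple',
         'I ate apples',
         'I like mango',
         'I can swim']

# Scores copied from the `out` matrix above.
scores = [
    [100,  95,  44,  43,  64,  86],
    [ 95, 100,  86,  86,  86,  86],
    [ 44,  86, 100,  96,  49,  38],
    [ 43,  86,  96, 100,  48,  45],
    [ 64,  86,  49,  48, 100,  36],
    [ 86,  86,  38,  45,  36, 100],
]

THRESHOLD = 90  # assumed cutoff for "similar"; tune for your data

# Only look above the diagonal so each pair is reported once
# and a phrase is never paired with itself.
similar_pairs = [
    (texts[i], texts[j], scores[i][j])
    for i in range(len(texts))
    for j in range(i + 1, len(texts))
    if scores[i][j] >= THRESHOLD
]

for a, b, score in similar_pairs:
    print(f"{a!r} ~ {b!r} (score {score})")
```

With this cutoff, 'I like mango' is not paired with 'I like chocolate' (score 64), so a lower threshold or a different scorer would be needed if those should count as similar.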