Finding similar phrases
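How can I find similar phrases in a large collection of phrases (e.g. tweets or movie reviews)?
For example, 'I like chocolate' is similar to 'I like chocolate bar' and 'I like mango', and 'I ate apple' is the same as 'I ate apples'.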
import pandas as pd

data = {'Text': ['I like chocolate',
                 'I like chocolate bar',
                 'I ate apple',
                 'I ate apples',
                 'I like mango',
                 'I can swim']}
df = pd.DataFrame(data, columns=['Text'])
I tried soundex from jellyfish, expecting similar phrases to produce the same output:
import jellyfish
df.Text.map(jellyfish.soundex)
0 I422
1 I422
2 I314
3 I314
4 I425
5 I252
Name: Text, dtype: object
To find the similarity between sentences or phrases, a simple technique is to use a bag-of-words model and apply a similarity measure such as cosine similarity. This can be improved further by using word vectors.
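A minimal sketch of the bag-of-words plus cosine similarity idea, written with the standard library only so it runs without extra packages. The helper names `bag_of_words` and `cosine_similarity` and the whitespace tokenization are assumptions made for illustration, not part of the answer above.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; each phrase becomes a word-count vector.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Dot product over shared words divided by the product of vector lengths.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

phrases = ['I like chocolate',
           'I like chocolate bar',
           'I ate apple',
           'I ate apples',
           'I like mango',
           'I can swim']

bags = [bag_of_words(p) for p in phrases]
# similarity[i][j] is the cosine similarity between phrase i and phrase j.
similarity = [[cosine_similarity(a, b) for b in bags] for a in bags]

for i, row in enumerate(similarity):
    print(i, [round(value, 2) for value in row])
```

With plain word counts, 'apple' and 'apples' are different tokens, so that pair only scores on the shared words 'I' and 'ate'. Stemming the tokens or switching to word vectors, as the answer suggests, would likely bring such pairs closer.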
From the fuzzywuzzy package, use extractWithoutOrder, the unsorted version of extract, to find the similarity between strings:
# pip install fuzzywuzzy
# conda install -c conda-forge fuzzywuzzy
from fuzzywuzzy.process import extractWithoutOrder as extract
from operator import itemgetter

# For each phrase, score it against every phrase in the column and keep only the score
ratio = df["Text"].apply(lambda s: list(map(itemgetter(1), extract(s, df["Text"]))))
# Build a square matrix: rows and columns are the row labels of df
out = pd.DataFrame(ratio.tolist(), index=df.index, columns=df.index)
>>> out
     0    1    2    3    4    5
0  100   95   44   43   64   86
1   95  100   86   86   86   86
2   44   86  100   96   49   38
3   43   86   96  100   48   45
4   64   86   49   48  100   36
5   86   86   38   45   36  100
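One way to read the matrix above is to keep only the pairs whose score passes a cutoff. The sketch below hard-codes the scores shown in the output so that it runs without pandas or fuzzywuzzy. The cutoff of 90 and the name `similar_pairs` are assumptions chosen for illustration, and a real dataset would need its own threshold.

```python
phrases = ['I like chocolate',
           'I like chocolate bar',
           'I ate apple',
           'I ate apples',
           'I like mango',
           'I can swim']

# Scores copied from the matrix shown above (row i, column j).
scores = [
    [100, 95, 44, 43, 64, 86],
    [95, 100, 86, 86, 86, 86],
    [44, 86, 100, 96, 49, 38],
    [43, 86, 96, 100, 48, 45],
    [64, 86, 49, 48, 100, 36],
    [86, 86, 38, 45, 36, 100],
]

THRESHOLD = 90  # assumed cutoff, not taken from the answer

# Walk the upper triangle only, so each pair is reported once
# and a phrase is never compared with itself.
similar_pairs = [
    (phrases[i], phrases[j], scores[i][j])
    for i in range(len(phrases))
    for j in range(i + 1, len(phrases))
    if scores[i][j] >= THRESHOLD
]

for first, second, score in similar_pairs:
    print(f"{first!r} ~ {second!r} (score {score})")
```

With this cutoff the sketch keeps the 'chocolate' pair and the 'apple' pair. 'I like mango' scores 64 against 'I like chocolate', so catching it would need a lower threshold, which also lets in the many pairs that score 86.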