How to check term similarity within a pandas column with similarity.jarowinkler
I need to check whether two or more words in a list are similar.
To do this, I am using the Jaro-Winkler similarity as follows:
from similarity.jarowinkler import JaroWinkler
word1='sweet chili'
word2='sriracha chilli'
jarowinkler = JaroWinkler()
print(jarowinkler.similarity(word1, word2))
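It seems able to detect the similarity between words, but I need to set a threshold so that only words with a similarity above 80% are selected.
My difficulty, however, is checking all the words in a dataframe column: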
Words
sweet chili
sriracha chilli
tomato
mayonnaise
water
milk
still water
sparkling water
wine
chicken
beef
...
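What I would like to do is:
- start from the first element and check its similarity against every other element; if the similarity is greater than the threshold (80%), save it in a new array;
- check the second element (sriracha chilli) in the same way;
- and so on.
Can you tell me how to run a loop like this?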
- Based on the given data
- Uses the strsim package
- If the real dataframe has many columns, consider making a dataframe that contains only the Words column
new_df = pd.DataFrame({'Words': df.Words})
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from similarity.jarowinkler import JaroWinkler
import numpy as np
df = pd.DataFrame({'Words': ['sweet chili', 'sriracha chilli', 'tomato', 'mayonnaise ', 'water', 'milk', 'still water', 'sparkling water', 'wine', 'chicken ', 'beef']})
# call similarity method
jarowinkler = JaroWinkler()
# remove whitespace
df.Words = df.Words.str.strip()
# create column of matching values for each word
words = df.Words.tolist()
for word in words:
    df[word] = df.Words.apply(lambda x: jarowinkler.similarity(x, word))
| | Words | sweet chili | sriracha chilli | tomato | mayonnaise | water | milk | still water | sparkling water | wine | chicken | beef |
|---:|:----------------|--------------:|------------------:|---------:|-------------:|---------:|---------:|--------------:|------------------:|---------:|----------:|---------:|
| 0 | sweet chili | 1 | 0.605772 | 0.419192 | 0.39697 | 0.513131 | 0 | 0.515152 | 0.460101 | 0.560606 | 0.322511 | 0.560606 |
| 1 | sriracha chilli | 0.605772 | 1 | 0.411111 | 0.388889 | 0.344444 | 0.438889 | 0.460101 | 0.488889 | 0.438889 | 0.529365 | 0 |
| 2 | tomato | 0.419192 | 0.411111 | 1 | 0.488889 | 0.411111 | 0.472222 | 0.590909 | 0.411111 | 0 | 0 | 0 |
| 3 | mayonnaise | 0.39697 | 0.388889 | 0.488889 | 1 | 0.433333 | 0.45 | 0.460606 | 0.544444 | 0.45 | 0.328571 | 0 |
| 4 | water | 0.513131 | 0.344444 | 0.411111 | 0.433333 | 1 | 0 | 0.430303 | 0.511111 | 0.633333 | 0.447619 | 0.483333 |
| 5 | milk | 0 | 0.438889 | 0.472222 | 0.45 | 0 | 1 | 0.560606 | 0.538889 | 0.5 | 0.595238 | 0 |
| 6 | still water | 0.515152 | 0.460101 | 0.590909 | 0.460606 | 0.430303 | 0.560606 | 1 | 0.749854 | 0.44697 | 0.489177 | 0 |
| 7 | sparkling water | 0.460101 | 0.488889 | 0.411111 | 0.544444 | 0.511111 | 0.538889 | 0.749854 | 1 | 0.544444 | 0.431746 | 0 |
| 8 | wine | 0.560606 | 0.438889 | 0 | 0.45 | 0.633333 | 0.5 | 0.44697 | 0.544444 | 1 | 0.595238 | 0.5 |
| 9 | chicken | 0.322511 | 0.529365 | 0 | 0.328571 | 0.447619 | 0.595238 | 0.489177 | 0.431746 | 0.595238 | 1 | 0 |
| 10 | beef | 0.560606 | 0 | 0 | 0 | 0.483333 | 0 | 0 | 0 | 0.5 | 0 | 1 |
View the values greater than 80%
- None, apart from the exact matches (each word compared with itself)
df.set_index('Words', inplace=True)
np.where(df[words] > 0.8, df[words], np.nan)
array([[ 1., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, 1., nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, 1., nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, 1., nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, 1., nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, 1., nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, 1., nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, 1., nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, 1., nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, 1., nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 1.]])
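For the loop the question describes, a minimal sketch could look like the following. The helper name `similar_words` is hypothetical, and so that the snippet runs without extra packages it uses a stand-in scorer built on `difflib.SequenceMatcher` from the standard library. That scorer is not Jaro-Winkler; `JaroWinkler().similarity` can be passed in its place. The entry `sweet chilli` is added to the sample list purely for illustration, so that at least one pair clears the threshold.

```python
from difflib import SequenceMatcher


def similar_words(words, similarity, threshold=0.8):
    """Map each word to the other words scoring above `threshold`."""
    matches = {}
    for i, word in enumerate(words):
        matches[word] = [
            other
            for j, other in enumerate(words)
            if j != i and similarity(word, other) > threshold  # skip self-match
        ]
    return matches


def stand_in_score(a, b):
    # Assumption: stdlib stand-in, NOT Jaro-Winkler.
    # Swap in JaroWinkler().similarity to use the strsim scores.
    return SequenceMatcher(None, a, b).ratio()


# 'sweet chilli' is a made-up near-duplicate added for illustration.
words = ['sweet chili', 'sweet chilli', 'tomato', 'water', 'still water']
matches = similar_words(words, stand_in_score, threshold=0.8)

for word, similar in matches.items():
    print(word, '->', similar)
```

Each word is compared with every other word, so the work grows quadratically with the length of the column; for a long column it may be worth de-duplicating the words first.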
Add a heatmap
# mask the upper triangle (including the diagonal) so each pair is drawn once
mask = np.zeros_like(df[words], dtype=bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(df[words], mask=mask, square=True, cmap="YlGnBu")
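The same triangle idea can also turn a similarity matrix into a flat list of pairs above the threshold. The sketch below is only one possible approach: it builds the matrix as a separate DataFrame instead of adding columns to `df`, and it again assumes a standard-library stand-in scorer (not Jaro-Winkler) and a made-up `sweet chilli` entry so that one pair qualifies.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def stand_in_score(a, b):
    # Assumption: stdlib stand-in, NOT Jaro-Winkler.
    # Swap in JaroWinkler().similarity to use the strsim scores.
    return SequenceMatcher(None, a, b).ratio()


# 'sweet chilli' is a made-up near-duplicate added for illustration.
words = ['sweet chili', 'sweet chilli', 'tomato', 'water']

# Square similarity matrix labelled by word on both axes.
matrix = pd.DataFrame(
    [[stand_in_score(a, b) for b in words] for a in words],
    index=words,
    columns=words,
)

# Keep only cells strictly above the diagonal: each pair appears once
# and the self-matches (score 1.0) are excluded. Other cells become NaN.
upper = np.triu(np.ones(matrix.shape, dtype=bool), k=1)
pairs = matrix.where(upper).stack()

# NaN compares False, so any masked cells still present drop out here.
matches = pairs[pairs > 0.8]
print(matches)
```

The result is a Series indexed by (word, other word), which can be passed to `reset_index()` if a regular table of pairs is easier to work with.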