Python fuzzy string matching as correlation style table/matrix
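I have a file with x string names and their associated IDs. Essentially two columns of data.
What I would like is a correlation-style table of shape x by x (with the data in question on both the x-axis and the y-axis), but instead of the correlation I would like the output of the fuzzywuzzy library's function fuzz.ratio(x,y), using the string names as input. Essentially, running every entry against every entry.
This is roughly what I had in mind, just to show my intent: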
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.read_csv('random_data_file.csv')
df = df[['ID','String']]
df['String_Dup'] = df['String'] #creating duplicate of data in question
df = df.set_index('ID')
df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())
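But clearly this approach is not working for me at the moment. Any help is appreciated. It doesn't have to be pandas; it is just the environment I am more familiar with.
I hope my problem is clearly worded, and really, any input is appreciated.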
import csv
from fuzzywuzzy import fuzz
import numpy as np

# Read the CSV row by row; each row is a dict keyed by the header names.
with open('random_data_file.csv', newline='') as f:
    string = [row["String"] for row in csv.DictReader(f)]
# now you have a list of the string values
length = len(string)
resultMat = np.zeros((length, length))  # zeros 2D matrix, with size X * X
for i in range(length):
    for j in range(length):
        resultMat[i][j] = fuzz.ratio(string[i], string[j])
print(resultMat)
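I implemented it with a NumPy 2D matrix. I am not that good with pandas, but I think what you were doing was adding another column and comparing it with the String column, which means String[i] would be matched with String_Dup[i], so every result would be 100.
Hope it helps.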
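The matrix above carries no row or column labels. As a minimal sketch, the same all-pairs scores can be wrapped in a pandas DataFrame so the result reads like a correlation table. The names `similarity` and `fuzzy_table` are hypothetical, and `similarity` is only a standard-library stand-in so the block runs without fuzzywuzzy installed; pass `fuzz.ratio` as the scorer to get the scores this thread is about.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    """Stand-in scorer (assumption): 0-100 similarity from the standard library.

    Replace with fuzzywuzzy's fuzz.ratio to get the scores discussed above.
    """
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def fuzzy_table(strings, scorer=similarity):
    """Score every string against every string and label both axes."""
    scores = [[scorer(a, b) for b in strings] for a in strings]
    return pd.DataFrame(scores, index=strings, columns=strings)


names = ['abracadabra', 'abc', 'cadra', 'brabra']
table = fuzzy_table(names)
print(table)
```

With real data, `names` would be the list read from the String column, and the call would become `fuzzy_table(names, scorer=fuzz.ratio)`.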
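In pandas, the cartesian cross product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is then applied using apply. A final pivot operation extracts the format you had in mind. For simplicity, I omitted the groupby operation, but of course you could apply the procedure to all group tables by moving the code below into a separate function.
Here is what this might look like: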
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
columns=['id', 'strings'])
# Cross product, using a temporary column.
df['_tmp'] = 0
mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])
# Apply the function between the two strings.
mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)
# Reorganize data.
ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
ret.index.name = None
ret.columns.name = None
# This results in the following:
#              abc  abracadabra  brabra  cadra
# abc          100           43      44     25
# abracadabra   43          100      71     62
# brabra        44           71     100     55
# cadra         25           62      55    100
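On pandas 1.2 or later, the temporary column is not needed, because pd.merge accepts how='cross'. The following is a sketch under that version assumption, not part of the original answer. `similarity` is a hypothetical standard-library stand-in so the block runs without fuzzywuzzy; swap in `fuzz.ratio` for the real scores.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    # Stand-in scorer (assumption); swap in fuzz.ratio from fuzzywuzzy.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


df = pd.DataFrame([(1, 'abracadabra'), (2, 'abc'), (3, 'cadra'), (4, 'brabra')],
                  columns=['id', 'strings'])

# how='cross' (pandas 1.2+) pairs every row with every row, no helper column.
pairs = pd.merge(df, df, how='cross', suffixes=('_1', '_2'))
pairs['fuzz'] = [similarity(a, b)
                 for a, b in zip(pairs['strings_1'], pairs['strings_2'])]

# Reshape the long list of pairs into the square table.
ret = pairs.pivot(index='strings_1', columns='strings_2', values='fuzz')
print(ret)
```

Note that pivot raises an error when the same string occurs more than once, so duplicated strings would need to be dropped first or the id columns used as the index instead.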
Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
This is considerably more elegant than my first answer.
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
columns=['id', 'strings'])
# Create the cartesian product between the strings column with itself.
ct = pd.crosstab(df['strings'], df['strings'])
# Note: for pandas versions <0.22, the two series must have different names.
# In case you observe a "Level XX not found" error, the following may help:
# ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())
# Apply the fuzz (column-wise). Argument col has type pd.Series.
ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
# This results in the following:
# strings      abc  abracadabra  brabra  cadra
# strings
# abc          100           43      44     25
# abracadabra   43          100      71     62
# brabra        44           71     100     55
# cadra         25           62      55    100
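For simplicity, I omitted the groupby operation suggested in your question. If you need to apply the fuzzy string matching on groups, simply create a separate function: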
def cross_fuzz(df):
    ct = pd.crosstab(df['strings'], df['strings'])
    ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
    return ct

df.groupby('id').apply(cross_fuzz)
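When groups hold different strings, one way to keep the per-group tables apart is to collect them in a dictionary instead of combining them. This is only a sketch: the `group` column and the helper names `similarity` and `cross_table` are hypothetical (the sample data in this thread only has id and strings), and `similarity` is a standard-library stand-in for `fuzz.ratio`.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    # Stand-in scorer (assumption); swap in fuzz.ratio from fuzzywuzzy.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def cross_table(strings):
    """All-pairs score table for one group's strings."""
    strings = list(strings)
    return pd.DataFrame([[similarity(a, b) for b in strings] for a in strings],
                        index=strings, columns=strings)


# 'group' is a hypothetical column; the original data only has id and strings.
df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'strings': ['abracadabra', 'abc', 'cadra', 'brabra']})

# One table per group, kept apart so differing labels never get mixed.
tables = {key: cross_table(part['strings'])
          for key, part in df.groupby('group')}
print(tables['a'])
```

Each entry of `tables` is then a small square table for that group alone.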