Wrong value in text comparison
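I am having some difficulty finding text matches in the dataset below (note that Sim is my current output, generated by running the code below; it shows the wrong matches).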
ID Text Sim
13 fsad amazing ... fsd
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️... fdsfdgte3e
18 gsd wonderful fast
21 dfsfs i love this its incredible ... reds
23 gwe wonderful end ever seen you ... add
... ... ... ...
261 add wonderful gwe
261 add wonderful gsd
261 add wonderful fdsdf
267 fdsfdgte3e best match ever its a masterpiece fdsdf
277 hgdfgre terrible destroys everything ... tm28
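As shown above, Sim does not give the ID of the user who wrote the matching text. For example, add should match gsd, and vice versa. But my output says that add matches with gwe, and this is not true.
The code I am using is as follows: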
from fuzzywuzzy import fuzz

def sim (nm, df): # this function finds matches between texts based on a threshold, which is 100. The logic is fuzzywuzzy, specifically partial ratio. The output should be IDs whether texts match, based on the threshold.
    matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]

df['L_Text']=df['Text'].str.lower()
df['Sim']=df.apply(lambda row: sim(row['L_Text'], df), axis=1)
df=df.assign(
    Sim = df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
)

def tr (row): # this function assign a similarity score for each text applying partial_ratio similarity
    return (df.loc[:row.name-1, 'L_Text']
            .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))

t = (df.loc[1:].apply(tr, axis=1)
     .reindex(index=df.index,
              columns=df.index)
     .fillna(0)
     .add_prefix('txt')
     )
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
Can you help me understand the error in my code? Unfortunately, I cannot see it.
My expected output would be as follows:
ID Text Sim
13 fsad amazing ...
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️...
18 gsd wonderful add
21 dfsfs i love this its incredible ...
23 gwe wonderful end ever seen you ...
... ... ... ...
261 add wonderful gsd
261 add wonderful gsd
261 add wonderful gsd
267 fdsfdgte3e best match ever its a masterpiece
277 hgdfgre terrible destroys everything ...
because a perfect match (=1) is set in the sim function.
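Initial assumptions
First, since your question is not 100% clear to me, I assume that you want a pairwise comparison of all rows and that, whenever the match score reaches 100, you want to add the key of the matching row. Please correct me if that is not the case.
Syntax problems
Your code above has a number of problems. First, if one simply copies and pastes it, it cannot run because it is not syntactically valid. The sim() function should look like this: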
def sim (nm, df):
    # True for every row whose text scores a perfect partial match against nm
    matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]
Note df instead of dataset, and == instead of =. I also removed the redundant parentheses for readability.
Semantic problems
If I then run your code and print t (which does not seem to be the final result), it gives me the following:
txt0 txt1 txt2 txt3 txt4 txt5 txt6 txt7 txt8 txt9
0 1.0 27.0 12.0 45.0 45.0 12.0 12.0 12.0 27.0 64.0
1 27.0 1.0 33.0 33.0 42.0 33.0 33.0 33.0 52.0 44.0
2 12.0 33.0 1.0 22.0 100.0 100.0 100.0 100.0 22.0 33.0
3 45.0 33.0 22.0 1.0 41.0 22.0 22.0 22.0 40.0 30.0
4 45.0 42.0 100.0 41.0 1.0 100.0 100.0 100.0 35.0 47.0
5 12.0 33.0 100.0 22.0 100.0 1.0 100.0 100.0 22.0 33.0
6 12.0 33.0 100.0 22.0 100.0 100.0 1.0 100.0 22.0 33.0
7 12.0 33.0 100.0 22.0 100.0 100.0 100.0 1.0 22.0 33.0
8 27.0 52.0 22.0 40.0 35.0 22.0 22.0 22.0 1.0 34.0
9 64.0 44.0 33.0 30.0 47.0 33.0 33.0 33.0 34.0 1.0
This seems correct to me, because fuzz.partial_ratio("wonderful end ever seen you", "wonderful") returns 100 (a partial match already counts as a score of 100).
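To illustrate why a partial match can already score 100, here is a minimal sketch built on the standard library's difflib. It is a simplified stand-in, not fuzzywuzzy's actual implementation: the helper names simple_ratio and simple_partial_ratio are hypothetical, and the sliding-window search is an assumption made for clarity.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a: str, b: str) -> int:
    """Best whole-string score of the shorter text against every
    equally long window of the longer text."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    n = len(shorter)
    return max(
        simple_ratio(shorter, longer[i:i + n])
        for i in range(len(longer) - n + 1)
    )


long_text = "wonderful end ever seen you ..."
short_text = "wonderful"

# The short text appears verbatim inside the long one, so the partial score is 100
print(simple_partial_ratio(long_text, short_text))
# The whole-string score also accounts for the unmatched remainder, so it is lower
print(simple_ratio(long_text, short_text))
```

With a threshold of 100, the partial variant therefore treats "wonderful" and "wonderful end ever seen you ..." as a match, whereas a whole-string comparison does not.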
For consistency, you could change
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
to
t += t.to_numpy().T + np.diag(np.ones(t.shape[0])) * 100
since every element should match itself perfectly. So when you say
But my output says that add matches with gwe and this is not true.
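this is correct in the sense of fuzz.partial_ratio(); you may want to consider using fuzz.ratio() instead. In addition, something may be going wrong when t is converted into the new Sim column, but that code does not appear in the example provided.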
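Since the conversion from t to Sim is not shown, the following is only a sketch of one way it could be done, written in plain Python so that it does not depend on the missing code. The function name sim_from_scores is hypothetical; the scores are the entries for rows 2 to 5 of the matrix printed above, with the diagonal set to 100 as suggested.

```python
def sim_from_scores(ids, scores, threshold=100):
    """Map each ID to the sorted list of other IDs whose score reaches the threshold.

    ids    -- one ID per row, in the same order as the score matrix
    scores -- square, symmetric matrix of pairwise scores (list of lists)
    """
    sim = {}
    for i, row_id in enumerate(ids):
        matched = sim.setdefault(row_id, set())
        for j, score in enumerate(scores[i]):
            # Skip the row itself and any other row written by the same ID
            if i != j and ids[j] != row_id and score >= threshold:
                matched.add(ids[j])
    return {row_id: sorted(matched) for row_id, matched in sim.items()}


# Rows 2, 3, 4 and 5 of the matrix above
ids = ["gsd", "dfsfs", "gwe", "add"]
scores = [
    [100, 22, 100, 100],
    [22, 100, 41, 22],
    [100, 41, 100, 100],
    [100, 22, 100, 100],
]

result = sim_from_scores(ids, scores)
print(result)
```

Note that gwe still shows up as a match for gsd and add here, because the scores come from partial_ratio.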
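Alternative implementation
Also, as some of the comments suggested, it sometimes helps to restructure the code so that people can help you more easily. Here is an example: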
import re
import pandas as pd
from fuzzywuzzy import fuzz

# Columns are separated by two or more spaces so that texts containing single spaces stay intact
data = """
13   fsad        amazing ...                                            fsd
14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...  fdsfdgte3e
18   gsd         wonderful                                              fast
21   dfsfs       i love this its incredible ...                         reds
23   gwe         wonderful end ever seen you ...                        add
261  add         wonderful                                              gwe
261  add         wonderful                                              gsd
261  add         wonderful                                              fdsdf
267  fdsfdgte3e  best match ever its a masterpiece                      fdsdf
277  hgdfgre     terrible destroys everything ...                       tm28
"""

rows = data.strip().split('\n')
records = [[element for element in re.split(r' {2,}', row) if element != ''] for row in rows]
df = pd.DataFrame.from_records(records, columns=['RowNumber', 'ID', 'Text', 'IncorrectSim'], index='RowNumber')
df = df.drop('IncorrectSim', axis=1)
df = df.drop_duplicates(subset=["ID", "Text"]) # Assuming that there is no point in keeping duplicate rows
df = df.set_index('ID') # Assuming that the "ID" column holds a unique ID

comparison_df = df.copy()
comparison_df['Text'] = comparison_df["Text"].str.lower()
comparison_df['Tmp'] = 1
# Merging on the constant 'Tmp' column gives us all possible row combinations
comparison_df = comparison_df.reset_index().merge(comparison_df.reset_index(), on='Tmp').drop('Tmp', axis=1)
comparison_df = comparison_df[comparison_df['ID_x'] != comparison_df['ID_y']] # Drop the pairs where a row is compared with itself
comparison_df['matchScore'] = comparison_df.apply(lambda row: fuzz.partial_ratio(row['Text_x'], row['Text_y']), axis=1)
comparison_df = comparison_df[comparison_df['matchScore'] == 100] # Only keep perfect matches
comparison_df = comparison_df[['ID_x', 'ID_y']].rename(columns={'ID_x': 'ID', 'ID_y': 'Sim'}).set_index('ID') # Cleanup

result = df.join(comparison_df, how='left').fillna('')
print(result.to_string())
which gives:
Text Sim
ID
add wonderful gsd
add wonderful gwe
dfsfs i love this its incredible ...
fdsdf best sport everand the gane of the year❤️❤️❤️❤...
fdsfdgte3e best match ever its a masterpiece
fsad amazing ...
gsd wonderful gwe
gsd wonderful add
gwe wonderful end ever seen you ... gsd
gwe wonderful end ever seen you ... add
hgdfgre terrible destroys everything ...
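The output above still pairs gwe with gsd and add, because partial_ratio is used. If the intent is that only identical texts should match, as in the expected output in the question, no fuzzy scoring is needed at all. The following is a minimal sketch under that assumption, in plain Python; the function name exact_matches is hypothetical and the rows are copied from the question.

```python
from collections import defaultdict

# (ID, Text) pairs taken from the question's dataset
rows = [
    ("fsad", "amazing ..."),
    ("fdsdf", "best sport everand the gane of the year❤️❤️❤️❤️..."),
    ("gsd", "wonderful"),
    ("dfsfs", "i love this its incredible ..."),
    ("gwe", "wonderful end ever seen you ..."),
    ("add", "wonderful"),
    ("add", "wonderful"),
    ("add", "wonderful"),
    ("fdsfdgte3e", "best match ever its a masterpiece"),
    ("hgdfgre", "terrible destroys everything ..."),
]


def exact_matches(rows):
    """Return (ID, Text, Sim) triples, where Sim lists the other IDs
    that wrote exactly the same text (ignoring case)."""
    ids_by_text = defaultdict(set)
    for row_id, text in rows:
        ids_by_text[text.lower()].add(row_id)
    return [
        (row_id, text, sorted(ids_by_text[text.lower()] - {row_id}))
        for row_id, text in rows
    ]


for row_id, text, sim in exact_matches(rows):
    print(row_id, text, ", ".join(sim), sep=" | ")
```

Here gsd and add are matched with each other, while gwe has no match because its text is longer than "wonderful".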