Python: fuzzywuzzy, only the first output is correct, all the others are NaN
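I have run into a very strange problem.
I have two DataFrames, and I need to match the strings of one against the strings of the other by similarity.
The target columns hold the names of TV programmes (program_name_1 and program_name_2).
To restrict the candidates to a limited set, I also use the 'Channel' column as a filter.
The function applies a fuzzy-matching algorithm and returns, for each element of program_name_1, the best match from program_name_2 together with the similarity score between them.
The really strange part is that the output is only valid for the first channel and fails for every following channel. The first column (scorer_test_1, which holds program_name_1) always prints correctly, but scorer_test_2 (which should print program_name_2) and the similarity column are NaN.
I have checked the DataFrames many times: I am sure the column names match the names in the lists, and the other channels contain all the data I need.
Strangest of all, the first channel and all the other channels live in the same DataFrame, so there is no difference between the channels' data.
Here is a toy dataset to illustrate the problem: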
import pandas as pd

# Named df_1 / df_2 to match the names used in the code further down.
data_1 = {'Channel': ['1', '1', '1', '2', '2', '2', '3', '4'],
          'program_name_1': ['party', 'animals', 'gucci', 'the simpson', 'cars', 'mathematics', 'bikes', 'chef']}
data_2 = {'Channel': ['1', '1', '1', '2', '2', '2', '3', '4'],
          'program_name_2': ['parties', 'gucci_gucci', 'animal', 'simpsons', 'math', 'the car', 'bike', 'cooking']}
df_1 = pd.DataFrame(data_1, columns=['Channel', 'program_name_1'])
df_2 = pd.DataFrame(data_2, columns=['Channel', 'program_name_2'])
This prints, for df_1:
Channel  program_name_1
1        party
1        animals
1        gucci
2        the simpson
2        cars
2        mathematics
3        bikes
4        chef
And for df_2:
Channel  program_name_2
1        parties
1        gucci_gucci
1        animal
2        simpsons
2        math
2        the car
3        bike
4        cooking
Here is the code:
import numpy as np
from fuzzywuzzy import process

scorer_test_1 = df_1.loc[df_1['Channel'] == '1', 'program_name_1']
scorer_test_2 = df_2.loc[df_2['Channel'] == '1', 'program_name_2']

# creation of a function for the score
# (scorer_dict maps a key such as 'R' to a scorer; it is defined elsewhere and not shown here)
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate on the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            ratio = process.extract(i, scorer_test_2, limit=5, scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)
    return my_df

print(scorer_tester_function('R').head())
This is the output I would like to get for every channel, but I only get it when I pass the first channel in the code:
Channel[1]:
program_name_1  program_name_2  similarity
party           parties         95
animals         animal          95
gucci           gucci_gucci     75
Channel[2]:
program_name_1  program_name_2  similarity
the simpson     simpsons        85
cars            the car         75
mathematics     math            70
This is the output I get when I request channel 2 or any later channel:
Code:
scorer_test_1 = df_1.loc[df_1['Channel'] == '2', 'program_name_1']
scorer_test_2 = df_2.loc[df_2['Channel'] == '2', 'program_name_2']
Output:
Channel  program_name_1  program_name_2  similarity
2        the simpson     NaN             NaN
2        cars            NaN             NaN
2        mathematics     NaN             NaN
I hope someone can help me :)
Thanks!
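This is caused by an index mismatch: the filtered Series keeps its original row labels, while pd.Series(matching_list) gets a fresh index starting at 0, so pandas finds no labels to align on for any channel but the first. Resetting the index after adding the first Series fixes the problem!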
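The mismatch can be reproduced in isolation, without fuzzywuzzy. The following is only a minimal sketch: the `matches` list stands in for the results of the fuzzy matching and is made up for illustration, and the variable names `selected`, `broken` and `fixed` are not from the question.

```python
import pandas as pd

# Toy data: filtering on channel '2' keeps the original row labels 3, 4, 5.
df = pd.DataFrame({
    "Channel": ["1", "1", "1", "2", "2", "2"],
    "program_name_1": ["party", "animals", "gucci",
                       "the simpson", "cars", "mathematics"],
})
selected = df.loc[df["Channel"] == "2", "program_name_1"]

# Hypothetical match results, standing in for the fuzzy-matching output.
matches = ["simpsons", "the car", "math"]

broken = pd.DataFrame()
broken["program_name_1"] = selected            # index is 3, 4, 5
broken["program_name_2"] = pd.Series(matches)  # index is 0, 1, 2: no shared labels
print(broken)                                  # program_name_2 is all NaN

fixed = pd.DataFrame()
fixed["program_name_1"] = selected
fixed["program_name_2"] = matches              # a plain list is assigned by position
print(fixed)
```

Channel '1' happens to work in the question only because its rows are labelled 0, 1, 2, which is also the default index of a new Series. In the function from the question, the same problem is fixed by resetting the index: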
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate on the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            # the custom scorer (scorer=scorer_dict[x]) is left out here
            ratio = process.extract(i, scorer_test_2, limit=5)
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    print(my_df.index)  # original row labels of the filtered rows
    # Reset to 0..n-1 so the new Series align; this keeps the old labels
    # in an extra 'index' column (pass drop=True to discard them).
    my_df.reset_index(inplace=True)
    print(my_df.index)  # now a fresh 0..n-1 index
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)
    return my_df
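Since the goal in the question is one result for all channels, the per-channel filtering can also be done in a loop. The sketch below is an assumption-laden illustration, not the asker's code: to stay runnable with the standard library it swaps fuzzywuzzy for `difflib.SequenceMatcher`, so its scores will differ from fuzzywuzzy's, and the helpers `similarity` and `best_match` are hypothetical names. Building the result from a list of row dictionaries sidesteps index alignment entirely.

```python
from difflib import SequenceMatcher

import pandas as pd

df_1 = pd.DataFrame({
    "Channel": ["1", "1", "1", "2", "2", "2", "3", "4"],
    "program_name_1": ["party", "animals", "gucci", "the simpson",
                       "cars", "mathematics", "bikes", "chef"],
})
df_2 = pd.DataFrame({
    "Channel": ["1", "1", "1", "2", "2", "2", "3", "4"],
    "program_name_2": ["parties", "gucci_gucci", "animal", "simpsons",
                       "math", "the car", "bike", "cooking"],
})


def similarity(a, b):
    """Stand-in scorer: difflib ratio scaled to 0-100 (not a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, candidates):
    """Return (best candidate, score), or (None, None) if nothing can be matched."""
    if pd.isnull(name) or not candidates:
        return None, None
    best = max(candidates, key=lambda c: similarity(name, c))
    return best, similarity(name, best)


rows = []
for channel, group in df_1.groupby("Channel"):
    # Candidates are limited to programmes on the same channel.
    candidates = df_2.loc[df_2["Channel"] == channel, "program_name_2"].dropna().tolist()
    for name in group["program_name_1"]:
        match, score = best_match(name, candidates)
        rows.append({"Channel": channel, "program_name_1": name,
                     "program_name_2": match, "similarity": score})

# A list of dicts gets a fresh 0..n-1 index, so no column ends up misaligned.
result = pd.DataFrame(rows)
print(result)
```

With fuzzywuzzy installed, `process.extractOne(name, candidates)` could presumably take the place of `best_match`, with the scorer passed through its `scorer` argument.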