Python: fuzzywuzzy, the output is correct for the first value, the others are NaN

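I have run into a very strange problem: I have two DataFrames, and I have to match the strings of one to the strings of the other by similarity. The target columns hold the names of television programs (program_name_1 and program_name_2). So that the matcher chooses from a limited set of data, I also use the 'Channel' column as a filter.

The function applies the fuzzy matching algorithm and returns, for each element of the program_name_1 column, its best match from program_name_2 together with the similarity score between them.

The really strange part is that the output is only valid for the first channel and does not work for any of the following channels. The first column, built from scorer_test_1 (program_name_1), is always correct, but the column built from scorer_test_2 (which should show program_name_2) and the similarity column are NaN.

I have done a lot of checks on the DataFrames: I am sure that the column names match the names in the lists, and that the other channels contain all the data I am asking for.

The strangest thing is that the first channel and all the other channels are in the same DataFrame, so there is no difference between the channels' data.

Here are some toy datasets to help you understand the problem better: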

import pandas as pd

# Toy data: the same channels in both frames, with slightly different program names
data_1 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_1': ['party','animals','gucci','the simpson', 'cars', 'mathematics', 'bikes', 'chef']}
data_2 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_2': ['parties','gucci_gucci','animal','simpsons', 'math', 'the car', 'bike', 'cooking']}
df_1 = pd.DataFrame(data_1, columns = ['Channel','program_name_1'])
df_2 = pd.DataFrame(data_2, columns = ['Channel','program_name_2'])

This prints, for df_1:

  Channel program_name_1
       1          party
       1        animals
       1          gucci
       2    the simpson
       2           cars
       2    mathematics
       3          bikes
       4           chef

For df_2:

  Channel program_name_2
       1        parties
       1    gucci_gucci
       1         animal
       2       simpsons
       2           math
       2        the car
       3           bike
       4        cooking

Here is the code:

import numpy as np
from fuzzywuzzy import process

scorer_test_1 = df_1.loc[(df_1['Channel'] == '1')]['program_name_1']
scorer_test_2 = df_2.loc[(df_2['Channel'] == '1')]['program_name_2']

# Build a table of best matches and similarity scores.
# x is a key into scorer_dict, which maps a short code to a fuzzywuzzy scorer
# (its definition is not shown here).
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate over the program names of the selected channel
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            # the results are sorted by score, so ratio[0] is the best match
            ratio = process.extract(i, scorer_test_2, limit=5, scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)

    return my_df

print(scorer_tester_function('R').head())

This is the output I want for every channel, but I only get it when I pass the first channel in the code:

Channel [1]:

  program_name_1 program_name_2 similarity
       party        parties         95
       animals      animal          95
       gucci        gucci_gucci     75

Channel [2]:

  program_name_1 program_name_2 similarity
   the simpson      simpsons        85
   cars             the car         75
   mathematics      math            70

This is the output I get when I request channel 2 or any later channel:

Code:

scorer_test_1 = df_1.loc[(df_1['Channel'] == '2')]['program_name_1']
scorer_test_2 = df_2.loc[(df_2['Channel'] == '2')]['program_name_2']

Output:

  Channel program_name_1 program_name_2 similarity
     2     the simpson        NaN           NaN
     2        cars            NaN           NaN
     2    mathematics         NaN           NaN

I hope someone can help me :)

Thanks!

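This is caused by an index mismatch. scorer_test_1 keeps the row labels of the DataFrame it was sliced from (3, 4, 5 for channel 2 in the toy data), while pd.Series(matching_list) gets a fresh 0, 1, 2 index. pandas aligns on index labels when a Series is assigned as a column, so for every channel except the first no label matches and the new columns are filled with NaN. Resetting the index after adding the first Series, as in the function further down, solves the problem!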
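To see the alignment effect in isolation, here is a minimal sketch that does not need fuzzywuzzy at all. The names `left`, `matches` and `out` are made up for illustration, and the labels 3, 4, 5 are assumed to mimic the channel 2 slice of the toy data.

```python
import pandas as pd

# Hypothetical stand-in for the channel 2 slice: it keeps the labels 3, 4, 5
# of the frame it was cut from.
left = pd.Series(['the simpson', 'cars', 'mathematics'], index=[3, 4, 5])
matches = ['simpsons', 'the car', 'math']

out = pd.DataFrame()
out['program_name_1'] = left            # the frame's index becomes 3, 4, 5

# A fresh Series is labelled 0, 1, 2, so no label lines up and every cell is NaN.
out['misaligned'] = pd.Series(matches)

# Reusing the same labels makes the values land in the right rows.
out['aligned'] = pd.Series(matches, index=left.index)

# A plain list has no labels, so it is assigned by position.
out['from_list'] = matches

print(out)
```

Assigning the plain list directly, or building the Series with `index=left.index`, would be alternatives to resetting the index; which one fits best depends on whether the original row labels are still needed afterwards.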

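def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate over the program names of the selected channel
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            # default scorer used here; scorer_dict from the question is not defined in this snippet
            ratio = process.extract(i, scorer_test_2, limit=5)  # , scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    print(my_df.index)  # labels inherited from df_1, e.g. 3, 4, 5 for channel 2
    # drop=True discards the old labels instead of keeping them as an extra 'index' column
    my_df.reset_index(drop=True, inplace=True)
    print(my_df.index)  # now 0, 1, 2, matching the new Series below
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)

    return my_df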
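
If the goal is to get the matches for every channel in one pass, one possible approach is to loop over the channels and collect the rows in a list, which avoids index alignment altogether. The sketch below is only an illustration under some assumptions: `best_match` and `match_all_channels` are hypothetical helper names, and `difflib.SequenceMatcher` from the standard library stands in for the fuzzywuzzy scorer so that the example runs without extra installs. Its scores will differ from the ones fuzzywuzzy reports.

```python
import difflib
import pandas as pd

df_1 = pd.DataFrame({
    'Channel': ['1', '1', '1', '2', '2', '2', '3', '4'],
    'program_name_1': ['party', 'animals', 'gucci', 'the simpson',
                       'cars', 'mathematics', 'bikes', 'chef'],
})
df_2 = pd.DataFrame({
    'Channel': ['1', '1', '1', '2', '2', '2', '3', '4'],
    'program_name_2': ['parties', 'gucci_gucci', 'animal', 'simpsons',
                       'math', 'the car', 'bike', 'cooking'],
})

def best_match(name, candidates):
    """Return (candidate, score on a 0-100 scale), or (None, None) if nothing can be matched."""
    if pd.isnull(name) or not candidates:
        return None, None
    # Stand-in scorer (assumption): difflib ratio instead of a fuzzywuzzy scorer.
    scored = [(c, round(100 * difflib.SequenceMatcher(None, name, c).ratio()))
              for c in candidates]
    return max(scored, key=lambda pair: pair[1])

def match_all_channels(left, right):
    rows = []
    for channel, group in left.groupby('Channel'):
        # Only names from the same channel are candidates.
        candidates = right.loc[right['Channel'] == channel, 'program_name_2'].dropna().tolist()
        for name in group['program_name_1']:
            match, score = best_match(name, candidates)
            rows.append({'Channel': channel, 'program_name_1': name,
                         'program_name_2': match, 'similarity': score})
    # Building the frame from a list of dicts gives it a fresh 0..n-1 index.
    return pd.DataFrame(rows)

result = match_all_channels(df_1, df_2)
print(result)
```

To use fuzzywuzzy here instead, the body of `best_match` could call `process.extractOne(name, candidates)`, which returns the best candidate and its score for a list of choices.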