Find similarity between two dataframes, row by row

Question

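I have two dataframes, df1 and df2, with the same columns. I would like to find similarities between these two datasets. I have been following one of two approaches. The first is to append one of the two dataframes to the other and select the duplicates: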

import pandas as pd

df = pd.concat([df1, df2], join='inner')
mask = df.Check.duplicated(keep=False)

df[mask]  # returns the duplicated rows

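The second is to set a threshold and, for each row in df1, look for potential matches among the rows of df2.

Sample data (note that the datasets have different lengths):

For df1: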

Check
how to join to first row
large data work flows
I have two dataframes
fix grammatical or spelling errors
indent code by 4 spaces
why are you posting here?
add language identifier
my dad loves watching football

and df2:

Check
small data work flows
I have tried to puzze out an answer
mix grammatical or spelling errors
indent code by 2 spaces
indent code by 8 spaces
put returns between paragraphs
add curry on the chicken curry
mom!! mom!! mom!!
create code fences with backticks
are you crazy? 
Trump did not win the last presidential election

To do this, I am using the following function:

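from fuzzywuzzy import fuzz

def check(df1, thres, col):
    # Flag the rows of df1 whose 'Check' text scores at least `thres` against the string `col`
    matches = df1.apply(lambda row: (fuzz.ratio(row['Check'], col) / 100.0) >= thres, axis=1)
    return [df1.Check.iloc[i] for i, x in enumerate(matches) if x]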

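This should let me find the matching rows.

The problem with the second approach (the one I am most interested in) is that it does not actually take both dataframes into account.

What I expect from the first function is two dataframes, one for df1 and one for df2, each with an extra column listing, for every row, the similar rows found in the other dataframe. From the second code, I should only assign a similarity value to them (I should have as many columns as there are rows).

If you need more information or more code, please let me know. Maybe there is an easier way to determine this similarity, but unfortunately I have not found it yet.

Any suggestion is welcome.

Expected output:

(This is only an example; since I am setting a threshold, the output may change.)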

df1

Check                               sim
how to join to first row            []
large data work flows               [small data work flows]
I have two dataframes               []
fix grammatical or spelling errors  [mix grammatical or spelling errors]
indent code by 4 spaces             [indent code by 2 spaces, indent code by 8 spaces]
why are you posting here?           []
add language identifier             []
my dad loves watching football      []

df2

Check                                             sim
small data work flows                             [large data work flows]
I have tried to puzze out an answer               []
mix grammatical or spelling errors                [fix grammatical or spelling errors]
indent code by 2 spaces                           [indent code by 4 spaces]
indent code by 8 spaces                           [indent code by 4 spaces]
put returns between paragraphs                    []
add curry on the chicken curry                    []
mom!! mom!! mom!!                                 []
create code fences with backticks                 []
are you crazy?                                    []
Trump did not win the last presidential election  []

Answer 1

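I think your fuzzywuzzy solution is pretty good. I have implemented what you are looking for below. Note that the score table grows as len(df1)*len(df2), so it is quite memory-intensive, but it should at least be reasonably clear. You may find the output of gen_scores useful as well.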

import pandas as pd
from fuzzywuzzy import fuzz
from itertools import product

def gen_scores(df1, df2):
    # generates a score for all row combinations between dfs
    df_score = pd.DataFrame(product(df1.Check, df2.Check), columns=["c1", "c2"])
    df_score["score"] = df_score.apply(lambda row: (fuzz.ratio(row["c1"], row["c2"]) / 100.0), axis=1)
    return df_score

def get_matches(df1, df2, thresh=0.5):
    # get all matches above a threshold, appended as list to each df
    df = gen_scores(df1, df2)
    df = df[df.score > thresh]

    matches = df.groupby("c1").c2.apply(list)
    df1 = pd.merge(df1, matches, how="left", left_on="Check", right_on="c1")
    df1 = df1.rename(columns={"c2":"matches"})
    df1.loc[df1.matches.isnull(), "matches"] = df1.loc[df1.matches.isnull(), "matches"].apply(lambda x: [])

    matches = df.groupby("c2").c1.apply(list)
    df2 = pd.merge(df2, matches, how="left", left_on="Check", right_on="c2")
    df2 = df2.rename(columns={"c1":"matches"})
    df2.loc[df2.matches.isnull(), "matches"] = df2.loc[df2.matches.isnull(), "matches"].apply(lambda x: [])
    return (df1, df2)

# call code:
df1_match, df2_match = get_matches(df1, df2, thresh=0.5)

Output:

                                               Check                                            matches
0                           how to join to first row                                                 []
1                              large data work flows                            [small data work flows]
2                              I have two dataframes                                                 []
3  fix grammatical or spelling errors [mix gramma...               [mix grammatical or spelling errors]
4                            indent code by 4 spaces  [indent code by 2 spaces, indent code by 8 spa...
5                          why are you posting here?                                   [are you crazy?]
6                            add language identifier                                                 []
7                     my dad loves watching football                                                 []
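
Since the full score table grows as len(df1)*len(df2), a lower-memory variant might skip the table and only keep the pairs that pass the threshold. The following is a minimal sketch of that idea, not the implementation above: it works on plain lists of strings, and it uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio, so the scores may differ slightly from fuzzywuzzy's. The names similarity and match_lists are made up for this illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Assumption: stand-in for fuzz.ratio(a, b) / 100.0, standard library only.
    return SequenceMatcher(None, a, b).ratio()

def match_lists(left, right, thresh=0.5):
    # Hypothetical helper: for every string, collect the strings from the
    # other list whose score is above thresh. Only matches are stored,
    # never the full len(left) * len(right) score table.
    left_matches = [[] for _ in left]
    right_matches = [[] for _ in right]
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            if similarity(a, b) > thresh:
                left_matches[i].append(b)
                right_matches[j].append(a)
    return left_matches, right_matches

# A few rows taken from the sample data in the question.
left = [
    "large data work flows",
    "indent code by 4 spaces",
    "my dad loves watching football",
]
right = [
    "small data work flows",
    "indent code by 2 spaces",
    "indent code by 8 spaces",
    "mom!! mom!! mom!!",
]
left_matches, right_matches = match_lists(left, right, thresh=0.7)
```

With the dataframes from the question, the inputs would be df1.Check.tolist() and df2.Check.tolist(), and the two returned lists can be assigned back as a new column on each dataframe. The number of comparisons is still len(df1)*len(df2); only the memory footprint shrinks.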


python

similarity

pandas

fuzzywuzzy