Compute Levenshtein distance between two string columns from two different dataframes

Question

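I have two dataframes that each contain the same string column (Hostname). I want to compute the Levenshtein distance for every possible pair of hostnames across the two dataframes and put the results in a third dataframe, which holds the distance for each pair together with the two indexes of that pair.

For example, suppose I have these two dataframes: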

Index      Hostname
85608             dlt-rly-tracker-3.datto.com
9378      lnv7bc4241e2.1528.ozvision.ozsn.net
22791             dlt-rly-tracker-1.datto.com
88922                                 pw-file
94560     lnv7bc4241e2.1528.ozvision.ozsn.net
13245                                       -
63604                                 pw-file
435839                                pw-file
95473                                       -
13856                                 pw-file
210705                                pw-file
30046                                       -
106917            dlt-rly-tracker-2.datto.com
415925                                pw-file
170471                                pw-file
73971                                       -
86885             dlt-rly-tracker-3.datto.com
162764                                pw-file
74791                                 pw-file

And the second dataframe:

Index     Hostname
93358                  device.dattobackup.com
34067             dlt-rly-tracker-5.datto.com
18083               46.104.89.54.in-addr.arpa
96798                                 pw-file
130940                                pw-file
31476     lnv7bc4241e2.1528.ozvision.ozsn.net
149723                                pw-file
52901                                       -
308834    lnv7bc4241e2.1528.ozvision.ozsn.net
24196                                 pw-file
69038                                       -
244454    lnv7bc4241e2.1528.ozvision.ozsn.net
2867                                        -
45549                        daisy.ubuntu.com
334378                                pw-file
86006               46.104.89.54.in-addr.arpa
430257                                pw-file
86150               46.104.89.54.in-addr.arpa
65189                                 pw-file

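What I want to do is take the first Hostname value (dlt-rly-tracker-3.datto.com) and compute the Levenshtein distance against every Hostname value in the second dataframe, one by one, then do the same for each remaining value. At the end of this process, the results should be stored in a new dataframe similar to the following: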

Indexes         Distance    Hostnames
85608-93358     23          dlt-rly-tracker-3.datto.com,device.dattobackup.com
85608-34067     60          dlt-rly-tracker-3.datto.com,dlt-rly-tracker-5.datto.com

Any help with this problem is greatly appreciated. Thanks.

Answer 1

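The solution below loops over both dataframes (df is the first, df1 the second) and builds a new dictionary with the required data. You can then convert this dictionary into a dataframe. Let me know if this helps!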

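from nltk import edit_distance  # assumption: any Levenshtein implementation works here

dist = {}
for rowname, row in df.iterrows():
    for rowname1, row1 in df1.iterrows():
        # Edit distance between the two hostnames
        L = edit_distance(row.Hostname, row1.Hostname)
        # Key: "index-index"; value: (distance, "hostname,hostname")
        dist[f"{rowname}-{rowname1}"] = (L, row.Hostname + "," + row1.Hostname)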
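
The conversion step mentioned in this answer is not shown, so here is a minimal sketch of one way to do it. The `dist` dictionary below is a small hypothetical stand-in with the same shape the loop produces (the keys reuse indexes from the question), and the column names `Indexes`, `Distance` and `Hostnames` are taken from the output layout the question asks for.

```python
import pandas as pd

# Hypothetical `dist` in the shape the loop produces:
# key = "index-index", value = (distance, "hostname,hostname").
dist = {
    "88922-96798": (0, "pw-file,pw-file"),
    "88922-52901": (6, "pw-file,-"),
}

# orient="index" turns each key into a row label and each tuple into a row.
out = pd.DataFrame.from_dict(dist, orient="index", columns=["Distance", "Hostnames"])

# Move the "index-index" labels into a regular column named "Indexes".
out = out.rename_axis("Indexes").reset_index()
print(out)
```

Building the dictionary first and creating the dataframe once at the end is usually cheaper than appending rows to a dataframe inside the loop.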

Answer 2

Here is my solution.

import pandas as pd
from nltk import edit_distance

At this point you need to create your two DataFrames. I assume they are called df1 and df2.

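# "Index" is assumed to be a regular column, as in the tables in the question.
outputList = []
for _, rowLeft in df1.iterrows():
    for _, rowRight in df2.iterrows():
        # Left (df1) value first, to match the layout requested in the question
        indexes = str(rowLeft["Index"]) + "-" + str(rowRight["Index"])
        distance = edit_distance(rowLeft["Hostname"], rowRight["Hostname"])
        hostNames = rowLeft["Hostname"] + "," + rowRight["Hostname"]
        outputList.append({"Indexes": indexes, "Distance": distance, "Hostnames": hostNames})

outputDf = pd.DataFrame(outputList)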
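
For anyone who wants to try the idea without the nested loops, here is a self-contained sketch of a different technique: a pandas cross join instead of `iterrows()`. It is not part of either answer. The `levenshtein` helper is a plain dynamic-programming stand-in written for this example (`nltk.edit_distance` could be swapped in), the sample rows are a small subset of the question's data, and `Index` is assumed to be a regular column.

```python
import pandas as pd


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance; insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


# A few rows taken from the question; "Index" is assumed to be a regular column.
df1 = pd.DataFrame({
    "Index": [85608, 88922],
    "Hostname": ["dlt-rly-tracker-3.datto.com", "pw-file"],
})
df2 = pd.DataFrame({
    "Index": [34067, 96798, 52901],
    "Hostname": ["dlt-rly-tracker-5.datto.com", "pw-file", "-"],
})

# how="cross" (pandas 1.2+) pairs every row of df1 with every row of df2.
pairs = df1.merge(df2, how="cross", suffixes=("_left", "_right"))

result = pd.DataFrame({
    "Indexes": pairs["Index_left"].astype(str) + "-" + pairs["Index_right"].astype(str),
    "Distance": [
        levenshtein(a, b)
        for a, b in zip(pairs["Hostname_left"], pairs["Hostname_right"])
    ],
    "Hostnames": pairs["Hostname_left"] + "," + pairs["Hostname_right"],
})
print(result)
```

The number of pairs is still len(df1) * len(df2), so this does not reduce the amount of distance computation; it only moves the pairing and string concatenation into pandas. Since many hostnames in the sample data repeat (for example pw-file), computing distances on unique hostname pairs first may help on larger inputs.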

Tags: python, similarity, dataframe, levenshtein-distance, pandas