Diff strings in a dataframe

I have a pandas dataframe containing strings similar to these (about 300k rows).

index  original                       modified
0      This is the original sentence  This is the changed sentence
1      This is a different sentence   This is the same sentence

I would like to diff the strings. Ideally, I would create a third column containing the changes:

index  original                       modified                      change
0      This is the original sentence  This is the changed sentence  original -> changed
1      This is a different sentence   This is the same sentence     a different -> the same

But even just being able to output the differences would be great.

I tried using df.apply with difflib.ndiff(), but it outputs:

<generator object Differ.compare at 0x7ff8121a...

I'm not sure whether difflib has a shrink-wrapped method that does exactly what you want, but it certainly has some ingredients you can use.
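As an aside on the generator output in the question: difflib.ndiff() is lazy, so it hands back a generator that has to be consumed (for example with list()) before any delta lines appear. A minimal sketch, assuming a whitespace split into words is an acceptable unit of comparison; the helper name ndiff_words is hypothetical:

```python
import difflib

def ndiff_words(original, modified):
    # ndiff() returns a generator; list() forces it to produce the delta lines.
    # Each entry starts with '  ' (unchanged), '- ' (only in original)
    # or '+ ' (only in modified).
    return list(difflib.ndiff(original.split(), modified.split()))

delta = ndiff_words('This is the original sentence',
                    'This is the changed sentence')
for entry in delta:
    print(entry)
```

A function like this could be passed to df.apply(..., axis=1) in the same way as the functions in this answer.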

Here is an example of what you can do with SequenceMatcher from difflib (docs):

records = [
    {'original':'This is the original sentence', 'modified':'This is the changed sentence'},
    {'original':'This is a different sentence', 'modified':'This is the same sentence'}
]

import pandas as pd
df = pd.DataFrame(records)
print(df)

import difflib
def getDiff(o, m):
    diffStr = ''
    sm = difflib.SequenceMatcher(None, o, m)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        diffStr += '\n' if diffStr else ''
        diffStr += f'{tag:7} o[{i1}:{i2}] --> m[{j1}:{j2}] {o[i1:i2]!r:>6} --> {m[j1:j2]!r}'
    return f"original: {o}\nmodified: {m}\n" + diffStr


out = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
for x in out:
    print(x)

Output:

                        original                      modified
0  This is the original sentence  This is the changed sentence
1   This is a different sentence     This is the same sentence
original: This is the original sentence
modified: This is the changed sentence
equal   o[0:12] --> m[0:12] 'This is the ' --> 'This is the '
replace o[12:15] --> m[12:16]  'ori' --> 'chan'
equal   o[15:16] --> m[16:17]    'g' --> 'g'
replace o[16:20] --> m[17:19] 'inal' --> 'ed'
equal   o[20:29] --> m[19:28] ' sentence' --> ' sentence'
original: This is a different sentence
modified: This is the same sentence
equal   o[0:8] --> m[0:8] 'This is ' --> 'This is '
insert  o[8:8] --> m[8:13]     '' --> 'the s'
equal   o[8:9] --> m[13:14]    'a' --> 'a'
replace o[9:14] --> m[14:15] ' diff' --> 'm'
equal   o[14:15] --> m[15:16]    'e' --> 'e'
delete  o[15:19] --> m[16:16] 'rent' --> ''
equal   o[19:28] --> m[16:25] ' sentence' --> ' sentence'

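The replace, insert and delete opcodes can help you get what you asked for. Note, however, that "original" and "changed" are compared character by character (which is why the letter "g" in both words is detected as unchanged), so some additional work may be needed to get the exact sample output in your question.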
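One possible way around the character-level matching is to hand SequenceMatcher lists of words instead of raw strings, since it compares any sequences of hashable items. The following is only a sketch under that assumption: the helper name word_diff is hypothetical, and a plain split() leaves punctuation attached to the neighbouring word.

```python
import difflib

def word_diff(original, modified):
    # Compare word lists rather than characters, so a shared letter such as
    # the "g" in "original" and "changed" no longer counts as a match.
    o_words, m_words = original.split(), modified.split()
    sm = difflib.SequenceMatcher(None, o_words, m_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            changes.append(' '.join(o_words[i1:i2]) + ' -> ' + ' '.join(m_words[j1:j2]))
    return ', '.join(changes)

print(word_diff('This is the original sentence', 'This is the changed sentence'))
print(word_diff('This is a different sentence', 'This is the same sentence'))
```

For the two sample rows this prints "original -> changed" and "a different -> the same". Identical inputs produce an empty string.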

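Update: I gave this some more thought (since it is certainly an appealing capability to build on top of difflib) and came up with a strategy that uses get_opcodes() from SequenceMatcher to produce the exact "changed" output specified in the question's example. I don't know that it will give satisfactory results for every possible example, but the same could be said of many problems and solutions: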

records = [
    {'original':'This is the original sentence', 'modified':'This is the changed sentence'},
    {'original':'This is a different sentence', 'modified':'This is the same sentence'}
]

import pandas as pd
df = pd.DataFrame(records)
print(df)

import difflib
def getDiff(o, m):
    sm = difflib.SequenceMatcher(None, o, m)
    oStart, mStart, oEnd, mEnd = None, None, None, None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            if oStart is None:
                oStart, mStart = i1, j1
            oEnd, mEnd = i2, j2
    diffStr = '<no change>' if oStart is None else o[oStart:oEnd] + ' -> ' + m[mStart:mEnd]
    return diffStr

df['changed'] = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
print(df)

Output:

                        original                      modified
0  This is the original sentence  This is the changed sentence
1   This is a different sentence     This is the same sentence
                        original                      modified                  changed
0  This is the original sentence  This is the changed sentence      original -> changed
1   This is a different sentence     This is the same sentence  a different -> the same
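One case where this first-to-last span is coarser than one might like is when a row contains two separate edits: everything between them, changed or not, ends up in a single span. A small sketch with made-up strings; the function repeats the getDiff logic above under a hypothetical name so that the block runs on its own:

```python
import difflib

def first_to_last_diff(o, m):
    # Same idea as getDiff above: report the span from the start of the
    # first non-equal opcode to the end of the last one.
    sm = difflib.SequenceMatcher(None, o, m)
    o_start = m_start = o_end = m_end = None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            if o_start is None:
                o_start, m_start = i1, j1
            o_end, m_end = i2, j2
    if o_start is None:
        return '<no change>'
    return o[o_start:o_end] + ' -> ' + m[m_start:m_end]

# Two separate edits ("aaa" -> "ccc" and "bbb" -> "ddd") around an unchanged middle.
print(first_to_last_diff('aaa XXX bbb', 'ccc XXX ddd'))
# prints: aaa XXX bbb -> ccc XXX ddd
```

The unchanged " XXX " is reported as part of the change on both sides.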

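Update #3: OK, there is now a solution that treats punctuation, whitespace and string boundaries as word delimiters, and groups changes according to whether they are standalone (that is, bounded by these delimiters).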

I have added a more complex example to show what it does. For clarity, I have also wrapped all resulting substrings in single quotes.

records = [
    {'original':'This, my good friend, is a very small piece of cake', 'modified':'That, my friend, is a very, very large piece of work'},
    {'original':'This is the original sentence', 'modified':'This is the changed sentence'},
    {'original':'This is a different sentence', 'modified':'This is the same sentence'}
]

import pandas as pd
df = pd.DataFrame(records)
print(df.to_string(index=False))

import difflib
import string
def isStandalone(x, i1, i2):
    # True if x[i1:i2], ignoring punctuation/whitespace at its ends, is bounded
    # on both sides by punctuation, whitespace, or a string boundary.
    puncAndWs = string.punctuation + string.whitespace
    while i1 < i2 and x[i1] in puncAndWs:
        i1 += 1
    while i1 < i2 and x[i2 - 1] in puncAndWs:
        i2 -= 1
    return (i1 == 0 or x[i1 - 1] in puncAndWs) and (i2 == len(x) or x[i2] in puncAndWs)

def makeDiff(o, m, oStart, oEnd, mStart, mEnd):
    # Format one change as 'old' -> 'new', with each side in single quotes.
    if oStart is None:
        return '<no change>'
    oChange = "'" + o[oStart:oEnd] + "'"
    mChange = "'" + m[mStart:mEnd] + "'"
    return oChange + ' -> ' + mChange

def getDiff(o, m):
    sm = difflib.SequenceMatcher(None, o, m)
    diffList = []
    oStart, mStart, oEnd, mEnd = None, None, None, None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        bothStandalone = isStandalone(o, i1, i2) and isStandalone(m, j1, j2)
        if bothStandalone:
            if oStart is not None:
                if tag == 'equal':
                    diffList.append(makeDiff(o, m, oStart, oEnd, mStart, mEnd))
                    oStart, mStart, oEnd, mEnd = None, None, None, None
                else:
                    oEnd, mEnd = i2, j2
            elif tag != 'equal':
                oStart, mStart = i1, j1
                oEnd, mEnd = i2, j2
        elif oStart is not None:
            oEnd, mEnd = i2, j2
        else:
            oStart, mStart = i1, j1
            oEnd, mEnd = i2, j2
    if oStart is not None:
        diffList.append(makeDiff(o, m, oStart, oEnd, mStart, mEnd))
    diffStr = ', '.join(diffList)
    return diffStr

df['changed'] = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
#print(df.to_string(index=False))

df.drop(['original', 'modified'], axis=1, inplace=True)
print(df.to_string(index=False))

Output:

                                           original                                             modified
This, my good friend, is a very small piece of cake That, my friend, is a very, very large piece of work
                      This is the original sentence                         This is the changed sentence
                       This is a different sentence                            This is the same sentence
                                                                      changed
'This' -> 'That', ' good' -> '', ' small' -> ', very large', 'cake' -> 'work'
                                                      'original' -> 'changed'
                                                  'a different' -> 'the same'