Diff strings in a dataframe
I have a pandas dataframe containing strings like these (300k of them).

index | original | modified |
---|---|---|
0 | This is the original sentence | This is the changed sentence |
1 | This is a different sentence | This is the same sentence |

I want to diff the strings. Ideally, I would create a third column containing the changes:

index | original | modified | change |
---|---|---|---|
0 | This is the original sentence | This is the changed sentence | original -> changed |
1 | This is a different sentence | This is the same sentence | a different -> the same |

But even just being able to output the differences would be great.
I tried using `df.apply` with `difflib.ndiff()`, but it outputs
<generator object Differ.compare at 0x7ff8121a...
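
The generator object shows up because `difflib.ndiff()` is lazy: it returns a generator of diff lines rather than a string, so `apply` stores the generator itself. A minimal sketch of consuming it, assuming a word-level comparison (splitting on whitespace) is acceptable; `ndiff_text` and the `diff` column are hypothetical names used only for illustration:

```python
import difflib
import pandas as pd

df = pd.DataFrame({
    'original': ['This is the original sentence', 'This is a different sentence'],
    'modified': ['This is the changed sentence', 'This is the same sentence'],
})

def ndiff_text(o, m):
    # ndiff() yields one line per word, prefixed with '  ' (unchanged),
    # '- ' (only in o) or '+ ' (only in m); joining consumes the generator.
    return '\n'.join(difflib.ndiff(o.split(), m.split()))

df['diff'] = df.apply(lambda row: ndiff_text(row['original'], row['modified']), axis=1)
print(df.loc[0, 'diff'])
```

For words that are similar to each other, `ndiff()` may also emit hint lines starting with `? `, so any post-processing should filter on the prefix it cares about.
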
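I'm not sure whether `difflib` has a ready-made method that does exactly what you want, but it certainly has some ingredients you can use.
Here is an example of what you can do with `SequenceMatcher` from `difflib` (docs):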

    import difflib
    import pandas as pd

    records = [
        {'original': 'This is the original sentence', 'modified': 'This is the changed sentence'},
        {'original': 'This is a different sentence', 'modified': 'This is the same sentence'},
    ]
    df = pd.DataFrame(records)
    print(df)

    def getDiff(o, m):
        # List every opcode SequenceMatcher reports, with the slices it refers to.
        diffStr = ''
        sm = difflib.SequenceMatcher(None, o, m)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            diffStr += '\n' if diffStr else ''
            diffStr += f'{tag:7} o[{i1}:{i2}] --> m[{j1}:{j2}] {o[i1:i2]!r:>6} --> {m[j1:j2]!r}'
        return f"original: {o}\nmodified: {m}\n" + diffStr

    out = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
    for x in out:
        print(x)

Output:

                            original                      modified
    0  This is the original sentence  This is the changed sentence
    1   This is a different sentence     This is the same sentence
    original: This is the original sentence
    modified: This is the changed sentence
    equal   o[0:12] --> m[0:12] 'This is the ' --> 'This is the '
    replace o[12:15] --> m[12:16]  'ori' --> 'chan'
    equal   o[15:16] --> m[16:17]    'g' --> 'g'
    replace o[16:20] --> m[17:19] 'inal' --> 'ed'
    equal   o[20:29] --> m[19:28] ' sentence' --> ' sentence'
    original: This is a different sentence
    modified: This is the same sentence
    equal   o[0:8] --> m[0:8] 'This is ' --> 'This is '
    insert  o[8:8] --> m[8:13]     '' --> 'the s'
    equal   o[8:9] --> m[13:14]    'a' --> 'a'
    replace o[9:14] --> m[14:15] ' diff' --> 'm'
    equal   o[14:15] --> m[15:16]    'e' --> 'e'
    delete  o[15:19] --> m[16:16] 'rent' --> ''
    equal   o[19:28] --> m[16:25] ' sentence' --> ' sentence'

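The `replace`, `insert` and `delete` opcodes can help you do what you're asking. Note, however, that "original" and "changed" are compared character by character (so the letter "g" in both words is detected as unchanged), so it may take some extra work to get the exact sample output in your question.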
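
One way to sidestep the character-level matching is to hand `SequenceMatcher` lists of words instead of raw strings, since it accepts any sequences of hashable items. A minimal sketch, assuming whitespace tokenization is good enough (punctuation stays attached to its word); `word_diff` is a hypothetical helper name:

```python
import difflib

def word_diff(o, m):
    # Compare word lists, so opcodes refer to whole words, not characters.
    ow, mw = o.split(), m.split()
    sm = difflib.SequenceMatcher(None, ow, mw, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            # One entry per changed run of words; either side may be empty.
            changes.append(' '.join(ow[i1:i2]) + ' -> ' + ' '.join(mw[j1:j2]))
    return changes

print(word_diff('This is the original sentence', 'This is the changed sentence'))
print(word_diff('This is a different sentence', 'This is the same sentence'))
```

Because the whitespace is discarded when splitting, the reported changes are rebuilt with single spaces and do not preserve the original spacing.
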
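Updated:
I gave this some more thought (since it would definitely be an attractive capability to build on top of `difflib`) and came up with a strategy that uses `get_opcodes()` from `SequenceMatcher` to produce the exact output specified for the change column in the question's example. I don't know whether it gives satisfactory results for every possible example, but that is true of many problems and solutions: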

    import difflib
    import pandas as pd

    records = [
        {'original': 'This is the original sentence', 'modified': 'This is the changed sentence'},
        {'original': 'This is a different sentence', 'modified': 'This is the same sentence'},
    ]
    df = pd.DataFrame(records)
    print(df)

    def getDiff(o, m):
        sm = difflib.SequenceMatcher(None, o, m)
        oStart, mStart, oEnd, mEnd = None, None, None, None
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag != 'equal':
                # Remember where the first change starts ...
                if oStart is None:
                    oStart, mStart = i1, j1
                # ... and keep extending the span to the end of the latest change.
                oEnd, mEnd = i2, j2
        diffStr = '<no change>' if oStart is None else o[oStart:oEnd] + ' -> ' + m[mStart:mEnd]
        return diffStr

    df['changed'] = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
    print(df)

Output:

                            original                      modified
    0  This is the original sentence  This is the changed sentence
    1   This is a different sentence     This is the same sentence
                            original                      modified                  changed
    0  This is the original sentence  This is the changed sentence      original -> changed
    1   This is a different sentence     This is the same sentence  a different -> the same

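
To illustrate the caveat about not every example working well: this strategy reports a single span running from the first changed character to the last, so when a sentence has several separate edits, the unchanged text between them ends up inside the reported change. A small sketch with a made-up pair of sentences (not from the question) that differ at both ends:

```python
import difflib

def getDiff(o, m):
    # Same first-to-last span strategy as in the update above.
    sm = difflib.SequenceMatcher(None, o, m)
    oStart = mStart = oEnd = mEnd = None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            if oStart is None:
                oStart, mStart = i1, j1
            oEnd, mEnd = i2, j2
    return '<no change>' if oStart is None else o[oStart:oEnd] + ' -> ' + m[mStart:mEnd]

# Made-up pair: edits at both ends, identical text in the middle.
o = 'red apple and green pear'
m = 'blue apple and green plum'
result = getDiff(o, m)
print(result)
# red apple and green pear -> blue apple and green plum
```

Here the unchanged " apple and green " appears on both sides of the arrow, which is why multi-edit sentences need the changes split into separate spans.
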
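Update #3:
OK, I now have a solution that treats punctuation, whitespace and string boundaries as word delimiters, and groups the opcodes according to whether they are standalone (i.e. bounded on both sides by a delimiter or a string boundary).
I've added a more complex example to illustrate what it does. For clarity, I've also wrapped all resulting substrings in single quotes.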

    import difflib
    import string
    import pandas as pd

    records = [
        {'original': 'This, my good friend, is a very small piece of cake', 'modified': 'That, my friend, is a very, very large piece of work'},
        {'original': 'This is the original sentence', 'modified': 'This is the changed sentence'},
        {'original': 'This is a different sentence', 'modified': 'This is the same sentence'},
    ]
    df = pd.DataFrame(records)
    print(df.to_string(index=False))

    def isStandalone(x, i1, i2):
        # Trim delimiters from both ends of x[i1:i2], then check that what is
        # left is bounded on each side by a delimiter or a string boundary.
        puncAndWs = string.punctuation + string.whitespace
        while i1 < i2 and x[i1] in puncAndWs:
            i1 += 1
        while i1 < i2 and x[i2 - 1] in puncAndWs:
            i2 -= 1
        return (i1 == 0 or x[i1 - 1] in puncAndWs) and (i2 == len(x) or x[i2] in puncAndWs)

    def makeDiff(o, m, oStart, oEnd, mStart, mEnd):
        oChange = "'" + o[oStart:oEnd] + "'"
        mChange = "'" + m[mStart:mEnd] + "'"
        return '<no change>' if oStart is None else oChange + ' -> ' + mChange

    def getDiff(o, m):
        sm = difflib.SequenceMatcher(None, o, m)
        diffList = []
        oStart, mStart, oEnd, mEnd = None, None, None, None
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            bothStandalone = isStandalone(o, i1, i2) and isStandalone(m, j1, j2)
            if bothStandalone:
                if oStart is not None:
                    if tag == 'equal':
                        # A standalone unchanged run closes the pending change.
                        diffList.append(makeDiff(o, m, oStart, oEnd, mStart, mEnd))
                        oStart, mStart, oEnd, mEnd = None, None, None, None
                    else:
                        oEnd, mEnd = i2, j2
                elif tag != 'equal':
                    oStart, mStart = i1, j1
                    oEnd, mEnd = i2, j2
            elif oStart is not None:
                # Opcode cuts through a word: extend the pending change.
                oEnd, mEnd = i2, j2
            else:
                # Opcode cuts through a word: start a new pending change.
                oStart, mStart = i1, j1
                oEnd, mEnd = i2, j2
        if oStart is not None:
            diffList.append(makeDiff(o, m, oStart, oEnd, mStart, mEnd))
        diffStr = ', '.join(diffList)
        return diffStr

    df['changed'] = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
    #print(df.to_string(index=False))
    df.drop(['original', 'modified'], axis=1, inplace=True)
    print(df.to_string(index=False))

Output:

                                               original                                              modified
    This, my good friend, is a very small piece of cake  That, my friend, is a very, very large piece of work
                          This is the original sentence                          This is the changed sentence
                           This is a different sentence                             This is the same sentence
                                                                          changed
    'This' -> 'That', ' good' -> '', ' small' -> ', very large', 'cake' -> 'work'
                                                          'original' -> 'changed'
                                                      'a different' -> 'the same'
