Python 3: differences between two strings
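I want to record, in a list, the positions where two strings differ (in order to remove those parts)... preferably recording the highest separation point of each section, since these regions will contain dynamic content.
Compare these:
Total characters: 178. Two unique sections.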
t1 = 'WhereTisthetotalnumberofght5y5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although'
and
Total characters: 211. Two unique sections.
t2 = 'WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although'
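I know difflib can do this, but its output is awful.
I want to store the character positions in a list, preferably the larger separation values.
Pattern positions: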
t1 = 'WhereTisthetotalnumberof 24 ght5y5wsjhhhhjhkmhm 43 Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofap 151 xxxxxxx 158 proximation,although'
t2 = 'WhereTisthetotalnumberof 24 dofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs 76 Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentre 155 xxxxxxx 162 sultsduetodifferinglevelsofapproximation,although'
Output:
output list = [24, 76, 151, 162]
Update
In response to @Olivier's post:
Positions of every y, marked with ***
t1
WhereTisthetotalnumberofght5***y***5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although
t2
WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssugu***y***gui***y***gis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although
Output after matcher.get_matching_blocks() and string = ''.join([t1[a:a+n] for a, _, n in blocks]):
WhereTisthetotalnumberof***y*** Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapproximation,although
Using difflib is probably your best option, since you are unlikely to come up with a solution more efficient than the algorithms it provides. What you want is SequenceMatcher.get_matching_blocks. Here is what it returns, according to the doc:
Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.
Here is one way you could use it to rebuild a string with the deltas removed.
from difflib import SequenceMatcher
x = "abc_def"
y = "abc--ef"
matcher = SequenceMatcher(None, x, y)
blocks = matcher.get_matching_blocks()
# blocks: [Match(a=0, b=0, size=3), Match(a=5, b=5, size=2), Match(a=7, b=7, size=0)]
string = ''.join([x[a:a+n] for a, _, n in blocks])
# string: "abcef"
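Since the question asks for positions rather than the cleaned string, the same matcher can also report where the non-matching regions sit. The following is only a sketch: the helper name diff_spans is made up for illustration, and passing autojunk=False is an assumption made here so that the "popular character" heuristic, which difflib applies once the second sequence reaches 200 items, does not affect longer inputs.

```python
from difflib import SequenceMatcher


def diff_spans(a, b):
    """Return (a_start, a_end, b_start, b_end) for every region that differs.

    Hypothetical helper for illustration; it is not part of difflib.
    """
    # autojunk=False is an assumption: it switches off difflib's
    # heuristic for very frequent items in long sequences.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Tags are 'equal', 'replace', 'delete' or 'insert';
        # everything except 'equal' marks a difference.
        if tag != "equal":
            spans.append((i1, i2, j1, j2))
    return spans


x = "abc_def"
y = "abc--ef"
spans = diff_spans(x, y)
# spans: [(3, 5, 3, 5)]  ->  x[3:5] == "_d" was replaced by y[3:5] == "--"
```

Each tuple uses slice-style indices, so x[a_start:a_end] and y[b_start:b_end] are the differing pieces.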
Edit: It was also pointed out that if you have two strings like these:
t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'
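then the code above would return 'WordWordyWordWord'. This is because get_matching_blocks also picks up the 'y' that appears in both strings between the expected blocks. One solution is to filter the returned blocks by length.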
string = ''.join([x[a:a+n] for a, _, n in blocks if n > 1])
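To make the effect of the length filter concrete, here is a small self-contained run on the two 'Word' strings from the edit above. The variable names unfiltered and filtered are only illustrative.

```python
from difflib import SequenceMatcher

t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'

blocks = SequenceMatcher(None, t1, t2).get_matching_blocks()

# Without a filter, the stray one-character match on 'y' is kept.
unfiltered = ''.join(t1[a:a+n] for a, _, n in blocks)
# unfiltered: 'WordWordyWordWord'

# Keeping only blocks longer than one character drops it.
filtered = ''.join(t1[a:a+n] for a, _, n in blocks if n > 1)
# filtered: 'WordWordWordWord'
```

The threshold of 1 is a judgment call; with real data, a short accidental match inside a dynamic region could be longer than a single character, so the cutoff may need tuning.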
If you want to do a more complex analysis of the returned blocks, you could also do the following.
def block_filter(substring):
    """Outputs True if the substring is to be merged, False otherwise"""
    ...

string = ''.join([x[a:a+n] for a, _, n in blocks if block_filter(x[a:a+n])])
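Building on the filtering idea, one possible way to get a flat list of boundaries like the one asked for in the question is to look at the gaps between the blocks that survive the filter. This is a sketch under several assumptions: gap_boundaries and min_size are hypothetical names, autojunk=False is chosen here to avoid difflib's heuristic on sequences of 200 or more items, and "larger separation value" is interpreted as taking the smaller start and the larger end across the two strings. It is not claimed to reproduce the exact list in the question.

```python
from difflib import SequenceMatcher


def gap_boundaries(a, b, min_size=2):
    """Return a flat list [start, end, start, end, ...] of differing regions.

    Hypothetical helper: for each gap between kept matching blocks, the
    start is the smaller of the two gap starts and the end is the larger
    of the two gap ends, so each pair covers the gap in both strings.
    """
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    # Drop short, probably accidental matches (this also drops the
    # zero-length sentinel), then put the sentinel back so a trailing
    # difference is still detected.
    kept = [m for m in blocks if m.size >= min_size]
    kept.append(blocks[-1])

    boundaries = []
    prev_a = prev_b = 0  # end of the previous kept block in a and in b
    for m in kept:
        if m.a > prev_a or m.b > prev_b:
            boundaries.append(min(prev_a, prev_b))
            boundaries.append(max(m.a, m.b))
        prev_a, prev_b = m.a + m.size, m.b + m.size
    return boundaries


word_gaps = gap_boundaries('WordWordaayaaWordWord', 'WordWordbbbybWordWord')
# word_gaps: [8, 13]

uneven_gaps = gap_boundaries('abcXYdef', 'abcPQRSdef')
# uneven_gaps: [3, 7]  (the gap is 3..5 in the first string, 3..7 in the second)
```

Note that when the two strings have different lengths, the end values mix indices from both strings, so they are only meaningful as an upper bound on where the dynamic content stops.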