Discover identically adjacent strings with regex and python

Question

Consider this text:

...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,

genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast

beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow

(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)

beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...

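I would like to parse this text with Python and keep only the strings that appear exactly twice and are adjacent. For example, an acceptable result would be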

bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne

because the pattern in the text is that each string is immediately followed by an identical copy of itself, like this:

bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne

So how can one search for adjacent, identical strings with a regex? I am testing my attempts here. Thanks!

Answer 1

You can use the following regex:

(\b.+)\1

See demo

Or, to match and capture only the unique substring part:

(\b.+)(?=\1)

Another demo

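The word boundary \b ensures that the match starts at a word boundary (in this data, the beginning of a word). Then .+ matches 1 or more characters other than a newline (in single-line mode, . also matches a newline). Finally, with the help of the backreference \1, we match the exact same sequence of characters that was captured by (\b.+).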
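The roles of \b and of the dot's newline behaviour can be checked with a small sketch. The test strings below are illustrative assumptions (a fragment of a date line and a two-line string), not data from the question, and the variable names are made up for this example.

```python
import re

# Without \b, a doubled character inside a token ("00" in "2004") is reported.
line = "August 19, 2004 7:50 PM"
without_boundary = re.findall(r'(.+)\1', line)
with_boundary = re.findall(r'(\b.+)\1', line)
print(without_boundary)  # ['0']
print(with_boundary)     # []

# By default "." stops at a newline, so a repeat cannot span two lines.
# With re.DOTALL (single-line mode) the captured chunk may include the newline.
two_lines = "ab\nab\n"
default_mode = re.findall(r'(\b.+)\1', two_lines)
dotall_mode = re.findall(r'(\b.+)\1', two_lines, flags=re.DOTALL)
print(default_mode)  # []
print(dotall_mode)   # ['ab\n']
```

For the glossary text in the question, the default mode is probably the one wanted, since each repeated headword sits on a single line.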

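With the (?=\1) lookahead version, the matched text does not include the repeated part: a lookahead does not consume text, so the second copy is checked but left out of the match.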
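A minimal sketch of the difference between the two patterns, using one line from the question's sample. The variable names are assumptions for illustration only.

```python
import re

text = "beefbeef The meat of the animal known as a cow"

# Consuming version: the whole match covers both copies.
consuming = re.search(r'(\b.+)\1', text)
# Lookahead version: the second copy is only asserted, not consumed.
lookahead = re.search(r'(\b.+)(?=\1)', text)

print(consuming.group(0))  # beefbeef
print(consuming.group(1))  # beef
print(lookahead.group(0))  # beef
print(lookahead.group(1))  # beef
```

In both cases group 1 holds the unique part, so either pattern works if only group(1) is read.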

Update

See the Python demo:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re

# Capture a chunk that starts at a word boundary and is immediately repeated (\1)
p = re.compile(r'(\b.+)\1')
test_str = "zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\nabbrühenabbrühen"
for m in p.finditer(test_str):
    # group(1) is the unique part, without its adjacent copy
    print(m.group(1))

Output:

zyme
abbrühen
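As a rough check against the question itself, the same pattern can be run over lines copied from the sample text. This is only a sketch: the sample is a subset of the question's lines, and the names sample, pattern and names are assumptions for this example.

```python
import re

# A subset of the lines from the question's sample text.
sample = "\n".join([
    "bedeubedeu France The Provençal name for tripe",
    "bee balmbee balm Bergamot",
    "beechmastbeechmast Beech nut",
    "beech nutbeech nut A small nut from the beech tree,",
    "genus Fagus and Nothofagus, similar in",
    "beechwheatbeechwheat Buckwheat",
    "beefbeef The meat of the animal known as a cow",
    "(female) or bull (male) (NOTE: The Anglo-",
    "beef bourguignonnebeef bourguignonne See boeuf à la",
    "bourguignonne",
])

pattern = re.compile(r'(\b.+)\1')
# With a single capture group, findall returns the captured text of each match.
names = pattern.findall(sample)
for name in names:
    print(name)
```

On this subset the printed names are the seven headwords listed in the question. One caveat: a word that itself starts with a doubled chunk, such as "couscous", would also be reported (as "cous"), so the results may need a manual review on the full text.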
