Find identical adjacent strings with regex and Python

Consider this text:

...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,

genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast

beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow

(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)

beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...

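I want to parse this text with Python and keep only the strings that occur exactly twice and sit next to each other. For example, an acceptable result would be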

bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne

because the pattern is that each string is immediately followed by an identical copy of itself, like this:

bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne

So how do I search for adjacent, identical strings with a regex? I'm testing my attempts here. Thanks!

You can use the following regex:

(\b.+)\1

demo

Or, to match and capture only the unique substring part:

(\b.+)(?=\1)

Another demo

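The word boundary \b ensures the match starts at a word boundary, so a repeat cannot begin in the middle of a word. Then .+ matches 1 or more characters other than a newline (in single-line, i.e. DOTALL, mode . would also match a newline). Finally, the backreference \1 matches exactly the same character sequence that was captured by (\b.+).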
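To illustrate why the \b matters, here is a minimal sketch using Python's re module. The sample string "beech tree" is only an illustration and is not taken from the question's data.

```python
import re

text = "beech tree"

# Without \b, the repeat may start mid-word: a doubled "e" counts as X followed by X.
unanchored = re.findall(r'(.+)\1', text)

# With \b, a candidate must start at a word boundary, so nothing matches here.
anchored = re.findall(r'(\b.+)\1', text)

print(unanchored)
print(anchored)
```

With one capturing group, re.findall returns the group's contents, so the first list shows the false positives that the boundary filters out.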

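With the (?=\1) lookahead version, the matched text does not include the repeated part, because a lookahead does not consume text, so the second copy is checked but left out of the match.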
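The difference between the two patterns can be seen by comparing the whole match (group 0) in each case. This is a small sketch on one line from the question; the variable names are just for illustration.

```python
import re

line = "beefbeef The meat of the animal known as a cow"

# Backreference version: both copies are consumed by the match.
consuming = re.search(r'(\b.+)\1', line)

# Lookahead version: the second copy is only checked, not consumed.
lookahead = re.search(r'(\b.+)(?=\1)', line)

print(consuming.group(0), consuming.group(1))
print(lookahead.group(0), lookahead.span())
```

In both cases group 1 holds the unique part, so either pattern works if you only read group(1).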

Update

Python demo:

#!/usr/bin/env python3
import re

# \1 is a backreference: it must repeat exactly what group 1 captured
p = re.compile(r'(\b.+)\1')
test_str = "zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\nabbrühenabbrühen"
for m in p.finditer(test_str):
    print(m.group(1))  # the unique (non-repeated) part

Output:

zyme
abbrühen
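As a rough sketch, the same pattern can be applied to a few lines from the question's sample to collect the doubled headwords. The names sample and terms are hypothetical, and only part of the sample text is used here.

```python
import re

# A few lines copied from the question's sample text.
sample = (
    "bedeubedeu France The Provençal name for tripe\n"
    "bee balmbee balm Bergamot\n"
    "beechmastbeechmast Beech nut\n"
    "beech nutbeech nut A small nut from the beech tree,\n"
    "genus Fagus and Nothofagus, similar in\n"
    "beefbeef The meat of the animal known as a cow\n"
)

# Group 1 is the unique part of each "XX" repeat; "." never crosses a newline here.
pattern = re.compile(r'(\b.+)\1')
terms = [m.group(1) for m in pattern.finditer(sample)]

for term in terms:
    print(term)
```

Note that the pattern is not tied to the start of a line, so on other text it could also pick up accidental repeats such as a doubled letter at the start of a word. If the headwords always begin a line, anchoring with ^ and re.MULTILINE may be a safer variant.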