Find identical adjacent strings with regex and Python
Consider this text:
...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,
genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast
beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow
(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)
beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...
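I want to parse this text with Python and keep only the strings that appear exactly twice, back to back. For example, an acceptable result would be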
bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne
because the pattern is that each string is immediately followed by an identical copy of itself, like this:
bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne
So how can I search for adjacent, identical strings with a regex? I am testing my attempts here. Thanks!
You can use the following regex:
(\b.+)\1
See demo
Or, to match and capture only the unique substring part:
(\b.+)(?=\1)
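The word boundary \b ensures that the match can only begin at a word boundary, not in the middle of a word. Then .+ matches one or more characters other than a newline (in single-line mode, re.DOTALL in Python, . also matches a newline). Finally, the backreference \1 matches exactly the same character sequence that was captured by (\b.+).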
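To see what the word boundary contributes, here is a small sketch that runs the pattern with and without \b on one continuation line from the sample text. It is only an illustration: the variable names are made up for this example.

```python
import re

# One continuation line from the glossary excerpt in the question.
line = "flavour to a hazelnut but not commonly used."

# Without \b the capture may start mid-word, so the "mm" in "commonly"
# counts as a string followed by an identical copy of itself.
unanchored = re.findall(r"(.+)\1", line)

# With \b the capture has to start at a word boundary, so nothing
# on this line qualifies.
anchored = re.findall(r"(\b.+)\1", line)

print(unanchored)
print(anchored)
```

On this line the unanchored pattern should report the single letter m, while the version with \b should report nothing.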
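With the (?=\1) lookahead version, the matched text does not include the repeated part, because a lookahead only checks the text that follows without consuming it.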
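The difference between the two patterns shows up in the overall match rather than in the captured group. A minimal sketch, assuming one line from the question's sample as input (the names consuming and lookahead are illustrative):

```python
import re

text = "beefbeef The meat of the animal known as a cow"

# \1 consumes the second copy, so the overall match covers both copies.
consuming = re.search(r"(\b.+)\1", text)

# (?=\1) only checks that the second copy follows; it is not part of the match.
lookahead = re.search(r"(\b.+)(?=\1)", text)

print(consuming.group(0), consuming.span())   # overall match, both copies
print(lookahead.group(0), lookahead.span())   # overall match, first copy only
print(consuming.group(1), lookahead.group(1)) # the capture is the same in both
```

Either way group(1) holds a single copy of the string, so the choice mostly matters if you use group(0) or the match span, for example when replacing text.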
Update
#!/usr/bin/env python3
import re

# Capture text that starts at a word boundary, then require the same text again via \1.
p = re.compile(r'(\b.+)\1')
test_str = "zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\nabbrühenabbrühen"
for m in p.finditer(test_str):
    # group(1) holds one copy of the duplicated string.
    print(m.group(1))
Output:
zyme
abbrühen
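As a further check, the same pattern can be run over the excerpt from the question. This is a sketch written for Python 3; glossary and headwords are names invented for the example, and the excerpt is copied from the question without the leading and trailing "..." lines.

```python
import re

glossary = """\
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,
genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast
beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow
(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)
beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
"""

# "." does not match a newline by default, so each match stays on one line.
pattern = re.compile(r"(\b.+)\1")

# Collect one copy of every string that is immediately followed by itself.
headwords = [m.group(1) for m in pattern.finditer(glossary)]

for word in headwords:
    print(word)
```

On this excerpt the result should be the seven headwords listed in the question. Other text may behave differently: any repeated run that happens to start at a word boundary inside a definition would be picked up as well, so it is worth spot-checking the output on the full file.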