Python 2.7 findall(): excluding a string from regular expression matches
I'm trying to write a regular expression to find specific data in HTML.
For example, I get:
'Leicester City'
'Tottenham Hotspur'
'Arsenal FC'
'Manchester City'
'Manchester United'
'Southampton FC'
'West Ham United'
'Liverpool FC'
'Chelsea FC'
'Stoke City'
'Swansea City'
'Everton FC'
'Watford FC'
'Crystal Palace'
'West Bromwich Albion'
'AFC Bournemouth'
'Sunderland AFC'
'Newcastle United'
'Norwich City'
'Aston Villa'
'Channel Boleyn emotion'
'Channel Boleyn emotion'
But I don't want to include 'Channel Boleyn emotion'. How can I exclude the strings that contain 'emotion'?
Here is the URL: http://www.worldfootball.net/schedule/eng-premier-league-2015-2016-spieltag/37/
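import urllib2
from re import findall

# Download the page and decode it to unicode text (Python 2.7).
response = urllib2.urlopen("http://www.worldfootball.net/schedule/eng-premier-league-2015-2016-spieltag/37/")
html_bytes = response.read()
html = html_bytes.decode('utf-8')
# Raw string so the backslashes reach the regex engine unchanged.
ranking = findall(r'[e]="(\w* ?\w* ?\w*)', html)
# findall() returns a list, so print it directly instead of calling it.
print ranking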
The code with the pattern [e]="(\w* ?\w* ?\w*) isn't fully working yet (I'm a beginner), but for now I just want to get rid of 'Channel Boleyn emotion' so I can move on. Thanks.
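You need a negative lookbehind assertion, applied to each extracted string:
^\w+(?: \w+)*(?<!\bemotion)$
(?<!\bemotion)$ asserts that the string does not end with the word emotion.
Or: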
^\w+(?: \w+)*(?<!\semotion)$
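As a rough sketch of how a pattern like this might be used: because it is anchored with ^ and $, it is meant to be matched against each extracted name on its own, not against the whole HTML page. The short names list and the keep_name helper below are illustrative assumptions, not part of the original code. The print call is written with parentheses so the snippet runs on Python 2.7 and Python 3 alike.

```python
import re

# One or more space-separated words, where the text immediately
# before the end of the string is not the whole word "emotion".
PATTERN = re.compile(r'^\w+(?: \w+)*(?<!\bemotion)$')

def keep_name(name):
    """Hypothetical helper: True if name does not end with 'emotion'."""
    return PATTERN.match(name) is not None

# Assumed sample of strings that findall() has already extracted.
names = ['Leicester City', 'Arsenal FC', 'Channel Boleyn emotion']
filtered = [n for n in names if keep_name(n)]
print(filtered)
```

With the three sample names, only the two club names pass the filter.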
Or filter the list without a regular expression:
>>> s = [
'Leicester City',
'Tottenham Hotspur',
'Arsenal FC',
'Manchester City',
'Manchester United',
'Southampton FC',
'West Ham United',
'Liverpool FC',
'Chelsea FC',
'Stoke City',
'Swansea City',
'Everton FC',
'Watford FC',
'Crystal Palace',
'West Bromwich Albion',
'AFC Bournemouth',
'Sunderland AFC',
'Newcastle United',
'Norwich City',
'Aston Villa',
'Channel Boleyn emotion',
'Channel Boleyn emotion']
>>> [i for i in s if i.split()[-1] != 'emotion']
['Leicester City', 'Tottenham Hotspur', 'Arsenal FC', 'Manchester City', 'Manchester United', 'Southampton FC', 'West Ham United', 'Liverpool FC', 'Chelsea FC', 'Stoke City', 'Swansea City', 'Everton FC', 'Watford FC', 'Crystal Palace', 'West Bromwich Albion', 'AFC Bournemouth', 'Sunderland AFC', 'Newcastle United', 'Norwich City', 'Aston Villa']
You can use the pattern title="([^"]*)">\1</a>. The backreference \1 makes it match only links whose text is the same as their title.
>>> print findall(r'title="([^"]*)">\1</a>', html)
[u'Norwich City', u'Manchester United', u'AFC Bournemouth', u'West Bromwich Albion', u'Aston Villa', u'Newcastle United', u'Crystal Palace', u'Stoke City', u'Sunderland AFC', u'Chelsea FC', u'West Ham United', u'Swansea City', u'Leicester City', u'Everton FC', u'Tottenham Hotspur', u'Southampton FC', u'Liverpool FC', u'Watford FC', u'Manchester City', u'Arsenal FC', u'Leicester City', u'Leicester City', u'Tottenham Hotspur', u'Tottenham Hotspur', u'Arsenal FC', u'Arsenal FC', u'Manchester City', u'Manchester City', u'Manchester United', u'Manchester United', u'Southampton FC', u'Southampton FC', u'West Ham United', u'West Ham United', u'Liverpool FC', u'Liverpool FC', u'Chelsea FC', u'Chelsea FC', u'Stoke City', u'Stoke City', u'Swansea City', u'Swansea City', u'Everton FC', u'Everton FC', u'Watford FC', u'Watford FC', u'Crystal Palace', u'Crystal Palace', u'West Bromwich Albion', u'West Bromwich Albion', u'AFC Bournemouth', u'AFC Bournemouth', u'Sunderland AFC', u'Sunderland AFC', u'Newcastle United', u'Newcastle United', u'Norwich City', u'Norwich City', u'Aston Villa', u'Aston Villa']
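To illustrate matching only links whose text repeats their title, here is a minimal sketch run against a small made-up HTML fragment. The markup below is an assumption for demonstration and is not copied from the real page. It relies on the backreference \1, which must repeat exactly what group 1 captured, so a link whose content differs from its title attribute is skipped.

```python
from re import findall

# Hypothetical HTML fragment; the real page's markup may differ.
sample_html = (
    '<a href="/teams/arsenal-fc/" title="Arsenal FC">Arsenal FC</a>'
    '<a href="/teams/stoke-city/" title="Stoke City">Stoke City</a>'
    '<a href="/channel/" title="Channel Boleyn emotion"><img src="x.png"></a>'
)

# Group 1 captures the title; \1 requires the link text to be identical.
teams = findall(r'title="([^"]*)">\1</a>', sample_html)
print(teams)
```

In this sample the third link is not returned, because its content is an image tag and not a repeat of its title.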