Python string regex union returns a bunch of empty strings

I am trying to pass a joined list of strings to re.findall as a regular expression:

re.findall(regex, string)

But all I get back is a pair of tuples filled with empty strings.

re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
# [('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]

locations is a list like this:

['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', ...]

A manual test like this works:

print(re.findall('miami|zika', 'Zika Outbreak Hits Miami'.lower()))
# ['zika', 'miami']

But I cannot see what is wrong with joining the locations into one large regex. Could it be the size? locations contains 24588 elements.

I am currently building the locations list from the cities and countries provided by geonamescache:

import geonamescache

gc = geonamescache.GeonamesCache()
countries = [country["name"].lower() for country in list(gc.get_countries().values())]
cities    = [city["name"].lower() for city in list(gc.get_cities().values())]
locations =  countries + cities

The text I am working with looks like this:

Zika Outbreak Hits Miami
Could Zika Reach New York City?
First Case of Zika in Miami Beach
Mystery Virus Spreads in Recife, Brazil
Dallas man comes down with case of Zika

Check your locations list for empty strings or unusual location names.

For example, this works fine:

In [1]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba']

In [2]: import re

In [3]: re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
Out[3]: []

In [4]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[4]: ['switzerland']

This does not, because the list now has an empty location at the end:

In [5]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', '']

In [6]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[6]:
['switzerland',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
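
The check suggested above can be sketched roughly as follows. The helper name find_suspect_entries and the sample list are made up for illustration; they are not part of the original code. The helper flags entries that are empty (or whitespace-only) and entries that contain characters the regex engine treats specially.

```python
# Characters with special meaning in a regex pattern (outside verbose mode).
METACHARS = set(r'.^$*+?{}[]\|()')

def find_suspect_entries(names):
    """Return (empty entries, entries containing regex metacharacters)."""
    empty = [n for n in names if not n.strip()]
    special = [n for n in names if METACHARS & set(n)]
    return empty, special

# Hypothetical sample; in practice pass the full locations list.
sample = ['switzerland', '', 'halle (saale)', 'cuba']
empty, special = find_suspect_entries(sample)
print(empty)    # ['']
print(special)  # ['halle (saale)']
```

If either list comes back non-empty, joining the names with "|" as-is will probably not behave like a plain list of literal alternatives.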

Edit

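As expected, special characters in the location names were causing the problem. You can use the following code to find them; it is mainly the ( that interferes with the regex: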

In [21]: [l for l in locations if l.find('(') >= 0]
Out[21]:
['zürich (kreis 11) / seebach',
 'zürich (kreis 11) / oerlikon',
 'zürich (kreis 10) / höngg',
 'zürich (kreis 4) / aussersihl',
 'zürich (kreis 10) / wipkingen',
 'zürich (kreis 11) / affoltern',
 'zürich (kreis 2) / wollishofen',
 'zürich (kreis 3) / sihlfeld',
 'zürich (kreis 6) / unterstrass',
 'zürich (kreis 9) / albisrieden',
 'zürich (kreis 9) / altstetten',
 'stadt winterthur (kreis 1)',
 'zürich (kreis 12)',
 'seen (kreis 3)',
 'zürich (kreis 3)',
 'zürich (kreis 11)',
 'zürich (kreis 9)',
 'oberwinterthur (kreis 2)',
 'zürich (kreis 10)',
 'zürich (kreis 2)',
 'zürich (kreis 8)',
 'zürich (kreis 7)',
 'zürich (kreis 6)',
 'wetter (ruhr)',
 'schwedt (oder)',
 'kempten (allgäu)',
 'kelkheim (taunus)',
 'halle (saale)',
 'frankfurt (oder)',
 'brake (unterweser)',
 'v.s.k.valasai (dindigul-dist.)',
 'dainava (kaunas)',
 'miguel alemán (la doce)',
 'jardines de la silla (jardines)',
 'licenciado benito juárez (campo gobierno)',
 'ampliación san mateo (colonia solidaridad)',
 'kalibo (poblacion)',
 'city of milford (balance)',
 'butte-silver bow (balance)']
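
Why parentheses lead to tuples of empty strings: when a pattern contains capturing groups, re.findall returns the groups rather than the whole match, as one tuple per match when there is more than one group. An unescaped ( in a location name opens such a group, and a group that did not take part in the match shows up as ''. The following is a minimal sketch with a short made-up list, not the full geonamescache data:

```python
import re

text = 'zika outbreak hits miami'

# Hypothetical short list; two entries contain unescaped parentheses.
names = ['miami', 'halle (saale)', 'frankfurt (oder)']

# Each "(...)" becomes a capturing group, so findall returns tuples of groups.
raw = re.findall('|'.join(names), text)
print(raw)      # [('', '')]

# Escaping makes the parentheses literal, so findall returns the matches.
escaped = re.findall('|'.join(re.escape(n) for n in names), text)
print(escaped)  # ['miami']
```

The first call does match 'miami', but the match itself is hidden because only the two (empty) groups are reported.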

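Use re.escape when building the regex so that special characters are handled. You may also want to match whole words only; otherwise partial words will match, such as brea inside break.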

In [21]: locations_regex = re.compile(r'|'.join([re.escape(l) for l in sorted(locations, key=lambda x:-len(x))]))
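
For the whole-word matching mentioned above, one option is to wrap the escaped alternatives in a non-capturing group between \b anchors. This is only a sketch under assumptions: the short names list stands in for the full locations list, and \b behaves as expected only for names that start and end with a word character (it will not anchor after a trailing ")" as in 'halle (saale)').

```python
import re

# Hypothetical stand-in for the full locations list.
names = ['brea', 'miami', 'miami beach', 'new york city', 'york']

# Longest names first, so 'miami beach' is tried before 'miami'.
ordered = sorted(names, key=len, reverse=True)

# (?:...) groups the alternatives without capturing, so findall
# still returns whole matches; \b requires word boundaries on both sides.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(n) for n in ordered) + r')\b')

headlines = [
    'First Case of Zika in Miami Beach',
    'Could Zika Reach New York City?',
    'Outbreak news: a break in the case',
]
for line in headlines:
    print(pattern.findall(line.lower()))
# ['miami beach']
# ['new york city']
# []
```

Sorting longest-first matters because alternation takes the first alternative that matches at a given position, not the longest one.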