Fuzzy regex match on million rows Pandas df
I am trying to check for fuzzy matches between a string column and a reference list. The string series has over 1 million rows, and the reference list has over 10k entries.
For example:
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO'])  # 1 mil rows
ref_df['REF_NAMES'] = pd.Series(['XANDER', 'PARIS'])  # 10k rows

### Output should look like:
df['MATCH'] = pd.Series([np.nan, 'XANDER', 'MANDER', 'PARIS', 'HARIS', np.nan, 'PARIS', np.nan])
It should produce a match if the word appears on its own in the string (allowing at most 1 character substitution within it).
For example, 'PARIS' can match 'PARIS HILTON' and 'THE HARIS DOWNTOWN', but not 'APARISIAN'.
Similarly, 'XANDER' matches 'NOVA XANDER' and 'SALA MANDER' (MANDER is a 1-character difference from XANDER), but does not match 'ALEXANDERS'.
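A minimal illustration of this rule with the PyPI regex module's fuzzy syntax (which the current code below already uses); {s<=1:[A-Z]} means at most one substitution, restricted to uppercase letters:

import regex  # PyPI regex module, not the built-in re

print(regex.search(r'(?:XANDER){s<=1:[A-Z]}', 'SALA MANDER').group())        # MANDER
print(regex.search(r'(?:PARIS){s<=1:[A-Z]}', 'THE HARIS DOWNTOWN').group())  # HARIS
# without a word boundary, 'APARISIAN' would still match, since it contains 'PARIS'
print(regex.search(r'(?:PARIS){s<=1:[A-Z]}', 'APARISIAN').group())           # PARIS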
As of now we have written the logic for this (shared below), but the matching takes over 4 hours to run, and we need to bring it under 30 minutes.
Current code -
import regex

tags_regex = ref_df['REF_NAMES'].tolist()
tags_ptn_regex = '|'.join([rf'\s+{tag}\s+|^{tag}\s+|\s+{tag}$' for tag in tags_regex])

def search_it(partyname):
    m = regex.search("(" + tags_ptn_regex + ")" + "{s<=1:[A-Z]}", partyname)
    if m is not None:
        return m.group()
    else:
        return None

df['MATCH'] = df['NAMES'].apply(search_it)
Also, would multiprocessing help with the regex here? Many thanks!
Your pattern is inefficient because you repeat the tag pattern three times in the regex. You only need a single tag pattern, wrapped with the so-called whitespace boundaries, (?<!\S) and (?!\S).
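As a side note (not part of the original answer), these whitespace boundaries are stricter than the usual \b word boundary, which also treats punctuation as a boundary:

import regex

print(regex.search(r'\bPARIS\b', 'LE-PARIS-CLUB'))              # matches: '-' is a word boundary
print(regex.search(r'(?<!\S)PARIS(?!\S)', 'LE-PARIS-CLUB'))     # None: no whitespace around PARIS
print(regex.search(r'(?<!\S)PARIS(?!\S)', 'IN PARIS').group())  # PARIS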
Next, if you have thousands of alternatives, even a single tag pattern regex will be very slow, because alternatives may compete for a match at the same position in the string, and there will be too much backtracking. To reduce backtracking, you need a regex trie.
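To see what the trie buys you (illustrative words, using the Trie class from the snippet below): shared prefixes are collapsed into a single branch, so the engine never re-tests the same leading characters for each alternative:

trie = Trie()
for w in ['MASTER', 'MASTERS', 'MISTER']:
    trie.add(w)
print(trie.pattern())  # M(?:ASTERS?|ISTER)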
Here is a working snippet:
import regex
import pandas as pd

## Class to build a regex trie
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1  # empty key marks the end of a word

    def dump(self):
        return self.data

    def quote(self, char):
        return regex.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None
        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:
                    # recurse is None: this branch ends in a single final character
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0
        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')
        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"
        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

## Start of main code
df = pd.DataFrame()
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO'])  # 1 mil rows
ref_df = pd.DataFrame()
ref_df['REF_NAMES'] = pd.Series(['XANDER', 'PARIS'])  # 10k rows

## Build one trie out of all reference names
trie = Trie()
for word in ref_df['REF_NAMES'].tolist():
    trie.add(word)

## One pattern: whitespace boundaries, at most one [A-Z] substitution
tags_ptn_regex = regex.compile(r"(?:(?<!\S)(?:{})(?!\S)){{s<=1:[A-Z]}}".format(trie.pattern()), regex.IGNORECASE)

def search_it(partyname):
    m = tags_ptn_regex.search(partyname)
    if m is not None:
        return m.group()
    else:
        return None

df['MATCH'] = df['NAMES'].apply(search_it)
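On the sample frame this yields df['MATCH'] as [None, 'XANDER', 'MANDER', 'PARIS', 'HARIS', None, 'PARIS', None], i.e. the expected output above.

As for the multiprocessing question: it can help, since the rows are independent. Below is a minimal sketch (not from the original answer) that splits the NAMES column into chunks and applies the precompiled pattern in worker processes; the chunk count of 4 and the match_chunk helper are illustrative choices, and on spawn-based platforms (e.g. Windows) the module-level setup must be importable so workers can rebuild the pattern:

import numpy as np
from multiprocessing import Pool

def match_chunk(chunk):
    # each worker runs the same precompiled trie pattern over its slice
    return chunk.apply(search_it)

if __name__ == '__main__':
    with Pool(4) as pool:
        parts = np.array_split(df['NAMES'], 4)  # 4 roughly equal chunks
        df['MATCH'] = pd.concat(pool.map(match_chunk, parts))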