How to show full results, rather than matched text from regex searches in Python
I am creating a script that searches files based on keywords. My output should be the entire observation, not just the matched text, but I found that .group doesn't work for this.
import re
import os
import pandas as pd

pers_info = pd.read_csv(r".....StateWorkforceMailingList_2-7-19a.csv", encoding='utf-8')
# pers_info['State'] holds Texas, Florida, etc.
files = os.listdir(r"....\State Files")
# files is a list of WORKFORCE_2017_ALABAMA_FILE.xlsx, ..., n
for name in files:
    matches = re.findall(pers_info.State[4], name.replace("_", " "), re.I)
    print(matches)
My expected output is WORKFORCE_2017_ALABAMA_FILE.xlsx
Instead, I get 'Alabama'
Should I try a Boolean mask?
I assume your Pers_info looks like this:
Pers_info = {"state": ["Texas", "Alabama", "Florida"], "somethingelse": "stuff"}
and your files look like this:
files = ["WORKFORCE_2017_ALABAMA_FILE.xlsx", "WORKFORCE_2017_TEXAS_FILE.xlsx", "SOMETHING.xlsx"]
(You don't need a regular expression.)
# Lower-case the state names once; compare against lower-cased file names.
states = [state.lower() for state in Pers_info['state']]
result = []
for name in files:
    # Append the original file name, not the lower-cased copy.
    if any(state in name.lower() for state in states):
        result.append(name)
print(result)
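Since the question asks about a Boolean mask, here is a minimal sketch of that idea without regex or pandas. The sample data and the names `state`, `mask`, and `selected` are assumptions made for illustration, not taken from the asker's files. With a pandas Series of file names, `Series.str.contains` would produce a similar mask.

```python
from itertools import compress

# Hypothetical sample data mirroring the lists above.
files = ["WORKFORCE_2017_ALABAMA_FILE.xlsx",
         "WORKFORCE_2017_TEXAS_FILE.xlsx",
         "SOMETHING.xlsx"]
state = "Alabama"

# One boolean per file: True where the state name occurs (case-insensitive).
mask = [state.lower() in name.lower() for name in files]

# Keep the original file names at the positions where the mask is True.
selected = list(compress(files, mask))
print(mask)      # [True, False, False]
print(selected)  # ['WORKFORCE_2017_ALABAMA_FILE.xlsx']
```

The mask keeps the test (does the name contain the state?) separate from the selection, so the full file name is what ends up in the output.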
Use
>>> import re
>>> import pandas as pd
>>> Pers_info = pd.DataFrame({'State':['Texas', 'Alabama', 'Florida']})
>>> Files = ['WORKFORCE_2017_ALABAMA_FILE.xlsx', 'WORKFORCE_2017_FILE.xlsx']
>>> pattern = re.compile(rf'(?<![^\W_])(?:{"|".join(Pers_info["State"].to_list())})(?![^\W_])', re.I)
>>> list(filter(pattern.search, Files))
['WORKFORCE_2017_ALABAMA_FILE.xlsx']
See the regex proof.
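The difference from the code in the question can be sketched as follows: `re.findall` returns the substrings that matched, while using `pattern.search` as a yes/no test keeps the string that was searched. The lists below are illustrative sample data, and `re.escape` is an extra precaution assumed here in case a name ever contains regex metacharacters.

```python
import re

# Illustrative sample data, not the asker's real files.
files = ["WORKFORCE_2017_ALABAMA_FILE.xlsx", "WORKFORCE_2017_FILE.xlsx"]
states = ["Texas", "Alabama", "Florida"]

# Build one alternation from the state names; escape each name first.
pattern = re.compile("|".join(map(re.escape, states)), re.I)

# findall returns the matched substrings, not the strings that were searched.
matched_text = [m for name in files for m in pattern.findall(name)]

# Using search as a predicate keeps the whole file name.
matched_files = [name for name in files if pattern.search(name)]

print(matched_text)   # ['ALABAMA']
print(matched_files)  # ['WORKFORCE_2017_ALABAMA_FILE.xlsx']
```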
Explanation
--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    [^\W_]                   any character except: non-word
                             characters (all but a-z, A-Z, 0-9, _),
                             '_'
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    Texas                    'Texas'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Alabama                  'Alabama'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Florida                  'Florida'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    [^\W_]                   any character except: non-word
                             characters (all but a-z, A-Z, 0-9, _),
                             '_'
--------------------------------------------------------------------------------
  )                        end of look-ahead
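To see why the lookarounds are used instead of `\b`, here is a small sketch. The file names are illustrative; the point is that `_` counts as a word character, so `\b` finds no boundary between `_` and a letter, whereas `[^\W_]` only rejects an adjacent letter or digit.

```python
import re

name = "WORKFORCE_2017_ALABAMA_FILE.xlsx"

# \b treats "_" as a word character, so there is no boundary between "_" and "A".
with_b = re.search(r"\bAlabama\b", name, re.I)

# The lookarounds only reject an adjacent letter or digit, so "_" is accepted.
with_lookarounds = re.search(r"(?<![^\W_])Alabama(?![^\W_])", name, re.I)

# A letter directly after the state name still blocks the match.
embedded = re.search(r"(?<![^\W_])Alabama(?![^\W_])", "ALABAMAS_FILE.xlsx", re.I)

print(with_b)                    # None
print(with_lookarounds.group())  # ALABAMA
print(embedded)                  # None
```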