Python Pandas Regex: Search for strings with a wildcard in a column and return matches
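I have a list of search keys in one column, which may include a key such as 'keyword1*keyword2', and I am trying to find matches in a column of a separate dataframe. How can I include a regex wildcard such as 'keyword1.*keyword2' when using str.extract, extractall or findall?
.str.extract works well for exact substrings, but I also need it to match substrings with a wildcard between the keywords.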
import re
import pandas as pd
# dataframe column or series list as keys to search for:
dfKeys = pd.DataFrame()
dfKeys['SearchFor'] = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One']
# column next to the SearchFor column
dfKeys['AdjacentCol'] = ['this other string', 'SomeString Else', 'Second String Player', 'Keyword1 Keyword2', 'More String Stuff', 'One More String Example']
# dataframe column to search in:
df1 = pd.DataFrame()
df1['Description'] = ['Something Here', 'Second Item 7', 'Something There', 'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END', 'Second Item 7', 'Even More Stuff']
# I've tried:
df1['Matched'] = df1['Description'].str.extract('(%s)' % '|'.join(dfKeys['SearchFor']), flags=re.IGNORECASE, expand=False)
I also tried replacing 'extract' in the code above with 'extractall' and 'findall', but it still does not give me the results I need.
I expected 'Keyword1*Keyword2' to match "strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END".
UPDATE: '.*' worked!
I am also trying to add the value from the cell next to the matched key in the 'SearchFor' column, i.e. dfKeys['AdjacentCol'].
I tried:
df1['From_AdjacentCol'] = df1['Description'].str.extract('(%s)' % '|'.join(dfKeys['SearchFor']), flags=re.IGNORECASE, expand=False).map(dfKeys.set_index('SearchFor')['AdjacentCol'].to_dict()).fillna('')
It works for everything except the keys with wildcards.
# expected:
   Description                                       Matched              From_AdjacentCol
0  'Something Here'                                  'Something'          'this other string'
1  'Second Item 7'                                   'Second'             'Second String Player'
2  'Something There'                                 'Something'          'this other string'
3  'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2...'  'Keyword1*Keyword2'  'Keyword1 Keyword2'
4  'Second Item 7'                                   'Second'             'Second String Player'
5  'Even More Stuff'                                 'Stuff'              'More String Stuff'
Any help with this is much appreciated. Thanks!
Solution
You are close to the solution: just change * to .*. From the docs:
. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
In a regular expression, the asterisk * has no meaning on its own. It does not mean the same thing as the usual glob operator * in Unix/Windows file systems.
The asterisk is a quantifier (specifically, a greedy quantifier), so it must be attached to some pattern (here ., which matches any character) to be meaningful.
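To make the difference concrete, here is a minimal sketch using only the standard re module on the sample description from the question. The helper glob_to_regex is hypothetical (it is not part of re or pandas) and assumes that * is the only glob wildcard appearing in the keys.

```python
import re

text = "strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END"

# 'Keyword1*Keyword2': the * applies only to the preceding '1', so the
# pattern means "Keyword", zero or more '1', then "Keyword2".
glob_style = re.search(r"Keyword1*Keyword2", text, flags=re.IGNORECASE)

# 'Keyword1.*Keyword2': '.' matches any character and '*' repeats it.
regex_style = re.search(r"Keyword1.*Keyword2", text, flags=re.IGNORECASE)


def glob_to_regex(key):
    """Hypothetical helper: treat '*' in a key as a glob wildcard.

    Every regex metacharacter is escaped first, then the escaped
    asterisk is turned back into '.*'.
    """
    return re.escape(key).replace(r"\*", ".*")


converted = glob_to_regex("Keyword1*Keyword2")

print(glob_style)            # no match object
print(regex_style.group(0))  # the span from KEYWORD1 to KEYWORD2
print(converted)
```

If the keys are stored glob-style, a conversion like this could be applied to each key before joining them with |; keys that are already regular expressions should be left untouched.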
MCVE
Reworking your MCVE:
import re
import pandas as pd
keys = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One' ]
df1 = pd.DataFrame()
df1['Description'] = ['Something Here','Second Item 7', 'Something There',
'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END',
'Second Item 7', 'Even More Stuff']
regstr = '(%s)' % '|'.join(keys)
df1['Matched'] = df1['Description'].str.extract(regstr, flags=re.IGNORECASE, expand=False)
The regular expression is now:
(this|Something|Second|Keyword1.*Keyword2|Stuff|One)
and it now matches the case that was missing:
   Description                                         Matched
0  Something Here                                      Something
1  Second Item 7                                       Second
2  Something There                                     Something
3  strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 an...   KEYWORD1 moreJARGON 06/0 010 KEYWORD2
4  Second Item 7                                       Second
5  Even More Stuff                                     Stuff
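The second part of the question (looking up AdjacentCol) fails for wildcard keys because the extracted text, 'KEYWORD1 moreJARGON 06/0 010 KEYWORD2', is not itself a key in the lookup dictionary. One possible workaround, sketched below, is to test each key as its own pattern and record which key matched. The function lookup_adjacent and the column name MatchedKey are made up for this illustration. Note that this picks the first key in list order that matches anywhere in the description, which is not always the same choice str.extract makes with an alternation.

```python
import re
import pandas as pd

dfKeys = pd.DataFrame({
    "SearchFor": ["this", "Something", "Second",
                  "Keyword1.*Keyword2", "Stuff", "One"],
    "AdjacentCol": ["this other string", "SomeString Else",
                    "Second String Player", "Keyword1 Keyword2",
                    "More String Stuff", "One More String Example"],
})

df1 = pd.DataFrame({
    "Description": ["Something Here", "Second Item 7", "Something There",
                    "strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END",
                    "Second Item 7", "Even More Stuff"],
})


def lookup_adjacent(description):
    """Return (key, adjacent value) for the first key whose pattern matches."""
    for key, adjacent in zip(dfKeys["SearchFor"], dfKeys["AdjacentCol"]):
        if re.search(key, description, flags=re.IGNORECASE):
            return key, adjacent
    return "", ""  # no key matched this description


pairs = [lookup_adjacent(d) for d in df1["Description"]]
df1["MatchedKey"] = [key for key, _ in pairs]
df1["From_AdjacentCol"] = [adjacent for _, adjacent in pairs]

print(df1)
```

With the sample data, the wildcard row gets 'Keyword1 Keyword2' in From_AdjacentCol. This loops over every key for every row, so for large key lists a vectorized approach may be preferable.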