Python - Replacing words from list in DataFrame with Regex pattern
I have the following list and DataFrame:
import numpy as np
import pandas as pd

mylist = ['foo', 'bar', 'baz']
df = pd.DataFrame({'Col1': ['fooThese', 'barWords', 'baz are', 'FOO: not', 'bAr:- needed'],
                   'Col2': ['Baz:Neither', 'Foo Are', 'barThese', np.nan, 'but this is fine']})
I want to remove the strings in mylist wherever they occur in the DataFrame.
I can replace some of them with the following regex pattern:
pat = '|'.join([r'\b{}'.format(w) for w in mylist])
df2 = df.replace(pat, '', regex=True)
However, this does not replace all instances. My desired output is as follows:
     Col1              Col2
0   These           Neither
1   Words               Are
2     are             These
3     not               NaN
4  needed  but this is fine
You have to use the (?i) regex flag, which makes your replacements case-insensitive, and also remove the special characters:
mydict = {f'(?i){word}': '' for word in mylist}
df2 = df.replace(mydict, regex=True).replace('[:-]', '', regex=True)
     Col1              Col2
0   These           Neither
1   Words               Are
2     are             These
3     not               NaN
4  needed  but this is fine
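To see why the flag matters, here is a minimal sketch using only Python's re module on three of the sample strings. The names pat_question, pat_ci, results_cs and results_ci are made up for this illustration; the first pattern is built the same way as in the question.

```python
import re

mylist = ['foo', 'bar', 'baz']
samples = ['fooThese', 'FOO: not', 'bAr:- needed']

# Pattern from the question: case-sensitive, anchored at a word boundary
pat_question = '|'.join(r'\b{}'.format(w) for w in mylist)
# Same words with the inline (?i) flag, so upper/lower case is ignored
pat_ci = '(?i)' + '|'.join(mylist)

results_cs = [re.sub(pat_question, '', s) for s in samples]
results_ci = [re.sub(pat_ci, '', s) for s in samples]

print(results_cs)  # ['These', 'FOO: not', 'bAr:- needed']
print(results_ci)  # ['These', ': not', ':- needed']
```

The case-sensitive pattern leaves 'FOO' and 'bAr' untouched, while the (?i) version removes them but still leaves ':' and '-' behind, which is why the special characters need their own replacement.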
Or you can add the special characters to your dictionary, so you don't have to call DataFrame.replace twice:
mydict = {f'(?i){word}': '' for word in mylist}
# Add the special-character pattern to the same dictionary
mydict['[:-]'] = ''
df2 = df.replace(mydict, regex=True)
     Col1              Col2
0   These           Neither
1   Words               Are
2     are             These
3     not               NaN
4  needed  but this is fine
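As a possible variation (not part of the answer above), the words and the trailing special characters could be folded into one pattern. This sketch assumes that only ':' and '-' directly following a matched word should be removed, which is narrower than removing them everywhere, and it uses re.escape in case a word ever contains regex metacharacters. The names words and pat are illustrative.

```python
import re

import numpy as np
import pandas as pd

mylist = ['foo', 'bar', 'baz']
df = pd.DataFrame({'Col1': ['fooThese', 'barWords', 'baz are', 'FOO: not', 'bAr:- needed'],
                   'Col2': ['Baz:Neither', 'Foo Are', 'barThese', np.nan, 'but this is fine']})

# One case-insensitive alternation of the (escaped) words
words = '|'.join(re.escape(w) for w in mylist)
# Assumption: strip ':' and '-' only when they immediately follow a matched word
pat = rf'(?i)(?:{words})[:\-]*'

df2 = df.replace(pat, '', regex=True)
print(df2)
```

Note that a leading space remains where the word was followed by one (for example 'baz are' becomes ' are'); pandas right-aligns the printed columns, so this is easy to miss.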
Another solution
Using the pandas Series str.replace() method:
import numpy as np
import pandas as pd

mylist = ['foo', 'bar', 'baz']
df = pd.DataFrame({'Col1': ['fooThese', 'barWords', 'baz are', 'FOO: not', 'bAr:- needed'],
                   'Col2': ['Baz:Neither', 'Foo Are', 'barThese', np.nan, 'but this is fine']})


def replace_str_in_df_with_list(df, words, subst_string):
    """Replace strings in a DataFrame based on a list of strings.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame on which to perform the substitution.
    words : list
        The list of strings to use for the substitution.
    subst_string : str
        The substitution string.

    Returns
    -------
    df_new : pd.DataFrame
        A new DataFrame with strings replaced.
    """
    df_new = df.copy()
    subst_string = str(subst_string)
    # Iterate over each column as a pd.Series so that .str.replace() is available
    for c in df_new:
        # Iterate over the elements of the list
        for elem in words:
            df_new[c] = df_new[c].str.replace(elem, subst_string, case=False)
    return df_new


df2 = replace_str_in_df_with_list(df, mylist, '')
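Unfortunately, this method is not available on DataFrame (not yet?).
The solution provided here is not perfect, but it does not require modifying the input list before applying the function.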
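Since str.replace() only exists on a Series, one way to avoid the nested loop is to let DataFrame.apply hand each column to it. This is only a sketch and assumes every column holds strings (or NaN); a numeric column would make the .str accessor raise an error. The name pattern is illustrative. Like the function above, it does not remove ':' or '-'.

```python
import re

import numpy as np
import pandas as pd

mylist = ['foo', 'bar', 'baz']
df = pd.DataFrame({'Col1': ['fooThese', 'barWords', 'baz are', 'FOO: not', 'bAr:- needed'],
                   'Col2': ['Baz:Neither', 'Foo Are', 'barThese', np.nan, 'but this is fine']})

# Single alternation of the escaped words, matched case-insensitively
pattern = '|'.join(re.escape(w) for w in mylist)

# apply() passes each column as a Series, where .str.replace() is available
df2 = df.apply(
    lambda col: col.str.replace(pattern, '', flags=re.IGNORECASE, regex=True)
)
print(df2)
```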
More help:
https://pandas.pydata.org/pandas-docs/stable/search.html?q=replace