Python Pandas removing substring using another column
I've searched around but couldn't find a simple way to do this, so I'm hoping your expertise can help.
I have a pandas DataFrame with two columns:
import numpy as np
import pandas as pd
pd.options.display.width = 1000
testing = pd.DataFrame({'NAME':[
'FIRST', np.nan, 'NAME2', 'NAME3',
'NAME4', 'NAME5', 'NAME6'], 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']})
which gives me
          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6
What I want to do is take the value from the 'NAME' column and remove it from the 'FULL_NAME' column, if it is present. So the function would return
          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME
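So far I have defined the function below and am using the apply method. This runs rather slowly on my large dataset, though, and I'm hoping there is a more efficient way to do it. Thanks!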
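import re

def address_remove(x):
    # x is one row of the DataFrame (used with apply(..., axis=1))
    try:
        # Treat NAME as a regex pattern and delete it from FULL_NAME
        newADDR1 = re.sub(x['NAME'], '', x['FULL_NAME'])
        # Trim the whitespace left behind at either end
        return newADDR1.strip()
    except TypeError:
        # NaN in either column is a float, so re.sub raises TypeError
        return x['FULL_NAME']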
Here is a solution that is a lot faster than your current one, though I'm not convinced there isn't something faster still
In [13]: import numpy as np
import pandas as pd
n = 1000
testing = pd.DataFrame({'NAME':[
'FIRST', np.nan, 'NAME2', 'NAME3',
'NAME4', 'NAME5', 'NAME6']*n, 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']*n})
It's a rather long one-liner, but it should do what you need
The fastest solution I can think of is to use replace, as mentioned in another answer:
In [37]: %timeit testing ['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop
Original answer:
In [14]: %timeit testing ['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop
Compared with your current solution:
In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1)
10 loops, best of 3: 166 ms per loop
These give you the same answers as your current solution
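One caveat worth checking on your own data: the list comprehensions above cast NaN to the string 'nan' through astype('str') and do not strip whitespace, so row 0 comes out as ' LAST' and row 1 as an empty string. A minimal sketch that keeps NaN and strips the result could look like the following. The helper name remove_name is hypothetical (it is not part of pandas or of the answers here), and I have not timed it against the versions above.

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})


def remove_name(full_name, name):
    """Hypothetical helper: delete `name` from `full_name` as literal text."""
    if not isinstance(full_name, str):
        # NaN (a float) in FULL_NAME is passed through unchanged
        return full_name
    if not isinstance(name, str):
        # Nothing to remove when NAME is NaN
        return full_name.strip()
    # str.replace is a literal replacement, no regex involved;
    # strip() only trims the ends, interior spaces are left as they are
    return full_name.replace(name, '').strip()


testing['NEW'] = [remove_name(e, k)
                  for e, k in zip(testing['FULL_NAME'], testing['NAME'])]
print(testing)
```

Note that removing a name from the middle of a string (row 4) leaves two spaces between the remaining words, just as the rstrip/lstrip version in the question does.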
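I think you want to use the replace() method that strings have; it is far faster than using a regular expression (roughly 20 times in a quick check I just ran in IPython):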
%timeit mystr.replace("ello", "")
The slowest run took 7.64 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 250 ns per loop
%timeit re.sub("ello","", "e")
The slowest run took 21.03 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 4.7 µs per loop
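If you need further speed-ups after that, you should look into numpy's vectorized functions (but I think the speed-up from using replace instead of a regex should already be considerable).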
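Apart from speed, the two approaches also differ in behavior: re.sub treats the value being removed as a pattern, while str.replace treats it as literal text. A small sketch of that difference, using made-up sample values (full_name and name here are assumptions for illustration, not data from the question):

```python
import re

# Made-up sample values containing regex metacharacters
full_name = 'A.B SMITH (JR)'
name = '(JR)'

# Literal removal: the parentheses are just characters
literal = full_name.replace(name, '').strip()

# Regex removal without escaping: '(JR)' is a capture group matching 'JR'
unescaped = re.sub(name, '', full_name).strip()

# Regex removal with re.escape behaves like the literal version
escaped = re.sub(re.escape(name), '', full_name).strip()

print(repr(literal))    # 'A.B SMITH'
print(repr(unescaped))  # 'A.B SMITH ()'
print(repr(escaped))    # 'A.B SMITH'
```

So if the NAME column can contain characters such as '(', '.', or '+', either str.replace or re.escape is probably what you want.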
You could use the replace method with the regex argument, then use str.strip:
In [605]: testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
Out[605]:
0 LAST
1 NaN
2 FIRST LAST
3 FIRST
4 FIRST LAST
5 ANOTHER NAME
6 LAST NAME
Name: FULL_NAME, dtype: object
Note: you need to filter testing.NAME with notnull, because without it NaN values would also be replaced with an empty string
In benchmarks this is slower than the fastest solution, from @johnchase, but I think it is more readable and uses only pandas methods of DataFrame and Series:
In [607]: %timeit testing['NEW'] = testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
100 loops, best of 3: 4.56 ms per loop
In [661]: %timeit testing ['NEW'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
1000 loops, best of 3: 450 µs per loop