Why do I get different output when I change the order of 'years' and 'year' in my code?

Question

All I did was swap the positions of 'year' and 'years' between the first and second lines.

Here is the original column:

10+ years    653
< 1 year     249
2 years      243
3 years      235
5 years      202
4 years      191
1 year       177
6 years      163
7 years      127
8 years      108
9 years       72
.              2
Name: Employment.Length, dtype: int64

First example ('years' on the first line, 'year' on the second)

raw_data['Employment.Length'] = raw_data['Employment.Length'].str.replace('years',' ')
raw_data['Employment.Length'] = raw_data['Employment.Length'].str.replace('year',' ')
raw_data['Employment.Length'] = np.where(raw_data['Employment.Length'].str[:2]=='10',10,raw_data['Employment.Length'])
raw_data['Employment.Length'] = np.where(raw_data['Employment.Length'].str[0]=='<',0,raw_data['Employment.Length'])
raw_data['Employment.Length'] = pd.to_numeric(raw_data['Employment.Length'], errors = 'coerce')

Output

10.0    653
0.0     249
2.0     243
3.0     235
5.0     202
4.0     191
1.0     177
6.0     163
7.0     127
8.0     108
9.0      72
Name: Employment.Length, dtype: int64

Second example ('year' on the first line, 'years' on the second)

raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('year',' ')
raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('years',' ')
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[:2]=='10',10, raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[0]=='<',0,raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = pd.to_numeric(raw_data_copy['Employment.Length'], errors = 'coerce')

Output

10.0    653
0.0     249
1.0     177
Name: Employment.Length, dtype: int64

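One more thing: when I comment out the second line containing 'year', I get the same output as in the first example. When I comment out the second line containing 'years', I get the same output as in the second example.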

Third example

raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('years',' ')
#raw_data_copy['Employment.Length'] = raw_data_copy['Employment.Length'].str.replace('years',' ')
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[:2]=='10',10, raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = np.where(raw_data_copy['Employment.Length'].str[0]=='<',0,raw_data_copy['Employment.Length'])
raw_data_copy['Employment.Length'] = pd.to_numeric(raw_data_copy['Employment.Length'], errors = 'coerce')

Output

10.0    653
0.0     249
2.0     243
3.0     235
5.0     202
4.0     191
6.0     163
7.0     127
8.0     108
9.0      72
Name: Employment.Length, dtype: int64

Answer 1

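If you first replace 'year' with ' ', then 'years' becomes ' s', and that leftover 's' is no longer matched by your subsequent replace of 'years'. The string ' s' cannot be parsed as a number, so pd.to_numeric with errors='coerce' turns those rows into NaN and they drop out of the counts.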
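
The order dependence can be reproduced on a tiny Series. This is only a minimal sketch: the two-element Series and the variable names are made up for illustration, and `regex=False` is passed explicitly so the literal-string behaviour does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical toy data standing in for the Employment.Length column.
s = pd.Series(['1 year', '2 years'])

# 'year' first: the trailing 's' of 'years' survives the first pass,
# so the second pass finds no 'years' left to replace.
year_first = (s.str.replace('year', ' ', regex=False)
               .str.replace('years', ' ', regex=False))

# 'years' first: the longer token is removed before the shorter one.
years_first = (s.str.replace('years', ' ', regex=False)
                .str.replace('year', ' ', regex=False))

print(year_first.tolist())   # ['1  ', '2  s']
print(years_first.tolist())  # ['1  ', '2  ']

# The leftover 's' cannot be parsed, so that row is coerced to NaN.
converted = pd.to_numeric(year_first, errors='coerce')
print(converted)
```

Only the row that originally contained 'years' is lost, which matches the second example in the question, where every 'N years' row disappears from the counts.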

Instead of chaining several replaces, use a single regular-expression replace with an optional s: 'year[s]?'

import pandas as pd
s = pd.Series(['year', 'years', 'foo'])

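# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
s.str.replace('year[s]?', ' ', regex=True)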
#0       
#1       
#2    foo
#dtype: object
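
As an alternative to replacing text at all, the digits can be extracted directly. The following is a sketch under assumptions, not the method used in the question or the answer above: the sample values are a hypothetical subset of the column, and the variable names `lengths` and `years` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of values in the question's column.
lengths = pd.Series(['10+ years', '< 1 year', '2 years', '1 year', '.'])

# Pull out the first run of digits; rows without digits (such as '.') become NaN.
years = pd.to_numeric(lengths.str.extract(r'(\d+)', expand=False),
                      errors='coerce')

# Treat '< 1 year' as 0, as the question's code does.
years = years.mask(lengths.str.startswith('<'), 0)

print(years.tolist())  # [10.0, 0.0, 2.0, 1.0, nan]
```

Because nothing is replaced, there is no ordering between 'year' and 'years' to get wrong.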

python-3.x

pandas

data-preprocessing