Python: Split a column that has both metric and imperial units
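I have a column that mixes several units, and I need to split it into two columns: one for metric (mm, cm, m) and one for imperial (in, ft, yd).
import pandas as pd
d = {'col1': ['1 in', '2 mm', '3 ft', '4 yd', '5 m', '6cm']}
df = pd.DataFrame(data=d)
I want to split it into: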
Index | df_metric | df_imperial
---------------------------------
0     |           | 1 in
---------------------------------
1     | 2 mm      |
---------------------------------
2     |           | 3 ft
---------------------------------
3     |           | 4 yd
---------------------------------
4     | 5 m       |
---------------------------------
5     | 6 cm      |
I tried:
def seperate_units(df, col, numbers):
    if numbers.find('yd') > -1 or numbers.find('in') > -1 or numbers.find('ft') > -1 or numbers.find('"') > -1:
        print(numbers)
        df[col+'_imperial'].append(numbers)
    else:
        df[col+'_imperial'].append('')
    return df[col+'_imperial']
But I can't get it to work.
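Assuming this input DataFrame:
df = pd.DataFrame({'col1': ['1 in', '2 mm', '3 ft', '4 yd', '5 m', '6cm']})
You can use a regex to identify the metric values and split on that mask:
metric = df.col1.str.match(r'\d+\s*[cm]?m')
pd.concat([df.where(metric, '').add_suffix('_metric'),
           df.where(~metric, '').add_suffix('_imperial')],
          axis=1)
Here the regex matches a number followed by m, cm, or mm; adapt it to your actual use case.
In where, I replace the non-matching values with the empty string '', but you can drop that argument to get NaN, or substitute anything else you need.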
Output:
  col1_metric col1_imperial
0                      1 in
1        2 mm
2                      3 ft
3                      4 yd
4         5 m
5         6cm
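If the unit itself matters, one possible variation on the regex approach above is to pull the unit out with str.extract and classify it against explicit sets, so that anything unrecognised lands in neither column. This is only a sketch and not part of the original answer: the names METRIC, IMPERIAL, and split_units are made up for illustration, and it assumes every value is a number followed by a single unit.

```python
import pandas as pd

# Hypothetical unit sets (assumption); extend them to match your data.
METRIC = {"mm", "cm", "m"}
IMPERIAL = {"in", "ft", "yd"}


def split_units(df, col):
    """Return a two-column frame splitting `col` into metric and imperial values."""
    # Capture the trailing run of letters as the unit, e.g. '6cm' -> 'cm'.
    # Values that do not fit the pattern yield NaN and match neither set.
    unit = df[col].str.extract(r"^\s*\d+(?:\.\d+)?\s*([A-Za-z]+)\s*$", expand=False)
    is_metric = unit.isin(METRIC)
    is_imperial = unit.isin(IMPERIAL)
    return pd.DataFrame({
        col + "_metric": df[col].where(is_metric, ""),
        col + "_imperial": df[col].where(is_imperial, ""),
    })


df = pd.DataFrame({"col1": ["1 in", "2 mm", "3 ft", "4 yd", "5 m", "6cm"]})
result = split_units(df, "col1")
print(result)
```

Unlike inverting a single metric mask, this leaves a value blank in both columns when its unit is in neither set, which may or may not be what you want.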
Use:
# check for metric units
m = df['col1'].str.contains(r'mm|cm|m')
# check for imperial units
y = df['col1'].str.contains(r'in|ft|yd')
Finally:
df.loc[:, 'df_metric'] = df.loc[m, 'col1']
df.loc[:, 'df_imperial'] = df.loc[y, 'col1']
# If needed, replace NaN with empty strings:
# df[['df_metric', 'df_imperial']] = df[['df_metric', 'df_imperial']].fillna('')
Now if you print df, you will get the expected output.
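One caveat with a pattern like mm|cm|m: the bare m alternative matches any value that contains the letter m anywhere. That is harmless for the sample data, but it could misclassify other units. The sketch below compares it with a stricter pattern; the '7 mi' value is a hypothetical addition to the sample, included only to show the difference.

```python
import pandas as pd

# '7 mi' (miles) is a hypothetical extra value, not from the original question.
s = pd.Series(["1 in", "2 mm", "5 m", "6cm", "7 mi"])

# Loose: True for anything containing the letter 'm'.
loose = s.str.contains(r"mm|cm|m")

# Stricter: a digit, optional whitespace, then exactly one of the
# metric units ending at a word boundary.
strict = s.str.contains(r"\d\s*(?:mm|cm|m)\b")

print(loose.tolist())   # [False, True, True, True, True]
print(strict.tolist())  # [False, True, True, True, False]
```

The non-capturing group (?:...) keeps str.contains from warning about match groups in the pattern.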
Try pandas.Series.str.contains:
d = {'col1': ['1 in', '2 mm', '3 ft', '4 yd', '5 m', '6cm']}
df = pd.DataFrame(data=d)
df['metric'] = df[df['col1'].str.contains(r'mm|cm|m')]['col1']
df['imperial'] = df[df['col1'].str.contains(r'in|ft|yd')]['col1']
print(df)
#    col1 metric imperial
# 0  1 in    NaN     1 in
# 1  2 mm   2 mm      NaN
# 2  3 ft    NaN     3 ft
# 3  4 yd    NaN     4 yd
# 4   5 m    5 m      NaN
# 5   6cm    6cm      NaN