How to clean this data
From this:
+------+------+--------------------------+-----------------+
| code | type | name | final_component |
+------+------+--------------------------+-----------------+
| C001 | ACT | Exhaust Blower Drive | |
| C001 | AL | | |
| C001 | AL | | |
| C001 | SET | Exhaust Blower Drive | |
| C001 | AL | | |
| C001 | AL | | |
| C001 | AL | | |
| C002 | ACT | Spray Pump Motor 1 Pump | |
| C002 | SET | Spray Pump Motor 1 Pump | |
| C003 | ACT | Spray Pump Motor 2 Pump | |
| C003 | SET | Spray Pump Motor 2 Pump | |
| C004 | ACT | Spray Pump Motor 3 Pump | |
| C004 | SET | Spray Pump Motor 3 Pump | |
+------+------+--------------------------+-----------------+
Expected:
+------+------+--------------------------+--------------------------+
| code | type | name | final_component |
+------+------+--------------------------+--------------------------+
| C001 | ACT | Exhaust Blower Drive | Exhaust Blower Drive |
| C001 | AL | | Exhaust Blower Drive |
| C001 | AL | | Exhaust Blower Drive |
| C001 | SET | Exhaust Blower Drive | Exhaust Blower Drive |
| C001 | AL | | Exhaust Blower Drive |
| C001 | AL | | Exhaust Blower Drive |
| C001 | AL | | Exhaust Blower Drive |
| C002 | ACT | Spray Pump Motor 1 Pump | Spray Pump Motor 1 Pump |
| C002 | SET | Spray Pump Motor 1 Pump | Spray Pump Motor 1 Pump |
| C003 | ACT | Spray Pump Motor 2 Pump | Spray Pump Motor 2 Pump |
| C003 | SET | Spray Pump Motor 2 Pump | Spray Pump Motor 2 Pump |
| C004 | ACT | Spray Pump Motor 3 Pump | Spray Pump Motor 3 Pump |
| C004 | SET | Spray Pump Motor 3 Pump | Spray Pump Motor 3 Pump |
+------+------+--------------------------+--------------------------+
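For all rows that share the same code, I have to copy the name value of the row whose type is 'SET' into final_component.
For example, for C001 the name of the 'SET' row is Exhaust Blower Drive, so for every C001 row I have to copy that value into final_component.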
for ind in dataframe.index:
    if dataframe['final_component'][ind] != None:
        temp = dataframe['final_component'][ind]
        temp_code = dataframe['code'][ind]
        i = ind
        while dataframe['code'][i] == temp_code:
            dataframe['final_component'][ind] = temp
            i += 1
This is what I could come up with, but it gets stuck in the while loop.
Here is one way to do it. First, re-create the data frame:
from io import StringIO
import pandas as pd
data = '''| code | type | name | final_component |
| C001 | ACT | Exhaust Blower Drive | |
| C001 | AL | | |
| C001 | AL | | |
| C001 | SET | Exhaust Blower Drive | |
| C001 | AL | | |
| C001 | AL | | |
| C001 | AL | | |
| C002 | ACT | Spray Pump Motor 1 Pump | |
| C002 | SET | Spray Pump Motor 1 Pump | |
| C003 | ACT | Spray Pump Motor 2 Pump | |
| C003 | SET | Spray Pump Motor 2 Pump | |
| C004 | ACT | Spray Pump Motor 3 Pump | |
| C004 | SET | Spray Pump Motor 3 Pump | |
'''
df = pd.read_csv(StringIO(data), sep='|')
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 5'])
Now, strip the leading and trailing whitespace:
# remove leading / trailing spaces
df.columns = [c.strip() for c in df.columns]
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.strip()
Then populate final_component:
# populate 'final component'
df['final_component'] = df['name']
Now replace the empty strings with None and use ffill():
# find final component that is empty string...
mask = df['final_component'] == ''
# ... and convert to None...
df.loc[mask, 'final_component'] = None
# ...so we can use ffill()
df['final_component'] = df['final_component'].ffill()
print(df)
code type name final_component
0 C001 ACT Exhaust Blower Drive Exhaust Blower Drive
1 C001 AL Exhaust Blower Drive
2 C001 AL Exhaust Blower Drive
3 C001 SET Exhaust Blower Drive Exhaust Blower Drive
4 C001 AL Exhaust Blower Drive
5 C001 AL Exhaust Blower Drive
6 C001 AL Exhaust Blower Drive
7 C002 ACT Spray Pump Motor 1 Pump Spray Pump Motor 1 Pump
8 C002 SET Spray Pump Motor 1 Pump Spray Pump Motor 1 Pump
9 C003 ACT Spray Pump Motor 2 Pump Spray Pump Motor 2 Pump
10 C003 SET Spray Pump Motor 2 Pump Spray Pump Motor 2 Pump
11 C004 ACT Spray Pump Motor 3 Pump Spray Pump Motor 3 Pump
12 C004 SET Spray Pump Motor 3 Pump Spray Pump Motor 3 Pump
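The forward fill above works here because every code group happens to start with a named row. As a rough sketch that follows the wording of the question more literally (take the name from the 'SET' row of each code), something like the following could be used. It assumes each code has at most one distinct 'SET' name, and `set_names` is just an illustrative variable name, not something from the original post:

```python
import pandas as pd

# Small sample in the same shape as the question's data (assumed for illustration)
df = pd.DataFrame({
    "code": ["C001", "C001", "C001", "C002", "C002"],
    "type": ["ACT", "AL", "SET", "ACT", "SET"],
    "name": ["Exhaust Blower Drive", "", "Exhaust Blower Drive",
             "Spray Pump Motor 1 Pump", "Spray Pump Motor 1 Pump"],
})

# One name per code, taken from the row whose type is 'SET'
set_names = (
    df.loc[df["type"] == "SET"]
      .drop_duplicates("code")
      .set_index("code")["name"]
)

# Look up each row's code in that Series
df["final_component"] = df["code"].map(set_names)
print(df)
```

A code that has no 'SET' row would end up with NaN in final_component, which makes such gaps easy to spot.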
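Solution 1: when the data is grouped sequentially
If the 'name' column already contains null values, you can do something as simple as ffill(). The pandas ffill() method fills missing values in a data frame or series. "ffill" stands for "forward fill": it propagates the last valid observation forward. Note that it does not take the values in code into account. If you want to consider those as well, see Solution 2.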
import pandas as pd
import numpy as np
a = {'code':['C001']*7+['C002']*2+['C003']*2+['C004']*2,
'typ':['ACT','AL','AL','SET','AL','AL','AL','ACT','SET','ACT','SET','ACT','SET'],
'name':['Exhaust Blower Drive',None,None,'Exhaust Blower Drive',np.nan,np.nan,np.nan,
'Spray Pump Motor 1 Pump','Spray Pump Motor 1 Pump',
'Spray Pump Motor 2 Pump','Spray Pump Motor 2 Pump',
'Spray Pump Motor 3 Pump','Spray Pump Motor 3 Pump']}
df = pd.DataFrame(a)
# copy all the values from name to final_component with ffill()
# it fills in the rows where name is missing
# this only works if each missing row belongs to the same set as the row above it
df['final_component'] = df['name'].ffill()
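To illustrate the caveat that a plain ffill() ignores code, here is a small sketch with made-up rows where a group starts with a missing name. It compares the plain forward fill with a fill done inside each code group (the per-group ffill followed by bfill is one possible variation, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the C002 group starts with a missing name
df = pd.DataFrame({
    "code": ["C001", "C001", "C002", "C002"],
    "name": ["Exhaust Blower Drive", np.nan, np.nan, "Spray Pump Motor 1 Pump"],
})

# Plain forward fill: the value leaks from C001 into the first C002 row
plain = df["name"].ffill()

# Fill inside each code group: forward first, then backward for leading gaps
grouped = df.groupby("code")["name"].transform(lambda s: s.ffill().bfill())

print(pd.DataFrame({"code": df["code"], "plain": plain, "grouped": grouped}))
```

In the plain column the first C002 row receives the C001 name, while the grouped column keeps every fill within its own code.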
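Solution 2: when the fill must be based on another column's value
If you need to fill based on the value in code, you can use the approach below.
You can build a lookup and then update the values. Try something like this.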
import pandas as pd
import numpy as np
a = {'code':['C001']*7+['C002']*2+['C003']*2+['C004']*2,
'typ':['ACT','AL','AL','SET','AL','AL','AL','ACT','SET','ACT','SET','ACT','SET'],
'name':['Exhaust Blower Drive',np.nan,np.nan,'Exhaust Blower Drive',np.nan,np.nan,np.nan,
'Spray Pump Motor 1 Pump','Spray Pump Motor 1 Pump',
'Spray Pump Motor 2 Pump','Spray Pump Motor 2 Pump',
'Spray Pump Motor 3 Pump','Spray Pump Motor 3 Pump']}
df = pd.DataFrame(a)
# copy all the values from name to final_component, including nulls
df['final_component'] = df['name']
# build a lookup Series: the first non-null final_component for each code
lookup = df.groupby('code')['final_component'].first()
# identify all the null values that need to be replaced
noname = df['final_component'].isnull()
# replace the null values with the value looked up by code
# (a single .loc assignment avoids chained-assignment problems)
df.loc[noname, 'final_component'] = df.loc[noname, 'code'].map(lookup)
print(df)
The output will look like this:
code typ name final_component
0 C001 ACT Exhaust Blower Drive Exhaust Blower Drive
1 C001 AL NaN Exhaust Blower Drive
2 C001 AL NaN Exhaust Blower Drive
3 C001 SET Exhaust Blower Drive Exhaust Blower Drive
4 C001 AL NaN Exhaust Blower Drive
5 C001 AL NaN Exhaust Blower Drive
6 C001 AL NaN Exhaust Blower Drive
7 C002 ACT Spray Pump Motor 1 Pump Spray Pump Motor 1 Pump
8 C002 SET Spray Pump Motor 1 Pump Spray Pump Motor 1 Pump
9 C003 ACT Spray Pump Motor 2 Pump Spray Pump Motor 2 Pump
10 C003 SET Spray Pump Motor 2 Pump Spray Pump Motor 2 Pump
11 C004 ACT Spray Pump Motor 3 Pump Spray Pump Motor 3 Pump
12 C004 SET Spray Pump Motor 3 Pump Spray Pump Motor 3 Pump
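As a more compact variation on the lookup idea, the first non-null name of each code can be broadcast back to every row with groupby(...).transform('first'). This is only a sketch on a reduced sample, and it assumes that every non-null name within a code is the same, as in the data shown above:

```python
import numpy as np
import pandas as pd

# Reduced sample in the same shape as the answer's data
df = pd.DataFrame({
    "code": ["C001"] * 4 + ["C002"] * 2,
    "typ": ["ACT", "AL", "SET", "AL", "ACT", "SET"],
    "name": ["Exhaust Blower Drive", np.nan, "Exhaust Blower Drive", np.nan,
             "Spray Pump Motor 1 Pump", "Spray Pump Motor 1 Pump"],
})

# 'first' skips missing values, so each row gets its group's first real name
df["final_component"] = df.groupby("code")["name"].transform("first")
print(df)
```

This avoids building the lookup and the null mask separately, at the cost of overwriting final_component for every row rather than only the missing ones.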