Fill missing values in selected columns with filtered values from another column
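I have a strange column in my dataframe named null that contains some of the values missing from other columns. One of those columns, named location, holds latitude/longitude coordinates; the other, named level, holds an integer representing the target variable. In some, but not all, of the cases where location or level has a missing value, the value that should be there is in this null column. Here is an example df: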
import pandas as pd
from numpy import nan

df = pd.DataFrame(
    {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
     'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
     'level': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
    }
)
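I need to be able to filter the null column based on whether each value is an integer or a string, and then use that value to fill the missing value in the appropriate column. So far I have tried .apply() with a lambda function inside a for loop, as well as .match(), .contains(), and in, but without success.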
Let's try to_numeric:
checker = pd.to_numeric(df.null, errors='coerce')
checker
Out[171]:
0 NaN
1 2.0
2 NaN
3 4.0
4 3.0
Name: null, dtype: float64
Then apply isnull. Where to_numeric returned NaN, the original string is not a number:
isstring = checker.isnull()
Out[172]:
0 True
1 False
2 True
3 False
4 False
Name: null, dtype: bool
isnumber = checker.notnull()
Fill the values: coordinate strings go to location, numeric values go to level.
df.loc[isstring, 'location'] = df['null']
df.loc[isnumber, 'level'] = df['null']
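Putting the to_numeric steps together, here is a minimal sketch that should run end to end. It makes two assumptions that are not in the answer above: it only fills cells that are actually missing in the target column (the question says only some rows are affected), and it casts location to object first so that assigning strings into an all-NaN float column does not trip a dtype warning or error in recent pandas versions. The names fill_location and fill_level are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "null": ["43.70477575,-72.28844073", "2", "43.70637091,-72.28704334", "4", "3"],
    "location": [np.nan] * 5,
    "level": [np.nan] * 5,
})

# Numeric-looking entries parse to a float; everything else becomes NaN.
checker = pd.to_numeric(df["null"], errors="coerce")
isnumber = checker.notna()
isstring = checker.isna() & df["null"].notna()

# Assumption: only overwrite cells that are missing in the target column.
fill_location = isstring & df["location"].isna()
fill_level = isnumber & df["level"].isna()

# An all-NaN column is float64; cast to object before storing strings in it.
df["location"] = df["location"].astype(object)
df.loc[fill_location, "location"] = df.loc[fill_location, "null"]

# Store the parsed numbers (not the original strings) in level.
df.loc[fill_level, "level"] = checker[fill_level]

print(df)
```

Writing the parsed values from checker into level keeps that column numeric, so no later conversion from str is needed.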
Another approach might use the pandas.Series.mask method:
>>> df
null location level
0 43.70477575,-72.28844073 NaN NaN
1 2 NaN NaN
2 43.70637091,-72.28704334 NaN NaN
3 4 NaN NaN
4 3 NaN NaN
>>> df.level.mask(df.null.str.isnumeric(), other = df.null, inplace = True)
>>> df.location.where(df.null.str.isnumeric(), other = df.null, inplace = True)
>>>
>>> df
null location level
0 43.70477575,-72.28844073 43.70477575,-72.28844073 NaN
1 2 NaN 2
2 43.70637091,-72.28704334 43.70637091,-72.28704334 NaN
3 4 NaN 4
4 3 NaN 3
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mask.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html
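The same mask/where idea can also be written with plain assignment instead of inplace=True on a column accessed through the frame, which newer pandas versions may warn about or silently ignore. This is a sketch under that assumption rather than a replacement for the session above; is_numeric is a name introduced here. One caveat: str.isnumeric() returns False for strings such as "-3" or "2.0", so it only suits non-negative integer levels like the ones in the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "null": ["43.70477575,-72.28844073", "2", "43.70637091,-72.28704334", "4", "3"],
    "location": [np.nan] * 5,
    "level": [np.nan] * 5,
})

# Both target columns will hold strings, so make them object dtype up front.
df["location"] = df["location"].astype(object)
df["level"] = df["level"].astype(object)

# True for rows whose `null` entry consists only of digits.
is_numeric = df["null"].str.isnumeric()

# mask: replace values where the condition is True.
df["level"] = df["level"].mask(is_numeric, df["null"])
# where: keep values where the condition is True, replace the rest.
df["location"] = df["location"].where(is_numeric, df["null"])

print(df)
```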
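Perhaps the simplest approach, if not the most elegant, is to fill all missing values in df.location and df.level with the values from df.null, and then use boolean filters built from regular expressions to set the inappropriate/misassigned values in df.location and df.level back to np.nan.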
pd.fillna()
import pandas as pd
from numpy import nan

df = pd.DataFrame(
    {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
     'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
     'level': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
    }
)
# Fill every missing value in both columns from the `null` column
for col in ['location', 'level']:
    df[col] = df[col].fillna(df['null'])
Now we will use string expressions to correct the misassigned values.
str.contains()
import numpy as np

# Convert columns to str so the string methods work
df = df.astype(str)
# Use regexes to set values that don't belong in a column to NaN
regex = r'[,]'
df.loc[df.level.str.contains(regex), 'level'] = np.nan
regex = r'^\d\.?0?$'
df.loc[df.location.str.contains(regex), 'location'] = np.nan
# Convert `df.level` to float (str is the correct datatype for
# `df.location`). `astype` returns a new Series, so assign the result
# back (df['level'] = ...) if the conversion should persist.
df.level.astype(float)
Here is the output:
pd.DataFrame(
{'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
'location': {0: '43.70477575,-72.28844073', 1: nan, 2: '43.70637091,-72.28704334', 3: nan, 4: nan},
'level': {0: nan, 1: '2', 2: nan, 3: '4', 4: '3'}
}
)
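Since the question describes level as an integer target, one option for the final conversion is pandas' nullable Int64 dtype, which keeps whole numbers as integers while still allowing missing entries. This is only a sketch of that idea, not part of the answer above; the standalone level Series below mirrors the level column in the output, and level_int is a name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for df['level'] after the misassigned values were set to NaN.
level = pd.Series([np.nan, "2", np.nan, "4", "3"], name="level", dtype=object)

# Parse the strings to numbers, then store them as nullable integers.
level_int = pd.to_numeric(level, errors="coerce").astype("Int64")

print(level_int)
```

With a plain float column the levels would display as 2.0, 4.0, 3.0; Int64 keeps them as 2, 4, 3 and marks the gaps as <NA>.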