How to deal with missing values in Pandas

Question

When a dataset contains missing values, what is the best way to handle them? Should I drop them or replace them with zero?

Suppose I have this data:

id	name	price	product_group
1	nd	14.35	care
2	nd	10.02	makeup
3	nd	5.40	nd
4	nd	7.68	nd

I need to analyze the data in the 'product_group' column, and I tried to remove the 'nd' values with this code, but it doesn't work.

    order['product_group'] = order['product_group'].replace('nd', np.nan)
    order['product_group'] = order['product_group'].dropna(how='any')

Answer 1

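You can get the index of the rows where product_group is 'nd' and then drop those rows from the original DataFrame:

# Index labels of the rows where product_group is 'nd'
i = order[order.product_group == 'nd'].index

# drop() returns a new DataFrame, so assign the result back
order = order.drop(i)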
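
The same filtering can also be written with a boolean mask instead of collecting index labels first. This is only a sketch: the `order` frame below is rebuilt from the table in the question, and the names `filtered` and `dropped` are made up for the illustration.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question (an assumption).
order = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["nd", "nd", "nd", "nd"],
    "price": [14.35, 10.02, 5.40, 7.68],
    "product_group": ["care", "makeup", "nd", "nd"],
})

# Keep only the rows whose product_group is not the 'nd' placeholder.
filtered = order[order["product_group"] != "nd"]

# The index-based drop from the answer, for comparison.
dropped = order.drop(order[order["product_group"] == "nd"].index)

print(filtered)
```

Both approaches keep the same two rows; the mask version just skips the intermediate index.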

Answer 2

You should call dropna() on the whole DataFrame and pass the product_group column as subset:

import numpy as np

order['product_group'] = order['product_group'].replace('nd', np.nan)
order = order.dropna(subset=['product_group'])

#    id name  price product_group
# 0   1   nd  14.35          care
# 1   2   nd  10.02        makeup

As for why your version doesn't work: calling dropna() on the column alone (without assigning the result back) works fine:

order['product_group'].dropna()

# 0      care
# 1    makeup
# Name: product_group, dtype: object

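However, when you assign this shorter Series back into the full DataFrame, pandas aligns it on the index. Rows 2 and 3 have no matching value in the Series, so pandas fills them with NaN again and no rows are removed.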
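
To see the alignment behaviour side by side with the working version, here is a minimal sketch. The `order` frame is rebuilt from the table in the question, and `short`, `reassigned`, and `cleaned` are names chosen just for this illustration.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the table in the question (an assumption).
order = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["nd", "nd", "nd", "nd"],
    "price": [14.35, 10.02, 5.40, 7.68],
    "product_group": ["care", "makeup", "nd", "nd"],
})

order["product_group"] = order["product_group"].replace("nd", np.nan)

# dropna() on the column returns a shorter Series with index labels 0 and 1.
short = order["product_group"].dropna()

# Assigning it back aligns on the index: labels 2 and 3 have no match,
# so they are filled with NaN and the frame keeps all four rows.
reassigned = order.copy()
reassigned["product_group"] = short

# Dropping at the DataFrame level removes the rows themselves.
cleaned = order.dropna(subset=["product_group"])

print(len(reassigned), len(cleaned))
```

The first number printed is the row count after the column-level reassignment, the second is the row count after dropna(subset=...).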
