Filling in missing values with Pandas

Question

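I'm trying to figure out the best way to fill in the 'region_cd' and 'model_cd' fields of a CSV file with Pandas. The 'RevenueProduced' field tells you what the correct value is for any missing field. My idea is to query my dataframe for all rows that share the same 'region_cd' and 'RevenueProduced' and make their 'model_cd' values match (and vice versa for a missing 'region_cd').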

import io

import pandas as pd
import requests as r

# variables needed for ease of file access
url = 'http://drd.ba.ttu.edu/isqs3358/hw/hw2/'
file_1 = 'powergeneration.csv'

# download the CSV and parse the response text into a dataframe
res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text), delimiter=',')

There are probably many ways to solve this, but I'm just getting started with Pandas and I'm stumped, to say the least. Any help would be great.

Answer 1

This assumes that each RevenueProduced value maps to exactly one region_cd and one model_cd.
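That assumption can be checked before filling anything. Here is a minimal sketch on a made-up toy frame standing in for the real CSV (the values are invented and `toy`, `distinct`, and `ambiguous` are just illustrative names); it counts how many distinct non-missing codes appear for each RevenueProduced value:

```python
import pandas as pd

# Toy data standing in for the real CSV (values are invented for illustration)
toy = pd.DataFrame({
    "RevenueProduced": [100, 100, 200, 200, 300],
    "region_cd": ["A", None, "B", "B", "C"],
    "model_cd": ["X", "X", None, "Y", "Z"],
})

# Number of distinct non-missing codes seen for each RevenueProduced value
distinct = toy.groupby("RevenueProduced")[["region_cd", "model_cd"]].nunique()

# Any revenue value with more than one code would break the assumption
ambiguous = distinct[(distinct > 1).any(axis=1)]
assumption_holds = ambiguous.empty
```

If `ambiguous` is not empty, the rows it lists are the RevenueProduced values that cannot be used to infer a single code.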

Take a look at the pandas groupby function. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

You could do something like the following:

# create mask to grab only regions with values
mask = df['region_cd'].notna()

# group by `RevenueProduced`, take the first known `region_cd` in each group and reset the index
region_df = df[mask].groupby('RevenueProduced')["region_cd"].first().reset_index()

# check out the built-in zip function to understand what's happening here
region_map = dict(zip(region_df.RevenueProduced, region_df.region_cd))

# store data in new column, although you could overwrite "region_cd"
df.loc[:, 'region_cd_NEW'] = df["RevenueProduced"].map(region_map)

You would follow exactly the same process for model_cd. I haven't run this code because I couldn't access your CSV at the time of writing, but I hope it helps.

Here is the documentation for the Series .map method. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html

(Keep in mind that a Series is just a single column of a dataframe.)
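Since the same steps apply to both columns, they can be wrapped in a small helper. This is only a sketch and has not been run against the real CSV: `fill_from_revenue` is a hypothetical name, and the sample rows are invented. It passes a Series to .map instead of building a dict, and uses fillna so that values already present are left untouched.

```python
import pandas as pd


def fill_from_revenue(frame, column, key="RevenueProduced"):
    """Fill missing values in `column` with the first known value per `key`."""
    # Keep only rows where the target column is present
    known = frame.dropna(subset=[column])
    # Series indexed by key, holding the first known code for each key
    mapping = known.groupby(key)[column].first()
    # Look up a candidate for every row, but only use it where a value is missing
    return frame[column].fillna(frame[key].map(mapping))


# Invented sample rows standing in for the real data
sample = pd.DataFrame({
    "RevenueProduced": [100, 100, 200, 200, 300],
    "region_cd": ["A", None, "B", "B", "C"],
    "model_cd": ["X", "X", None, "Y", "Z"],
})

for col in ["region_cd", "model_cd"]:
    sample[col] = fill_from_revenue(sample, col)
```

A row whose RevenueProduced never appears alongside a known code stays missing, which seems safer than guessing.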


python

missing-data

pandas