[python]: Convert a pandas column of dictionary items to individual rows in a DataFrame
I have a pandas DataFrame that looks like this:
date_time country src_type edges
2021-05-01 DE home {"home": 10, "nav": 3}
2021-05-03 IN nav {"support": 1}
2021-05-04 AE cart {"chat": 1, "about": 4, "home": 5}
2021-05-07 US about {}
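For anyone who wants to reproduce this, a frame like the one above could be built roughly as follows. This is only a sketch: keeping date_time as plain strings is an assumption, since the actual dtype of that column is not shown.

```python
import pandas as pd

# Sample data copied from the table above.
# date_time is kept as strings here (an assumption about the real dtype).
df = pd.DataFrame(
    {
        "date_time": ["2021-05-01", "2021-05-03", "2021-05-04", "2021-05-07"],
        "country": ["DE", "IN", "AE", "US"],
        "src_type": ["home", "nav", "cart", "about"],
        "edges": [
            {"home": 10, "nav": 3},
            {"support": 1},
            {"chat": 1, "about": 4, "home": 5},
            {},  # empty dict: this row should vanish from the output
        ],
    }
)
```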
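The edges column holds a dictionary that maps each edge's dst_type to its edge_count. I want every individual item in the dictionary to become its own row in the DataFrame.
This is clearer when looking at the expected output: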
date_time country src_type dst_type edge_count
2021-05-01 DE home home 10
2021-05-01 DE home nav 3
2021-05-03 IN nav support 1
2021-05-04 AE cart chat 1
2021-05-04 AE cart about 4
2021-05-04 AE cart home 5
The last row of the original DataFrame is dropped because the dictionary in edges is empty:
date_time country src_type edges
. . .
2021-05-07 US about {}
Currently, I am doing the following:
import pandas as pd

records = []
for _, row in df.iterrows():
    # One record per (dst_type, edge_count) pair; an empty dict yields nothing.
    for dst_type, edge_count in sorted(row["edges"].items()):
        records.append(
            (row["date_time"], row["country"], row["src_type"], dst_type, edge_count)
        )
df = pd.DataFrame.from_records(
    records, columns=["date_time", "country", "src_type", "dst_type", "edge_count"]
)
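However, this is very slow because iterating over the DataFrame row by row takes time. I would like to vectorize this operation to make it faster. Any pointers or suggestions?
I would appreciate any help with this, since it would speed up our processing. Thanks!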
You can use pd.DataFrame() to convert the dictionaries to new columns, with the dict keys as column labels. Then use .melt() to convert the new columns to individual rows. Sort by the date_time column as required using .sort_values(). Finally, remove the rows that have no value (NaN) in the resulting edge_count column using .dropna(), as follows:
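# Expand each dict into columns (one per key); keys missing from a row become NaN.
# Assumes df has the default RangeIndex so that join aligns row by row.
df2 = df.drop('edges', axis=1).join(pd.DataFrame(df['edges'].tolist()))

# Reshape the key columns into (dst_type, edge_count) rows and drop the NaN padding.
(df2.melt(id_vars=['date_time', 'country', 'src_type'], var_name='dst_type', value_name='edge_count')
    .sort_values('date_time')
    .dropna(subset=['edge_count'])
)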
Result:
date_time country src_type dst_type edge_count
0 2021-05-01 DE home home 10.0
4 2021-05-01 DE home nav 3.0
9 2021-05-03 IN nav support 1.0
18 2021-05-04 AE cart about 4.0
14 2021-05-04 AE cart chat 1.0
2 2021-05-04 AE cart home 5.0
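Note that edge_count comes back as float in the result above, because the intermediate wide frame is padded with NaN. If that matters, or if the number of distinct keys is large enough that the wide frame gets expensive, one possible alternative is to turn each dict into a list of (key, value) pairs and use .explode(). The following is only a sketch under a few assumptions: the sample data is rebuilt inline, the names items, long_df, pairs and out are made up for illustration, and ignore_index=True needs pandas 1.1 or later. The .map() call still runs a Python-level function once per row, so this is not fully vectorized and I have not benchmarked it.

```python
import pandas as pd

# Sample data from the question (date_time kept as strings for simplicity).
df = pd.DataFrame(
    {
        "date_time": ["2021-05-01", "2021-05-03", "2021-05-04", "2021-05-07"],
        "country": ["DE", "IN", "AE", "US"],
        "src_type": ["home", "nav", "cart", "about"],
        "edges": [
            {"home": 10, "nav": 3},
            {"support": 1},
            {"chat": 1, "about": 4, "home": 5},
            {},
        ],
    }
)

# Turn each dict into a sorted list of (dst_type, edge_count) tuples.
items = df["edges"].map(lambda d: sorted(d.items()))

# One row per tuple; an empty list explodes to a single NaN row, which is dropped.
long_df = (
    df.drop(columns="edges")
    .assign(item=items)
    .explode("item", ignore_index=True)
    .dropna(subset=["item"])
    .reset_index(drop=True)
)

# Split the tuples into two columns and attach them positionally.
pairs = pd.DataFrame(long_df.pop("item").tolist(), columns=["dst_type", "edge_count"])
out = pd.concat([long_df, pairs], axis=1)
```

Because no NaN padding is involved, edge_count stays an integer column here, and the rows keep the original row order with keys sorted within each row, the same order the loop in the question produces.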