pandas dataset transformation to normalize the data

Question

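I have a CSV file like this:

I want to convert it into a pandas DataFrame like this:

Basically, I am trying to normalize the dataset so that I can populate a SQL table.

I have already used json_normalize to create a separate dataset from the genres column, but I cannot figure out how to transform the two columns as shown in the image above.

Any suggestions would be much appreciated.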

Answer 1

If genre_id is the only numeric value in the genres column (as in the image), you can use the following:

# Find every run of digits in the genres column and join each row's matches into one comma-separated string.
df['genre_id'] = df['genres'].str.findall(r'(\d+)').apply(', '.join)

# Split that string back into a list, then use pandas.DataFrame.explode to give each genre_id its own row.
df = df.assign(genre_id=df['genre_id'].str.split(',')).explode('genre_id')

# Remove the leading space left over from the ', ' separator.
df['genre_id'] = df['genre_id'].str.lstrip()

# If required, keep only these two columns.
df = df[['id', 'genre_id']]
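
To try this end to end, here is a minimal, self-contained sketch. The sample rows and the name link are made up for illustration, since the original CSV was only shown as an image; it assumes the genres column holds JSON-like text in which the genre ids are the only digits. It is a slightly shorter variant of the steps above: str.findall already returns a list per row, so explode can be applied to it directly without joining and splitting.

```python
import pandas as pd

# Hypothetical sample rows; the real CSV appeared only as an image.
df = pd.DataFrame({
    "id": [1, 2],
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 35, "name": "Comedy"}]',
    ],
})

# str.findall returns one list of digit strings per row,
# and explode turns each list element into its own row.
link = (
    df.assign(genre_id=df["genres"].str.findall(r"\d+"))
      .explode("genre_id")[["id", "genre_id"]]
      .astype({"genre_id": "int64"})  # digit strings -> integers for the SQL table
      .reset_index(drop=True)
)

print(link)
```

Note that the pattern matches any run of digits, so a genre name containing a number would also be picked up. If that can happen in the real data, parsing the column as JSON is probably the safer route.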

python

json

pandas

denormalized