How to take a column of lists of dictionary values and create new columns using their values (not keys)

I am analyzing political ads from Facebook, from the ProPublica dataset released here.

There is an entire column, 'targets', that I want to analyze, but its format means each observation is a list of dicts in the form of a string (e.g. "[{k1: v1}, {k2: v2}]").

import pandas as pd

data = {0: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]', 1: '[{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]', 2: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]', 3: '[]', 4: '[{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]'}

df = pd.DataFrame.from_dict(data, orient='index', columns=['targets'])

# display(df)
                                                                                                                                                                                                                                                                            targets
0                                                   [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]
1                                                 [{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]
2                                                                                                                               [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]
3                                                                                                                                                                                                                                                                                []
4  [{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]

I need each "target" value to be split out into a column header, with each corresponding "segment" value becoming a value in that column.

Alternatively, would the solution be to create a function that calls each dictionary key in each row and calculates the frequencies?

The output should look something like this:

            Age MinAge                                   Retargeting             Region  ...                          Interest Location Granularity            Country Gender
0  21 and older     21  people who may be similar to their customers  the United States  ...                               NaN                  NaN                NaN    NaN
1  18 and older     18                                           NaN                NaN  ...  Republican Party (United States)              country  the United States    NaN
2  18 and older     18                                           NaN                NaN  ...                               NaN              country  the United States  women

Someone on Reddit posted this solution:

import json

for id,row in enumerate(df.targets):
    for d in json.loads(row):
        df.loc[id,d['target']] = d['segment']

df = df.drop(columns=['targets'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-53-339ae1670258> in <module>
      2 for id,row in enumerate(df.targets):
      3     for d in json.loads(row):
----> 4         df.loc[id,d['target']] = d['segment']
      5 
      6 df = df.drop(columns=['targets'])

KeyError: 'segment'
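
The KeyError happens because some entries, such as {"target": "List"} in row 2, have a "target" key but no "segment" key. A minimal guard, assuming it is acceptable to simply skip those incomplete entries, would be to test for the key first (this is only a sketch of a fix, not the approach used in the answer below):

import json

for i, row in enumerate(df.targets):
    for d in json.loads(row):
        # skip dicts that carry no "segment" value, e.g. {"target": "List"}
        if 'segment' in d:
            df.loc[i, d['target']] = d['segment']

df = df.drop(columns=['targets'])

The answer below avoids the issue differently: it collapses each list of dicts into a single dict, skipping incomplete entries, and then normalizes the result.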
  • def fix() is not vectorized, but even so, applying it to the 222186 rows of the file only takes 591 ms.
  • Replace the NaN values in the column with .fillna(), otherwise literal_eval raises ValueError: malformed node or string: nan
  • Replace 'null' with 'None', otherwise literal_eval raises ValueError: malformed node or string: <_ast.Name object at 0x000002219927A0A0>
  • The values in the 'targets' column are all of type str; they can be converted to lists with ast.literal_eval
  • def fix() iterates through the dicts in the list and creates key-value pairs (in dd) using only the values, thereby converting each list of dicts into a single dict
    • Empty lists become empty dicts, which is required for .json_normalize() to work on the column.
  • pandas.json_normalize() can then easily be used on the column.
  • See also an alternative approach that uses the same data.
    • It uses .groupby with a .count() aggregation once the 'targets' column has been expanded long-wise (into a tidy format); a sketch of that reshaping is included at the end of this answer.
import pandas as pd
from ast import literal_eval

# load the file
df = pd.read_csv('en-US.csv')

# replace NaNs with '[]', otherwise literal_eval will error
df.targets = df.targets.fillna('[]')

# replace null with None, otherwise literal_eval will error
df.targets = df.targets.str.replace('null', 'None')

# convert the strings to lists of dicts
df.targets = df.targets.apply(literal_eval)

# function to transform the list of dicts in each row
def fix(col):
    dd = dict()
    for d in col:
        values = list(d.values())
        if len(values) == 2:
            dd[values[0]] = values[1]
    return dd

# apply the function to targets
df.targets = df.targets.apply(fix)

# display(df.targets.head())
                                                                                                                                  targets
0     {'Age': '18 and older', 'MinAge': '18', 'Segment': 'Multicultural affinity: African American (US).', 'Region': 'the United States'}
1   {'Age': '45 and older', 'MinAge': '45', 'Retargeting': 'people who may be similar to their customers', 'Region': 'the United States'}
2                                                                              {'Age': '18 and older', 'MinAge': '18', 'Region': 'Texas'}
3                                                                                                                                      {}
4  {'Interest': 'The Washington Post', 'Gender': 'men', 'Age': '34 to 49', 'MinAge': '34', 'MaxAge': '49', 'Region': 'the United States'}

# normalize the targets column
normalized = pd.json_normalize(df.targets)

# join normalized back to df if desired
df = df.join(normalized).drop(columns=['targets'])

normalized in wide format, for the sample data

# display(normalized.head())
            Age MinAge                                         Segment             Region                                   Retargeting             Interest Gender MaxAge
0  18 and older     18  Multicultural affinity: African American (US).  the United States                                           NaN                  NaN    NaN    NaN
1  45 and older     45                                             NaN  the United States  people who may be similar to their customers                  NaN    NaN    NaN
2  18 and older     18                                             NaN              Texas                                           NaN                  NaN    NaN    NaN
3           NaN    NaN                                             NaN                NaN                                           NaN                  NaN    NaN    NaN
4      34 to 49     34                                             NaN  the United States                                           NaN  The Washington Post    men     49

normalized in wide format, for the full dataset

  • As .info() shows, the targets column contained many different keys, but not every row contained every key, so there are many NaNs.
  • To get unique value counts for this wide data format, use, for example, normalized.Age.value_counts().
print(normalized.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222186 entries, 0 to 222185
Data columns (total 26 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   Age                              157816 non-null  object
 1   MinAge                           156531 non-null  object
 2   Segment                          12288 non-null   object
 3   Region                           111638 non-null  object
 4   Retargeting                      39286 non-null   object
 5   Interest                         31514 non-null   object
 6   Gender                           7194 non-null    object
 7   MaxAge                           7767 non-null    object
 8   City                             23685 non-null   object
 9   State                            23685 non-null   object
 10  Website                          6235 non-null    object
 11  Language                         2584 non-null    object
 12  Audience Owner                   17859 non-null   object
 13  Location Granularity             29770 non-null   object
 14  Location Type                    29770 non-null   object
 15  Agency                           400 non-null     object
 16  List                             5034 non-null    object
 17  Custom Audience Match Key        1144 non-null    object
 18  Mobile App                       50 non-null      object
 19  Country                          22118 non-null   object
 20  Activity on the Facebook Family  3382 non-null    object
 21  Like                             855 non-null     object
 22  Education                        151 non-null     object
 23  Job Title                        15 non-null      object
 24  Relationship Status              22 non-null      object
 25  Employer                         4 non-null       object
dtypes: object(26)
memory usage: 44.1+ MB
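
For the long-wise (tidy) counting mentioned in the notes above, here is a minimal sketch, assuming the normalized frame produced by the code in this answer (the names long_df and counts are just illustrative):

# reshape normalized from wide to long (tidy) format: one row per (target, segment) pair
long_df = normalized.melt(var_name='target', value_name='segment').dropna(subset=['segment'])

# frequency of each segment within each target
counts = long_df.groupby(['target', 'segment'])['segment'].count()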