How to take a column of lists of dictionary values and create new columns using their values (not keys)

I am analyzing political ads from Facebook, from the ProPublica dataset released here.

There is an entire column, 'targets', that I want to analyze, but its format means each observation is a list of dicts in the form of a string (e.g. "[{k1: v1}, {k2: v2}]").

import pandas as pd

data = {0: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]', 1: '[{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]', 2: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]', 3: '[]', 4: '[{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]'}

df = pd.DataFrame.from_dict(data, orient='index', columns=['targets'])

# display(df)
                                                                                                                                                                                                                                                                            targets
0                                                   [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]
1                                                 [{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]
2                                                                                                                               [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]
3                                                                                                                                                                                                                                                                                []
4  [{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]

I need each "target" value to be split out into a column header, with each corresponding "segment" value becoming a value in that column.

Alternatively, would the solution be to create a function that calls each dictionary key in each row and calculates the frequencies?

The output should look something like this:

            Age MinAge                                   Retargeting             Region  ...                          Interest Location Granularity            Country Gender
0  21 and older     21  people who may be similar to their customers  the United States  ...                               NaN                  NaN                NaN    NaN
1  18 and older     18                                           NaN                NaN  ...  Republican Party (United States)              country  the United States    NaN
2  18 and older     18                                           NaN                NaN  ...                               NaN              country  the United States  women

Someone on Reddit posted this solution:

import json

for id,row in enumerate(df.targets):
    for d in json.loads(row):
        df.loc[id,d['target']] = d['segment']

df = df.drop(columns=['targets'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-53-339ae1670258> in <module>
      2 for id,row in enumerate(df.targets):
      3     for d in json.loads(row):
----> 4         df.loc[id,d['target']] = d['segment']
      5 
      6 df = df.drop(columns=['targets'])

KeyError: 'segment'
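
The KeyError happens because some entries, such as {"target": "List"} in row 2, have a "target" key but no "segment" key. A minimal guard, assuming it is acceptable to simply skip those incomplete entries, would be to test for the key first (this is only a sketch of a fix, not the approach used in the answer below):

import json

for i, row in enumerate(df.targets):
    for d in json.loads(row):
        # skip dicts that carry no "segment" value, e.g. {"target": "List"}
        if 'segment' in d:
            df.loc[i, d['target']] = d['segment']

df = df.drop(columns=['targets'])

The answer below avoids the issue differently: it collapses each list of dicts into a single dict, skipping incomplete entries, and then normalizes the result.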
  • def fix() is not vectorized, but even so, applying it to the 222186 rows of the file only takes 591 ms.
  • Replace the NaN values in the column with .fillna(), otherwise literal_eval raises ValueError: malformed node or string: nan
  • Replace 'null' with 'None', otherwise literal_eval raises ValueError: malformed node or string: <_ast.Name object at 0x000002219927A0A0>
  • The values in the 'targets' column are all of type str; they can be converted to lists with ast.literal_eval
  • def fix() iterates through the dicts in the list and creates key-value pairs (in dd) using only the values, thereby converting each list of dicts into a single dict
    • Empty lists become empty dicts, which is required for .json_normalize() to work on the column.
  • pandas.json_normalize() can then easily be used on the column.
  • See also an alternative approach that uses the same data.
    • It uses .groupby with a .count() aggregation once the 'targets' column has been expanded long-wise (into a tidy format); a sketch of that reshaping is included at the end of this answer.
import pandas as pd
from ast import literal_eval

# load the file
df = pd.read_csv('en-US.csv')

# replace NaNs with '[]', otherwise literal_eval will error
df.targets = df.targets.fillna('[]')

# replace null with None, otherwise literal_eval will error
df.targets = df.targets.str.replace('null', 'None')

# convert the strings to lists of dicts
df.targets = df.targets.apply(literal_eval)

# function to transform the list of dicts in each row
def fix(col):
    dd = dict()
    for d in col:
        values = list(d.values())
        if len(values) == 2:
            dd[values[0]] = values[1]
    return dd

# apply the function to targets
df.targets = df.targets.apply(fix)

# display(df.targets.head())
                                                                                                                                  targets
0     {'Age': '18 and older', 'MinAge': '18', 'Segment': 'Multicultural affinity: African American (US).', 'Region': 'the United States'}
1   {'Age': '45 and older', 'MinAge': '45', 'Retargeting': 'people who may be similar to their customers', 'Region': 'the United States'}
2                                                                              {'Age': '18 and older', 'MinAge': '18', 'Region': 'Texas'}
3                                                                                                                                      {}
4  {'Interest': 'The Washington Post', 'Gender': 'men', 'Age': '34 to 49', 'MinAge': '34', 'MaxAge': '49', 'Region': 'the United States'}

# normalize the targets column
normalized = pd.json_normalize(df.targets)

# join normalized back to df if desired
df = df.join(normalized).drop(columns=['targets'])

normalized in wide format, for the sample data

# display(normalized.head())
            Age MinAge                                         Segment             Region                                   Retargeting             Interest Gender MaxAge
0  18 and older     18  Multicultural affinity: African American (US).  the United States                                           NaN                  NaN    NaN    NaN
1  45 and older     45                                             NaN  the United States  people who may be similar to their customers                  NaN    NaN    NaN
2  18 and older     18                                             NaN              Texas                                           NaN                  NaN    NaN    NaN
3           NaN    NaN                                             NaN                NaN                                           NaN                  NaN    NaN    NaN
4      34 to 49     34                                             NaN  the United States                                           NaN  The Washington Post    men     49

normalized in wide format, for the full dataset

  • As .info() shows, the targets column contained many different keys, but not every row contained every key, so there are many NaNs.
  • To get unique value counts for this wide data format, use, for example, normalized.Age.value_counts().
print(normalized.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222186 entries, 0 to 222185
Data columns (total 26 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   Age                              157816 non-null  object
 1   MinAge                           156531 non-null  object
 2   Segment                          12288 non-null   object
 3   Region                           111638 non-null  object
 4   Retargeting                      39286 non-null   object
 5   Interest                         31514 non-null   object
 6   Gender                           7194 non-null    object
 7   MaxAge                           7767 non-null    object
 8   City                             23685 non-null   object
 9   State                            23685 non-null   object
 10  Website                          6235 non-null    object
 11  Language                         2584 non-null    object
 12  Audience Owner                   17859 non-null   object
 13  Location Granularity             29770 non-null   object
 14  Location Type                    29770 non-null   object
 15  Agency                           400 non-null     object
 16  List                             5034 non-null    object
 17  Custom Audience Match Key        1144 non-null    object
 18  Mobile App                       50 non-null      object
 19  Country                          22118 non-null   object
 20  Activity on the Facebook Family  3382 non-null    object
 21  Like                             855 non-null     object
 22  Education                        151 non-null     object
 23  Job Title                        15 non-null      object
 24  Relationship Status              22 non-null      object
 25  Employer                         4 non-null       object
dtypes: object(26)
memory usage: 44.1+ MB
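
For the long-wise (tidy) counting mentioned in the notes above, here is a minimal sketch, assuming the normalized frame produced by the code in this answer (the names long_df and counts are just illustrative):

# reshape normalized from wide to long (tidy) format: one row per (target, segment) pair
long_df = normalized.melt(var_name='target', value_name='segment').dropna(subset=['segment'])

# frequency of each segment within each target
counts = long_df.groupby(['target', 'segment'])['segment'].count()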