Find the longest group after groupby on normalized json in pandas

My code below groups by a value and builds, for each group, a list of array lengths. How can I return the id whose list has the largest sum?

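Original JSON read into df (the printed output further down comes from different data, because the full JSON is too long to post):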

{
   "kind":"admin#reports#activities",
   "etag":"\"5g8\"",
   "nextPageToken":"A:1651795128914034:-4002873813067783265:151219070090:C02f6wppb",
   "items":[
      {
         "kind":"admin#reports#activity",
         "id":{
            "time":"2022-05-05T23:59:39.421Z",
            "uniqueQualifier":"5526793068617678141",
            "applicationName":"token",
            "customerId":"cds"
         },
         "etag":"\"jkYcURYoi8\"",
         "actor":{
            "email":"blah@blah.net",
            "profileId":"1323"
         },
         "ipAddress":"107.178.193.87",
         "events":[
            {
               "type":"auth",
               "name":"activity",
               "parameters":[
                  {
                     "name":"api_name",
                     "value":"admin"
                  },
                  {
                     "name":"method_name",
                     "value":"directory.users.list"
                  },
                  {
                     "name":"client_id",
                     "value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
                  },
                  {
                     "name":"num_response_bytes",
                     "intValue":"7158"
                  },
                  {
                     "name":"product_bucket",
                     "value":"GSUITE_ADMIN"
                  },
                  {
                     "name":"app_name",
                     "value":"Untitled project"
                  },
                  {
                     "name":"client_type",
                     "value":"WEB"
                  }
               ]
            }
         ]
      },
      {
         "kind":"admin#reports#activity",
         "id":{
            "time":"2022-05-05T23:58:48.914Z",
            "uniqueQualifier":"-4002873813067783265",
            "applicationName":"token",
            "customerId":"df"
         },
         "etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
         "actor":{
            "email":"blah.blah@bebe.net",
            "profileId":"1324"
         },
         "ipAddress":"54.80.168.30",
         "events":[
            {
               "type":"auth",
               "name":"activity",
               "parameters":[
                  {
                     "name":"api_name",
                     "value":"gmail"
                  },
                  {
                     "name":"method_name",
                     "value":"gmail.users.messages.list"
                  },
                  {
                     "name":"client_id",
                     "value":"927538837578.apps.googleusercontent.com"
                  },
                  {
                     "name":"num_response_bytes",
                     "intValue":"2"
                  },
                  {
                     "name":"product_bucket",
                     "value":"GMAIL"
                  },
                  
                  {
                     "name":"client_type",
                     "value":"WEB"
                  }
               ]
            }
         ]
      }
   ]
}

Current code:

import pandas as pd

df = pd.json_normalize(response['items'])
df['test'] = df.groupby('actor.profileId')['events'].apply(lambda x: [len(x.iloc[i][0]['parameters']) for i in range(len(x))])

Output:

ID
1002306    [7, 7, 7, 5]
1234444    [3,5,6]
1222222    [1,3,4,5]

Expected output:

id       total
1002306  26


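There is no need to build an intermediate df and then group it. You can flatten the JSON directly by passing a record path and a meta path to json_normalize. Each row of the result is then one parameter, so the task reduces to counting the rows per actor.profileId and taking the largest count.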

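# One row per entry of events[].parameters; the actor dict is carried along as metadata
df = pd.json_normalize(response['items'], ['events','parameters'], ['actor'])
# Pull profileId out of the actor dict
df['actor.profileId'] = df['actor'].str['profileId']
# value_counts sorts in descending order, so the first row is the largest group
out = df.value_counts('actor.profileId').pipe(lambda x: x.iloc[[0]])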

Output:

actor.profileId
1323    7
dtype: int64
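
If you want the result in the id/total shape from the question, here is a minimal self-contained sketch of the same idea. The `items` list is a made-up miniature of `response['items']` that keeps only the fields used here, and the names `flat`, `totals` and `result` are just for illustration. It uses a plain `map` to read `profileId` from the actor dict and `groupby(...).size()` in place of `value_counts`; both count rows per profile.

```python
import pandas as pd

# Hypothetical miniature of response['items']; not the real API response.
items = [
    {"actor": {"profileId": "1323"},
     "events": [{"parameters": [{"name": "api_name"},
                                {"name": "method_name"},
                                {"name": "client_id"}]}]},
    {"actor": {"profileId": "1324"},
     "events": [{"parameters": [{"name": "api_name"},
                                {"name": "method_name"}]}]},
    {"actor": {"profileId": "1323"},
     "events": [{"parameters": [{"name": "api_name"},
                                {"name": "client_type"}]}]},
]

# One row per parameter; the whole actor dict is kept as a metadata column.
flat = pd.json_normalize(items, record_path=["events", "parameters"], meta=["actor"])

# Extract the profile id from each actor dict.
flat["actor.profileId"] = flat["actor"].map(lambda actor: actor["profileId"])

# Number of parameters per profile, summed over all of its activities.
totals = flat.groupby("actor.profileId").size()

# Keep only the profile with the largest total, as an id/total table.
result = pd.DataFrame({"id": [totals.idxmax()], "total": [int(totals.max())]})
print(result)
```

With this toy data profile 1323 has 3 + 2 = 5 parameters and 1324 has 2, so the table holds id 1323 with total 5. Note that `idxmax` returns only one id if several profiles tie for the maximum.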