Is there a way to get the count of every element in lists stored as rows in a data frame?

Question

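Hi, I'm using pandas to display and analyze a CSV file. Some columns have object dtype and are displayed as lists. I used literal_eval to convert the rows of a column named 'sdgs' to lists. My question is how to use groupby, or any other approach, to display the count of each unique element stored in these lists, especially since many elements are common to several of the lists.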

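import pandas as pd
from ast import literal_eval

df = pd.read_csv("../input/covid19-public-media-dataset/covid19_articles_20220420.csv")
# drop rows with a missing or empty 'sdgs' value
df = df.dropna(subset=['sdgs'])
df = df[df['sdgs'].astype(str) != '[]'].copy()
# parse the string representation of each list into a real Python list
df['sdgs'] = df['sdgs'].apply(literal_eval)
df = df.reset_index(drop=True)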

This is a sample of data and my problem is about the last column

This is an example of the elements I want to count

Thanks

Answer 1

Given this sample data:

import pandas as pd

df = pd.DataFrame({'domain': ['a', 'a', 'b', 'c'], 
                   'sdgs': [['Just', 'a', 'sentence'], ['another', 'sentence'], 
                            ['a', 'word', 'and', 'a', 'word'], ['nothing', 'here']]})
print(df)

  domain                     sdgs
0      a      [Just, a, sentence]
1      a      [another, sentence]
2      b  [a, word, and, a, word]
3      c          [nothing, here]

To get the word counts across all the lists in the sdgs column, you can concatenate the lists with Series.agg and count the result with collections.Counter:

import collections

word_counts = collections.Counter(df['sdgs'].agg(sum))
print(word_counts)

Counter({'a': 3, 'sentence': 2, 'word': 2, 'Just': 1, 'another': 1, 
         'and': 1, 'nothing': 1, 'here': 1})
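As a side note, the same totals can be obtained without summing the lists together, which can become slow on a large column. The following is only a minimal sketch, assuming the sdgs column already holds Python lists as in the sample above; it relies on Series.explode and Series.value_counts, and the name element_counts is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({'domain': ['a', 'a', 'b', 'c'],
                   'sdgs': [['Just', 'a', 'sentence'], ['another', 'sentence'],
                            ['a', 'word', 'and', 'a', 'word'], ['nothing', 'here']]})

# explode() puts each list element on its own row; value_counts() then
# tallies how often each element occurs across the whole column.
element_counts = df['sdgs'].explode().value_counts()
print(element_counts)
```

The result is a Series indexed by element, so a single count can be looked up with element_counts['a'].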

Answer 2

You can use explode on the lists like this:

import pandas as pd
from ast import literal_eval
import re

df = pd.DataFrame({'domain': ['a', 'b'], 'sdgs': ["['AaaBbbAndCcc','DddAndEee']","['BbbCccAndDdd']"]})
df
#   domain                          sdgs
# 0      a  ['AaaBbbAndCcc','DddAndEee']
# 1      b              ['BbbCccAndDdd']

# turn lists into strings, split at capitalized names
df['sdgs'] = df.sdgs.apply(lambda x: re.sub(r"([A-Z])", r" \1", ''.join(literal_eval(x))).split())
df
#   domain                                 sdgs
# 0      a  [Aaa, Bbb, And, Ccc, Ddd, And, Eee]
# 1      b                 [Bbb, Ccc, And, Ddd]

df.explode('sdgs')
#   domain sdgs
# 0      a  Aaa
# 0      a  Bbb
# 0      a  And
# 0      a  Ccc
# 0      a  Ddd
# 0      a  And
# 0      a  Eee
# 1      b  Bbb
# 1      b  Ccc
# 1      b  And
# 1      b  Ddd

Now you can group like this:

df.explode('sdgs').groupby(['domain']).count()
#         sdgs
# domain      
# a          7
# b          4
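The grouped count above gives the number of elements per domain rather than a count for each distinct element. A possible way to get the latter is sketched below, assuming the sdgs column already contains lists (the frame here is a hypothetical stand-in for the exploded example); the names exploded, per_domain and overall are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'domain': ['a', 'b'],
                   'sdgs': [['Aaa', 'Bbb', 'And', 'Ccc', 'Ddd', 'And', 'Eee'],
                            ['Bbb', 'Ccc', 'And', 'Ddd']]})

# one row per list element, with the original 'domain' repeated
exploded = df.explode('sdgs')

# count of each distinct element within each domain (MultiIndex: domain, element)
per_domain = exploded.groupby('domain')['sdgs'].value_counts()

# count of each distinct element over the whole column
overall = exploded['sdgs'].value_counts()

print(per_domain)
print(overall)
```

Whether the per-domain or the overall count is wanted depends on the question being asked of the data; both start from the same exploded frame.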

Edit: You will probably need some other way to split the strings, and you may also need to remove duplicate values.
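If duplicates within a single row should only be counted once, one option is to de-duplicate each list before exploding. This is a sketch under the assumption that the column already holds lists; the sample frame and the names deduped and rows_containing are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'domain': ['a', 'b'],
                   'sdgs': [['Aaa', 'And', 'Ddd', 'And'], ['Bbb', 'And']]})

# dict.fromkeys keeps the first occurrence of each element and preserves order
deduped = df['sdgs'].apply(lambda items: list(dict.fromkeys(items)))

# each element now counts at most once per row, so this is the number
# of rows in which the element appears
rows_containing = deduped.explode().value_counts()
print(rows_containing)
```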

python

csv

pandas

pandas-groupby

dtype