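Is there a way to get the count of every element in lists stored as rows in a data frame?
Hi, I am using pandas to display and analyze a CSV file. Some columns have 'object' dtype and are displayed as lists. I used 'literal_eval' to convert the rows of a column named 'sdgs' to lists. My question is how to use 'groupby', or any other way, to display the count of every unique element stored in these lists, especially since many elements are shared between the lists.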
import pandas as pd
from ast import literal_eval
df = pd.read_csv("../input/covid19-public-media-dataset/covid19_articles_20220420.csv")
df.dropna(subset=['sdgs'],inplace=True)
df=df[df.astype(str)['sdgs'] != '[]']
df.sdgs = df.sdgs.apply(literal_eval)
df.reset_index(drop=True, inplace=True)
This is a sample of data and my problem is about the last column
This is an example of the elements I want to count
Thanks
Given this sample data:
import pandas as pd
df = pd.DataFrame({'domain': ['a', 'a', 'b', 'c'],
'sdgs': [['Just', 'a', 'sentence'], ['another', 'sentence'],
['a', 'word', 'and', 'a', 'word'], ['nothing', 'here']]})
print(df)
  domain                     sdgs
0      a      [Just, a, sentence]
1      a      [another, sentence]
2      b  [a, word, and, a, word]
3      c          [nothing, here]
To get the word counts across all lists in the sdgs column, you can use Series.agg to concatenate the lists and then collections.Counter to count the elements:
import collections
word_counts = collections.Counter(df['sdgs'].agg(sum))
print(word_counts)
Counter({'a': 3, 'sentence': 2, 'word': 2, 'Just': 1, 'another': 1,
'and': 1, 'nothing': 1, 'here': 1})
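As an alternative sketch that stays inside pandas, the same totals can be obtained with Series.explode followed by value_counts. This is not part of the answer above; it assumes the sdgs column already holds real Python lists (for example after literal_eval) and reuses the sample data. The name element_counts is only illustrative.

```python
import pandas as pd

# Same sample data as above: each row of 'sdgs' is a Python list.
df = pd.DataFrame({'domain': ['a', 'a', 'b', 'c'],
                   'sdgs': [['Just', 'a', 'sentence'], ['another', 'sentence'],
                            ['a', 'word', 'and', 'a', 'word'], ['nothing', 'here']]})

# explode() yields one row per list element; value_counts() tallies each
# distinct element over the whole column.
element_counts = df['sdgs'].explode().value_counts()
print(element_counts)
```

The result is a Series indexed by element, so it can be filtered or joined like any other pandas object. It also avoids repeatedly concatenating lists, which may matter on a large file.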
You can use explode on the lists like this:
import pandas as pd
from ast import literal_eval
import re
df = pd.DataFrame({'domain': ['a', 'b'], 'sdgs': ["['AaaBbbAndCcc','DddAndEee']","['BbbCccAndDdd']"]})
df
# domain sdgs
# 0 a ['AaaBbbAndCcc','DddAndEee']
# 1 b ['BbbCccAndDdd']
# turn lists into strings, split at capitalized names
# r" \1" re-inserts the matched capital letter after the added space
df['sdgs']=df.sdgs.apply(lambda x: re.sub( r"([A-Z])", r" \1", ''.join(literal_eval(x))).split())
df
# domain sdgs
# 0 a [Aaa, Bbb, And, Ccc, Ddd, And, Eee]
# 1 b [Bbb, Ccc, And, Ddd]
df.explode('sdgs')
# domain sdgs
# 0 a Aaa
# 0 a Bbb
# 0 a And
# 0 a Ccc
# 0 a Ddd
# 0 a And
# 0 a Eee
# 1 b Bbb
# 1 b Ccc
# 1 b And
# 1 b Ddd
Now you can group like this:
df.explode('sdgs').groupby(['domain']).count()
# sdgs
# domain
# a 7
# b 4
Edit: You will need some other way to split the strings, and you may also need to remove duplicate values.
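If the goal is a count of each element within each group rather than a single total per domain, one possible sketch is to explode first and then call value_counts on the grouped column. This assumes sdgs already contains lists and uses the sample data from the first answer; exploded, per_domain_counts and unique_per_domain are illustrative names, not part of the original code.

```python
import pandas as pd

# Sample data: each row of 'sdgs' is a Python list.
df = pd.DataFrame({'domain': ['a', 'a', 'b', 'c'],
                   'sdgs': [['Just', 'a', 'sentence'], ['another', 'sentence'],
                            ['a', 'word', 'and', 'a', 'word'], ['nothing', 'here']]})

# One row per (domain, element) pair.
exploded = df.explode('sdgs')

# How often each element occurs within each domain.
per_domain_counts = exploded.groupby('domain')['sdgs'].value_counts()
print(per_domain_counts)

# Number of distinct elements per domain, ignoring repeats.
unique_per_domain = exploded.groupby('domain')['sdgs'].nunique()
print(unique_per_domain)
```

per_domain_counts is indexed by (domain, element), so a single count can be looked up with per_domain_counts.loc[('b', 'word')].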