How to extract data from text field in pandas dataframe?
I want to get the distribution of tags from this dataframe:
df=pd.DataFrame([
[43,{"tags":["webcom","start","temp","webcomfoto","dance"],"image":["https://image.com/Kqk.jpg"]}],
[83,{"tags":["yourself","start",""],"image":["https://images.com/test.jpg"]}],
[76,{"tags":["en","webcom"],"links":["http://webcom.webcomdb.com","http://webcom.webcomstats.com"],"users":["otole"]}],
[77,{"tags":["webcomznakomstvo","webcomzhiznx","webcomistoriya","webcomosebe","webcomfotografiya"],"image":["https://images.com/nt4wzguoh/y_a3d735b4.jpg","https://images.com/sucb0u24x/b1sd_Naju.jpg"]}],
[81,{"tags":["webcomfotografiya"],"users":["myself","boattva"],"links":["https://webcom.com/nk"]}],
],columns=["_id","tags"])
I need to get a table that shows how many ids (posts) have a given number of tags.
For example:
Number of posts | Number of tags
31 9
44 8
...
129 1
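I had a solution for the case where 'tags' was the only field. In this dataframe I also have 'image', 'users' and other text fields with values. How should I handle the data in this case?
Thanks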
Sticking with collections.Counter, here is one way:
from collections import Counter
from operator import itemgetter
c = Counter(map(len, map(itemgetter('tags'), df['tags'])))
res = pd.DataFrame.from_dict(c, orient='index').reset_index()
res.columns = ['Tags', 'Posts']
print(res)
Tags Posts
0 5 2
1 3 1
2 2 1
3 1 1
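If some rows might lack the 'tags' key, itemgetter would raise a KeyError. The sketch below is one way to guard against that and to order the result with the most tags first, as in the question's example. The `records` list is a hypothetical stand-in for the values of df['tags'], not the asker's real data.

```python
from collections import Counter

# Hypothetical stand-in for the values of df['tags'] in the question.
records = [
    {"tags": ["webcom", "start", "temp", "webcomfoto", "dance"], "image": ["x"]},
    {"tags": ["yourself", "start", ""], "image": ["y"]},
    {"tags": ["en", "webcom"], "users": ["otole"]},
    {"image": ["z"]},  # no 'tags' key at all
    {"tags": ["a", "b", "c", "d", "e"]},
]

# dict.get with a default avoids a KeyError for rows without 'tags'.
c = Counter(len(r.get("tags", [])) for r in records)

# (number of posts, number of tags) pairs, most tags first.
rows = [(posts, n_tags) for n_tags, posts in sorted(c.items(), reverse=True)]
print(rows)
```

Passing `rows` to pd.DataFrame with columns=['Number of posts', 'Number of tags'] would turn the pairs into the table layout from the question.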
You can use the str accessor to look up the dictionary key, then len and value_counts:
df.tags.str['tags'].str.len().value_counts()\
.rename('Posts')\
.rename_axis('Tags')\
.reset_index()
Output:
Tags Posts
0 5 2
1 3 1
2 2 1
3 1 1
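Update: use a combination of f-strings, a dict comprehension and a list comprehension to concisely extract the lengths of all lists in the tags column: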
extract_dict = [{f'count {y}':len(z) for y,z in x.items()} for x in df.tags]
# construct new df with only extracted counts
pd.DataFrame.from_records(extract_dict)
# new df with extracted counts & original data
df.assign(**pd.DataFrame.from_records(extract_dict))
# outputs:
_id tags count image \
0 43 {'tags': ['webcom', 'start', 'temp', 'webcomfo... 1.0
1 83 {'tags': ['yourself', 'start', ''], 'image': [... 1.0
2 76 {'tags': ['en', 'webcom'], 'links': ['http://w... NaN
3 77 {'tags': ['webcomznakomstvo', 'webcomzhiznx', ... 2.0
4 81 {'tags': ['webcomfotografiya'], 'users': ['mys... NaN
count links count tags count users
0 NaN 5 NaN
1 NaN 3 NaN
2 2.0 2 1.0
3 NaN 5 NaN
4 1.0 1 2.0
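The NaN values above appear because not every row has every key, which also turns those count columns into floats. A minimal sketch of one way to get integer counts instead, assuming a missing field should count as zero items (that interpretation is an assumption, and the two-row frame below is a shortened stand-in for the question's data):

```python
import pandas as pd

# Shortened stand-in for the question's data (same shape: dict of lists per row).
df = pd.DataFrame(
    [
        [43, {"tags": ["webcom", "start"], "image": ["a.jpg"]}],
        [76, {"tags": ["en"], "links": ["l1", "l2"], "users": ["otole"]}],
    ],
    columns=["_id", "tags"],
)

# One dict of list lengths per row; keys absent from a row become NaN in the frame.
extract_dict = [{f"count {k}": len(v) for k, v in row.items()} for row in df["tags"]]

# Assumption: a missing field means zero items, so fill with 0 and cast back to int.
counts = pd.DataFrame(extract_dict, index=df.index).fillna(0).astype(int)

# Attach the counts to the original frame; the index is shared, so rows line up.
result = df.join(counts)
print(result.drop(columns="tags"))
```

Building `counts` on df.index keeps the rows aligned even if the original frame does not use the default 0..n-1 index.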
Original answer:
If you know the column names in advance, you can use a list comprehension for this task:
extract = [(len(x.get('tags',[])), len(x.get('image',[])), len(x.get('users',[])))
           for x in df.tags]
# extract outputs:
[(5, 1, 0), (3, 1, 0), (2, 0, 1), (5, 2, 0), (1, 0, 2)]
This can then be used to create a new dataframe or to assign additional columns:
# creates new df
pd.DataFrame.from_records(
    extract,
    columns=['count tags', 'count images', 'count users']
)
# creates new dataframe with extracted data and original df
df.assign(
    **pd.DataFrame.from_records(
        extract,
        columns=['count tags', 'count images', 'count users'])
)
The last statement produces the following output:
_id tags count tags \
0 43 {'tags': ['webcom', 'start', 'temp', 'webcomfo... 5
1 83 {'tags': ['yourself', 'start', ''], 'image': [... 3
2 76 {'tags': ['en', 'webcom'], 'links': ['http://w... 2
3 77 {'tags': ['webcomznakomstvo', 'webcomzhiznx', ... 5
4 81 {'tags': ['webcomfotografiya'], 'users': ['mys... 1
count images count users
0 1 0
1 1 0
2 0 1
3 2 0
4 0 2
The problem is that the values in your tags column are strings, not dictionaries.
So a first step is needed:
import ast
df['tags'] = df['tags'].apply(ast.literal_eval)
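ast.literal_eval raises an exception on values that are already dicts or that hold malformed text. If the column might be mixed, a small wrapper is one option. The helper name `to_dict` and the choice to fall back to an empty dict are assumptions made for illustration, not part of the original answer.

```python
import ast


def to_dict(value):
    """Return a dict for dicts and for dict literals stored as text, else {}."""
    if isinstance(value, dict):
        return value  # already parsed, nothing to do
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return {}  # assumption: treat unparsable text as "no fields"
        return parsed if isinstance(parsed, dict) else {}
    return {}  # None, NaN and other non-text values


# Hypothetical mixed input: a dict literal as text, a real dict, and junk.
samples = ["{'tags': ['en', 'webcom']}", {"tags": []}, "not a dict", None]
print([to_dict(s) for s in samples])
```

With this helper the conversion step would read df['tags'] = df['tags'].apply(to_dict).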
Then apply the original solution; it also works well when there are multiple fields.
Verification:
df=pd.DataFrame([
[43,{"tags":[],"image":["https://image.com/Kqk.jpg"]}],
[83,{"tags":["yourself","start",""],"image":["https://images.com/test.jpg"]}],
[76,{"tags":["en","webcom"],"links":["http://webcom.webcomdb.com","http://webcom.webcomstats.com"],"users":["otole"]}],
[77,{"tags":["webcomznakomstvo","webcomzhiznx","webcomistoriya","webcomosebe","webcomfotografiya"],"image":["https://images.com/nt4wzguoh/y_a3d735b4.jpg","https://images.com/sucb0u24x/b1sd_Naju.jpg"]}],
[81,{"tags":["webcomfotografiya"],"users":["myself","boattva"],"links":["https://webcom.com/nk"]}],
],columns=["_id","tags"])
#print (df)
#convert the column to strings to verify the solution
df['tags'] = df['tags'].astype(str)
print (df['tags'].apply(type))
0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
3 <class 'str'>
4 <class 'str'>
Name: tags, dtype: object
#convert back
df['tags'] = df['tags'].apply(ast.literal_eval)
print (df['tags'].apply(type))
0 <class 'dict'>
1 <class 'dict'>
2 <class 'dict'>
3 <class 'dict'>
4 <class 'dict'>
Name: tags, dtype: object
from collections import Counter
c = Counter([len(x['tags']) for x in df['tags']])
df = pd.DataFrame({'Number of posts': list(c.values()), 'Number of tags': list(c.keys())})
print (df)
Number of posts Number of tags
0 1 0
1 1 3
2 1 2
3 1 5
4 1 1