Python Pandas: Create a DataFrame using a text file
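I am trying to create a DataFrame with Pandas from a raw text file. The file contains 3 categories, and each category name is followed by the items that belong to it. I am able to create a Series based on the categories, but I don't know how to associate each item with its category and build a DataFrame from that. Below is my initial code along with the desired DataFrame output. Can you point me to the right way to do this?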
import pandas as pd

category = ['Fruits', 'Vegetables', 'Meats']
items='''Fruits
apple
orange
pear
Vegetables
broccoli
squash
carrot
Meats
chicken
beef
lamb'''
Category = pd.Series(dtype=object)
i = 0
for item in items.splitlines():
    if item in category:
        # Series.set_value was removed in pandas 1.0; label-based assignment is used instead
        Category.loc[i] = item
        i += 1
df = pd.DataFrame(Category)
print(df)
Desired DataFrame output:
Category    Item
Fruits      apple
            orange
            pear
Vegetables  broccoli
            squash
            carrot
Meats       chicken
            beef
            lamb
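Consider iteratively appending to a dictionary of lists instead of a Series. Then convert the dictionary to a DataFrame. The key column below is used to produce the desired output, because a grouping like this needs a numeric column to aggregate: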
from io import StringIO
import pandas as pd
txtobj = StringIO('''Fruits
apple
orange
pear
Vegetables
broccoli
squash
carrot
Meats
chicken
beef
lamb''')
items = {'Category':[], 'Item':[]}
for line in txtobj:
    curr_line = line.replace('\n', '')
    # A category line starts a new group
    if curr_line in ['Fruits', 'Vegetables', 'Meats']:
        curr_category = curr_line
    # Every other line is an item of the current category
    if curr_category != curr_line:
        items['Category'].append(curr_category)
        items['Item'].append(curr_line)
df = pd.DataFrame(items).assign(key=1)
print(df)
#      Category      Item  key
# 0      Fruits     apple    1
# 1      Fruits    orange    1
# 2      Fruits      pear    1
# 3  Vegetables  broccoli    1
# 4  Vegetables    squash    1
# 5  Vegetables    carrot    1
# 6       Meats   chicken    1
# 7       Meats      beef    1
# 8       Meats      lamb    1
print(df['key'].groupby([df['Category'], df['Item']]).count())
# Category    Item
# Fruits      apple       1
#             orange      1
#             pear        1
# Meats       beef        1
#             chicken     1
#             lamb        1
# Vegetables  broccoli    1
#             carrot      1
#             squash      1
# Name: key, dtype: int64
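The loop above can also be wrapped in a small reusable function. The following is only a sketch under assumptions: the name `parse_grouped_lines` and its parameters are hypothetical (they do not come from the answer), and it assumes that any line found in `categories` starts a new group. It collects one row per item instead of two parallel lists, and it skips blank lines and items that appear before the first category.

```python
import pandas as pd

def parse_grouped_lines(lines, categories):
    """Pair every non-category line with the most recent category line.

    Hypothetical helper for illustration; not part of pandas.
    """
    rows = []
    current = None  # no category seen yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line in categories:
            current = line  # start a new group
        elif current is not None:
            rows.append({'Category': current, 'Item': line})
    return pd.DataFrame(rows, columns=['Category', 'Item'])

sample = "Fruits\napple\norange\nVegetables\nbroccoli\nMeats\nbeef"
result = parse_grouped_lines(sample.splitlines(),
                             {'Fruits', 'Vegetables', 'Meats'})
print(result)
```

Because the function accepts any iterable of strings, an open file handle could be passed in place of `sample.splitlines()`.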
Here is a solution without a for loop, using pandas.
import pandas as pd
category = ['Fruits', 'Vegetables', 'Meats']
items='''Fruits
apple
orange
pear
Vegetables
broccoli
squash
carrot
Meats
chicken
beef
lamb'''
in_df = pd.DataFrame(items.splitlines())
Create groups based on whether the row is in the category list.
in_df = in_df.assign(group=in_df.isin(category).cumsum())
Create a DataFrame from the first row of each group.
cat_df = in_df.groupby('group').first()
Join the rows after the first row of each group back to that first row, creating the category-to-item relationship.
df_out = in_df.groupby('group').apply(lambda x: x[1:]).reset_index(drop = True).merge(cat_df, left_on='group', right_index=True)
Drop the grouping key and rename the columns.
df_out = df_out.drop('group',axis=1).rename(columns={'0_x':'Fruit','0_y':'Category'})
print(df_out)
Output:
      Fruit    Category
0     apple      Fruits
1    orange      Fruits
2      pear      Fruits
3  broccoli  Vegetables
4    squash  Vegetables
5    carrot  Vegetables
6   chicken       Meats
7      beef       Meats
8      lamb       Meats
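The same grouping idea can be sketched without `apply` and `merge`, which may be easier to follow. This is an alternative under assumptions, not the answer's own code: it uses a shortened sample list, works on a Series rather than a one-column DataFrame, and the variable names (`lines`, `is_header`, `group`, `out`) are made up for illustration. `transform('first')` broadcasts the first value of each group, which is the category name, to every row of that group.

```python
import pandas as pd

category = ['Fruits', 'Vegetables', 'Meats']
lines = pd.Series(['Fruits', 'apple', 'orange',
                   'Vegetables', 'broccoli',
                   'Meats', 'beef'])

# True on category lines; the running sum gives each block its own group id
is_header = lines.isin(category)
group = is_header.cumsum()

# Broadcast the first line of each group (the category name) to all its rows
cat = lines.groupby(group).transform('first')

# Keep only the item rows and rebuild a default index
out = (pd.DataFrame({'Category': cat, 'Item': lines})[~is_header]
       .reset_index(drop=True))
print(out)
```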
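Use:
- isin to create a mask that checks for category lines
- insert to add a new column built with where and ffill (fillna with method ffill)
- boolean indexing to remove the rows that have the same value in both columns, and finally reset_index to get a unique, monotonic default index.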
import pandas as pd

category = ['Fruits', 'Vegetables', 'Meats']
items='''Fruits
apple
orange
pear
Vegetables
broccoli
squash
carrot
Meats
chicken
beef
lamb'''
df = pd.DataFrame({'Fruit':items.splitlines()})
mask = df['Fruit'].isin(category)
df.insert(0,'Category', df['Fruit'].where(mask).ffill())
df = df[df['Category'] != df['Fruit']].reset_index(drop=True)
print(df)
     Category     Fruit
0      Fruits     apple
1      Fruits    orange
2      Fruits      pear
3  Vegetables  broccoli
4  Vegetables    squash
5  Vegetables    carrot
6       Meats   chicken
7       Meats      beef
8       Meats      lamb
If necessary, finally count the Category and Fruit pairs using groupby and size:
df1 = df.groupby(['Category','Fruit']).size()
print(df1)
Category    Fruit
Fruits      apple       1
            orange      1
            pear        1
Meats       beef        1
            chicken     1
            lamb        1
Vegetables  broccoli    1
            carrot      1
            squash      1
dtype: int64
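Note that the grouped output is sorted alphabetically, so Meats appears before Vegetables, whereas the desired output in the question keeps the order of the file. A minimal sketch of one way to keep that order, assuming a shortened sample frame and a made-up variable name `counts`, is to pass `sort=False` to `groupby`, which keeps groups in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Fruits', 'Fruits', 'Vegetables', 'Meats'],
    'Fruit': ['apple', 'orange', 'broccoli', 'beef'],
})

# sort=False keeps the groups in the order they first appear in the data
counts = df.groupby(['Category', 'Fruit'], sort=False).size()
print(counts)
```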