How to transform this into a dataframe and save it as a csv?
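This data was originally provided as a .txt file. I converted it to .csv format and tried to sort it into the desired form, but failed. I am trying to find a way to transform this data structure (shown below):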
bakeryA
77300 Baker Street
bun: [10,20,30,10]
donut: [20,10,40,0]
bread: [0,10,15,10]
bakery B
78100 Cerabut St
data not available
bakery C
80300 Sulkeh St
bun: [29,50,20,30]
donut: [10,10,30,10]
bread: [10,15,10,20]
into this dataframe:
Name | Address | type | salt | sugar | water | flour |
---|---|---|---|---|---|---|
Bakery A | 77300 Baker Street | bun | 10 | 20 | 30 | 10 |
Bakery A | 77300 Baker Street | donut | 20 | 10 | 40 | 0 |
Bakery A | 77300 Baker Street | bread | 0 | 10 | 15 | 10 |
Bakery B | 78100 Cerabut St | NaN | NaN | NaN | NaN | NaN |
Bakery C | 80300 Sulkeh St | bun | 29 | 50 | 20 | 30 |
Bakery C | 80300 Sulkeh St | donut | 10 | 10 | 30 | 10 |
Bakery C | 80300 Sulkeh St | bread | 10 | 15 | 10 | 20 |
Thanks!
This has less to do with pandas and more to do with parsing an unstructured source into structured data. Try this:
from ast import literal_eval
from enum import IntEnum

import pandas as pd


class LineType(IntEnum):
    BakeryName = 1
    Address = 2
    Ingredients = 3


data = []
with open('data.txt') as fp:
    line_type = LineType.BakeryName
    for line in fp:
        line = line.strip()
        if line_type == LineType.BakeryName:
            name = line  # the current line contains the Bakery Name
            line_type = LineType.Address  # the next line is the Bakery Address
        elif line_type == LineType.Address:
            address = line  # the current line contains the Bakery Address
            line_type = LineType.Ingredients  # the next line contains the Ingredients
        elif line_type == LineType.Ingredients and line == 'data not available':
            data.append({
                'Name': name,
                'Address': address
            })  # no Ingredients info available
            line_type = LineType.BakeryName  # next line is Bakery Name
        elif line_type == LineType.Ingredients:
            # if the line does not follow the ingredient's format, we
            # overstepped into the Bakery Name line. Then the next line
            # is Bakery Address
            try:
                bakery_type, ingredients = line.split(':')
                ingredients = literal_eval(ingredients.strip())
                data.append({
                    'Name': name,
                    'Address': address,
                    'type': bakery_type,
                    'salt': ingredients[0],
                    'sugar': ingredients[1],
                    'water': ingredients[2],
                    'flour': ingredients[3],
                })
            except (ValueError, SyntaxError, IndexError, TypeError):
                # not an ingredient line, so treat it as the next Bakery Name
                name = line
                line_type = LineType.Address

df = pd.DataFrame(data)
# save as csv; 'data.csv' is only an example file name
df.to_csv('data.csv', index=False)
This assumes your data file follows the format shown. Even a slight deviation (a blank line, for example) will cause errors.
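If blank lines are a real possibility in your file, one way to make the parsing less fragile is to drop them up front and recognise ingredient lines with a regular expression instead of tracking the expected line type. The following is only a sketch under some assumptions: every ingredient line holds exactly four integers in the order salt, sugar, water, flour, and the names `parse_bakeries`, `ITEM_RE`, `COLUMNS` and `SAMPLE` are made up for this example. It uses only the standard library so the parsing step can be checked on its own.

```python
import csv
import io
import re

COLUMNS = ['Name', 'Address', 'type', 'salt', 'sugar', 'water', 'flour']
# Matches ingredient lines such as "bun: [10,20,30,10]".
ITEM_RE = re.compile(r'^(\w+)\s*:\s*\[([^\]]*)\]$')


def parse_bakeries(lines):
    """Hypothetical helper: turn the raw lines into a list of row dicts."""
    # Dropping blank lines first is what makes this variant tolerate them.
    lines = [ln.strip() for ln in lines if ln.strip()]
    rows, i = [], 0
    while i + 1 < len(lines):
        name, address = lines[i], lines[i + 1]
        i += 2
        start = len(rows)
        # Consume every following line that looks like "type: [a,b,c,d]".
        while i < len(lines):
            m = ITEM_RE.match(lines[i])
            if m is None:
                break
            values = [int(v) for v in m.group(2).split(',')]
            rows.append(dict(zip(COLUMNS, [name, address, m.group(1), *values])))
            i += 1
        if len(rows) == start:
            # No ingredient lines found: keep the bakery with empty columns
            # and skip the "data not available" marker if it is there.
            rows.append({'Name': name, 'Address': address})
            if i < len(lines) and lines[i].lower() == 'data not available':
                i += 1
    return rows


# Shortened sample with blank lines added on purpose.
SAMPLE = """\
bakeryA
77300 Baker Street
bun: [10,20,30,10]

donut: [20,10,40,0]
bakery B
78100 Cerabut St
data not available

bakery C
80300 Sulkeh St
bread: [10,15,10,20]
"""

rows = parse_bakeries(SAMPLE.splitlines())

# Write the rows as CSV text; missing keys become empty cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval='')
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With pandas installed, the same `rows` can be passed to `pd.DataFrame(rows, columns=COLUMNS)`, where the missing cells show up as NaN, and then saved with `df.to_csv(..., index=False)`. For a real file, pass the open file object to `parse_bakeries` instead of `SAMPLE.splitlines()`.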