Read inconsistently formatted CSV file into Pandas DataFrame (blocks with headline and repeating column headers)
I have a CSV file that basically looks like the following (I shortened it to a minimal example showing the structure):
ID1#First_Name
TIME_BIN,COUNT,AVG
09:00-12:00,100,50
15:00-18:00,24,14
21:00-23:00,69,47
ID2#Second_Name
TIME_BIN,COUNT,AVG
09:00-12:00,36,5
15:00-18:00,74,68
21:00-23:00,22,76
ID3#Third_Name
TIME_BIN,COUNT,AVG
09:00-12:00,15,10
15:00-18:00,77,36
21:00-23:00,55,18
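As you can see, the data is split into blocks. Each block has a headline (e.g. ID1#First_Name) that contains two pieces of information (IDx and x_Name), separated by #.
Each headline is followed by the column headers (TIME_BIN, COUNT, AVG), which stay the same for all blocks.
Then follow some data rows that belong to the column headers (e.g. TIME_BIN=09:00-12:00, COUNT=100, AVG=50).
I would like to parse this file into a Pandas DataFrame that looks like the following: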
ID Name TIME_BIN COUNT AVG
ID1 First_Name 09:00-12:00 100 50
ID1 First_Name 15:00-18:00 24 14
ID1 First_Name 21:00-23:00 69 47
ID2 Second_Name 09:00-12:00 36 5
ID2 Second_Name 15:00-18:00 74 68
ID2 Second_Name 21:00-23:00 22 76
ID3 Third_Name 09:00-12:00 15 10
ID3 Third_Name 15:00-18:00 77 36
ID3 Third_Name 21:00-23:00 55 18
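This means that the headline must not be skipped: it has to be split at the # and then linked to the data rows of the block it belongs to. Furthermore, the column headers are needed only once, since they do not change later on.
Somehow I managed to achieve my goal with the code below. However, the approach looks overly complicated and not robust to me, and I am sure there are better ways to do this. Any suggestions are welcome!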
import pandas as pd
from io import StringIO  # Python 3; for Python 2 use: from StringIO import StringIO

pathToFile = 'mydata.txt'

# read the text file into a StringIO object and skip the repeating column header rows
s = StringIO()
with open(pathToFile) as file:
    for line in file:
        if not line.startswith('TIME_BIN'):
            s.write(line)

# reset buffer to the beginning of the StringIO object
s.seek(0)

# create new dataframe with desired column names
df = pd.read_csv(s, names=['TIME_BIN', 'COUNT', 'AVG'])

# split the headline string which is currently found in the TIME_BIN column
# and insert both parts as new dataframe columns.
# the headline is identified by its start which is 'ID'
df['ID'] = df[df.TIME_BIN.str.startswith('ID')].TIME_BIN.str.split('#').str.get(0)
df['Name'] = df[df.TIME_BIN.str.startswith('ID')].TIME_BIN.str.split('#').str.get(1)

# fill the NaN values in the ID and Name columns by propagating the last valid observation
df['ID'] = df['ID'].ffill()
df['Name'] = df['Name'].ffill()

# remove all rows where TIME_BIN starts with 'ID'
df['TIME_BIN'] = df['TIME_BIN'].drop(df[df.TIME_BIN.str.startswith('ID')].index)
df = df.dropna(subset=['TIME_BIN'])

# reorder columns to bring ID and Name to the front
cols = list(df)
cols.insert(0, cols.pop(cols.index('Name')))
cols.insert(0, cols.pop(cols.index('ID')))
df = df.loc[:, cols]
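The steps above (split the headline, forward-fill, drop the helper rows) can probably be written more compactly with boolean masks. The following is only a sketch, assuming that headline rows are the only rows containing # and that COUNT and AVG are always numeric; the names raw, is_headline, is_header, parts and labels are illustrative and not part of the original code.

```python
from io import StringIO

import pandas as pd

# Shortened copy of the sample data from the question.
raw = """ID1#First_Name
TIME_BIN,COUNT,AVG
09:00-12:00,100,50
15:00-18:00,24,14
ID2#Second_Name
TIME_BIN,COUNT,AVG
09:00-12:00,36,5
"""

# Read everything as three columns; headline rows get NaN in COUNT and AVG.
df = pd.read_csv(StringIO(raw), names=['TIME_BIN', 'COUNT', 'AVG'])

# Assumption: only headline rows contain '#'.
is_headline = df['TIME_BIN'].str.contains('#', regex=False)
is_header = df['TIME_BIN'].eq('TIME_BIN')

# Split the headlines once into two columns and propagate them downwards.
parts = df.loc[is_headline, 'TIME_BIN'].str.split('#', n=1, expand=True)
parts.columns = ['ID', 'Name']
labels = parts.reindex(df.index).ffill()

# Put ID and Name in front, then drop headline and repeated header rows.
df = labels.join(df)[~(is_headline | is_header)]

# The numeric columns were read as strings because of the header rows.
df = df.assign(COUNT=pd.to_numeric(df['COUNT']),
               AVG=pd.to_numeric(df['AVG'])).reset_index(drop=True)
print(df)
```

Because ID and Name are joined on the left, no separate column reordering step is needed.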
# alternative: prepend the current ID and name to every data row while reading
import pandas as pd
from io import StringIO
import sys

pathToFile = 'mydata.txt'
s = StringIO()
cur_ID = None
with open(pathToFile) as f:
    for ln in f:
        if not ln.strip():
            continue  # skip blank lines
        if ln.startswith('ID'):
            # 'ID1#First_Name\n' becomes 'ID1,First_Name,'
            cur_ID = ln.replace('\n', ',', 1).replace('#', ',', 1)
            continue
        if ln.startswith('TIME'):
            continue  # skip the repeating column header rows
        if cur_ID is None:
            print('NO ID found')
            sys.exit(1)
        s.write(cur_ID + ln)

s.seek(0)
# create new dataframe with desired column names
df = pd.read_csv(s, names=['ID', 'Name', 'TIME_BIN', 'COUNT', 'AVG'])
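If re-serializing the rows into a StringIO buffer feels roundabout, the same line-by-line idea can also build the records directly. This is a minimal sketch, not a drop-in replacement: parse_blocks and its parameters header_prefix and sep are hypothetical names, and it assumes every data row has exactly three comma-separated fields with integer COUNT and AVG.

```python
import pandas as pd


def parse_blocks(lines, header_prefix='TIME_BIN', sep='#'):
    """Yield one dict per data row, tagged with the current block's ID and name."""
    cur_id = cur_name = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line or line.startswith(header_prefix):
            continue  # skip blank lines and the repeated column headers
        if sep in line:
            # headline such as 'ID1#First_Name' starts a new block
            cur_id, cur_name = line.split(sep, 1)
            continue
        if cur_id is None:
            raise ValueError('data row found before any headline')
        time_bin, count, avg = line.split(',')
        yield {'ID': cur_id, 'Name': cur_name, 'TIME_BIN': time_bin,
               'COUNT': int(count), 'AVG': int(avg)}


# Shortened copy of the sample data from the question.
sample = [
    'ID1#First_Name\n',
    'TIME_BIN,COUNT,AVG\n',
    '09:00-12:00,100,50\n',
    '15:00-18:00,24,14\n',
    'ID2#Second_Name\n',
    'TIME_BIN,COUNT,AVG\n',
    '09:00-12:00,36,5\n',
    '15:00-18:00,74,68\n',
]

df = pd.DataFrame(list(parse_blocks(sample)))
print(df)
```

For a real file, the list can be replaced by an open file object, e.g. `with open('mydata.txt') as f: df = pd.DataFrame(list(parse_blocks(f)))`. Raising an exception instead of calling sys.exit keeps the function usable from other code.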