Read a CSV file that needs data sanitization before loading it into a DataFrame
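I am reading a CSV file into pandas. The problem is that some rows in the file need to be removed and other rows need values computed from them. My current idea starts like this: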
    import csv

    index_count = 0
    with open(down_path.name) as csv_file:
        rdr = csv.DictReader(csv_file)
        for row in rdr:
            # the first column has an empty header, so its key is ''
            row_type = row['']
            if row_type == 'Ward Summary':
                current_ward = row['Name']
            else:
                name = row['Name']
                count1 = row['count1']
                count2 = row['count2']
                count3 = row['count3']
                index_count += 1
                # write to someplace
Sample input:

    ,Name,count1,count2,count3
    Ward Summary,Aloha 1,35,0,0
    Individual Statistics,John,35,0,0
    Ward Summary,Aloha I,794,0,0
    Individual Statistics,Walter,476,0,0
    Individual Statistics,Deborah,182,0,0
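The end result needs to land in a DataFrame that I can concatenate to an existing DataFrame.
The brain-dead way to do this would be to apply my transformations, write a new CSV file, and then read that back in. That seems like an un-Pythonic way to go.
I need to remove the summary rows, merge wards with similar names (Aloha 1 and Aloha I), drop the "Individual Statistics" label, and attach the Aloha 1 label to each individual. I also need to add which month the data comes from. As you can see, the data needs some work :)
The desired output is

    Jan-16,Aloha 1,John,1,2,3

where Aloha 1 comes from the summary row above.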
My personal preference would be to do everything in pandas.
Maybe something like this...
    # imports
    from io import StringIO  # Python 3 location; Python 2 used the StringIO module

    import numpy as np
    import pandas as pd

    # read in your data
    data = """,Name,count1,count2,count3
    Ward Summary,Aloha 1,35,0,0
    Individual Statistics,John,35,0,0
    Ward Summary,Aloha I,794,0,0
    Individual Statistics,Walter,476,0,0
    Individual Statistics,Deborah,182,0,0"""
    df = pd.read_csv(StringIO(data))

    # give the first column a better name for convenience
    df = df.rename(columns={'Unnamed: 0': 'Desc'})

    # create a mask for the Ward Summary lines
    ws_mask = df['Desc'] == 'Ward Summary'

    # create a ward_name column that has names only for Ward Summary lines
    df['ward_name'] = np.where(ws_mask, df['Name'], np.nan)

    # forward fill the missing ward names from the previous summary line
    df['ward_name'] = df['ward_name'].ffill()

    # get rid of the ward summary lines (.ix no longer exists in pandas; use .loc)
    df = df.loc[~ws_mask]

    # get rid of the Desc column (drop returns a new frame, so assign the result)
    df = df.drop(columns='Desc')
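The question also asks for similarly named wards to be merged, a month column to be added, and the result to be concatenated to an existing DataFrame, none of which the steps above cover. One possible continuation is sketched below. The `ward_aliases` mapping, the `'Jan-16'` month value, and the one-row `existing` frame are assumptions made for illustration; in practice the alias table would have to be maintained by hand and the month would come from somewhere outside the file.

```python
from io import StringIO

import pandas as pd

data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""

# Same cleanup as in the answer, condensed.
df = pd.read_csv(StringIO(data)).rename(columns={'Unnamed: 0': 'Desc'})
ws_mask = df['Desc'] == 'Ward Summary'
df['ward_name'] = df['Name'].where(ws_mask).ffill()
df = df.loc[~ws_mask].drop(columns='Desc')

# Hypothetical alias table: map each variant spelling to one canonical name.
ward_aliases = {'Aloha I': 'Aloha 1'}
df['ward_name'] = df['ward_name'].replace(ward_aliases)

# Assumed: the month is known from outside the file (for example its name).
df.insert(0, 'month', 'Jan-16')

# Reorder to match the desired output: month, ward, name, counts.
df = df[['month', 'ward_name', 'Name', 'count1', 'count2', 'count3']]
df = df.reset_index(drop=True)

# Made-up stand-in for the existing DataFrame mentioned in the question.
existing = pd.DataFrame([{'month': 'Dec-15', 'ward_name': 'Aloha 1',
                          'Name': 'John', 'count1': 1, 'count2': 2,
                          'count3': 3}])
combined = pd.concat([existing, df], ignore_index=True)
```

An explicit alias dictionary is used here rather than fuzzy matching, since deciding that "Aloha I" and "Aloha 1" are the same ward is a judgment about the data that is safer to spell out.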
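Yes, this makes more than one pass over the data, so a smarter single-pass algorithm could do better. But if performance is not your main concern, I think this approach has the edge in brevity and readability.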
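For comparison, a single-pass version might look roughly like the sketch below. It follows the `csv.DictReader` loop from the question and carries the most recent ward name forward while iterating. The helper name `read_ward_rows` and the `month` argument are made up for this example, and it assumes every count column holds an integer.

```python
import csv
from io import StringIO

import pandas as pd

data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""


def read_ward_rows(lines, month):
    """Yield one dict per individual, tagged with the latest ward seen."""
    current_ward = None
    for row in csv.DictReader(lines):
        # The first column has an empty header, so its key is ''.
        if row[''] == 'Ward Summary':
            current_ward = row['Name']
        else:
            yield {
                'month': month,
                'ward_name': current_ward,
                'Name': row['Name'],
                'count1': int(row['count1']),
                'count2': int(row['count2']),
                'count3': int(row['count3']),
            }


# Build the DataFrame once, after the single pass over the input.
df = pd.DataFrame(list(read_ward_rows(StringIO(data), 'Jan-16')))
```

This reads the input once and never materializes the summary rows, at the cost of doing the type conversion by hand. Ward names are left as they appear in the file, so "Aloha I" is not merged into "Aloha 1" here.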