Read CSV file that needs data sanitization prior to loading into dataframe

Question

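I'm reading a CSV file into pandas. The problem is that some rows need to be removed, and values on the remaining rows have to be derived from them. My current idea starts like this: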

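    import csv

    index_count = 0
    current_ward = None
    with open(down_path.name, newline='') as csv_file:
        rdr = csv.DictReader(csv_file)
        for row in rdr:
            # the first column has an empty header, so its key is ''
            row_type = row['']
            if row_type == 'Ward Summary':
                # remember the ward for the individual rows that follow
                current_ward = row['Name']
            else:
                name = row['Name']
                count1 = row['count1']
                count2 = row['count2']
                count3 = row['count3']
                index_count += 1
            # write to someplace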

,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0

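The end result needs to land in a dataframe that I can concatenate to an existing dataframe.

The brain-dead way to do this is to simply run my transformation, write a new CSV file, and then read that in. That seems like a non-Pythonic way to go.

I need to remove the summary rows, merge the wards with similar names (Aloha 1 and Aloha I), drop the Individual Statistics label, and attach the Aloha 1 label to each individual. I also need to add which month the data came from. As you can see, the data needs some work :)

The desired output is Jan-16, Aloha 1, John, 1,2,3

Aloha 1 comes from the summary row above it.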

Answer 1

My personal preference would be to do it all in Pandas.

Perhaps something like this...

    # imports (io.StringIO replaces the Python 2 StringIO module)
    from io import StringIO

    import numpy as np
    import pandas as pd

    # read in your data
    data = """,Name,count1,count2,count3
    Ward Summary,Aloha 1,35,0,0
    Individual Statistics,John,35,0,0
    Ward Summary,Aloha I,794,0,0
    Individual Statistics,Walter,476,0,0
    Individual Statistics,Deborah,182,0,0"""
    # strip the indentation this listing adds to the data lines
    data = "\n".join(line.strip() for line in data.splitlines())
    df = pd.read_csv(StringIO(data))

    # give the first column a better name for convenience
    df = df.rename(columns={'Unnamed: 0': 'Desc'})

    # create a mask for the Ward Summary lines
    ws_mask = df['Desc'] == 'Ward Summary'

    # create a ward_name column that has names only for Ward Summary lines
    df['ward_name'] = np.where(ws_mask, df['Name'], np.nan)

    # forward fill the missing ward names from the previous summary line
    df['ward_name'] = df['ward_name'].ffill()

    # get rid of the ward summary lines (.ix has been removed from pandas; use .loc)
    df = df.loc[~ws_mask]

    # get rid of the Desc column (drop returns a new frame, so assign it back)
    df = df.drop(columns='Desc')
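
The question also asks for the similar ward names to be merged, for a month column, and for the result to be concatenated onto an existing dataframe. A minimal sketch of those remaining steps follows. The alias table `WARD_ALIASES`, the literal `'Jan-16'`, and the `existing` frame are assumptions made for illustration, since the question does not say where the month or the canonical ward names come from.

```python
from io import StringIO

import pandas as pd

data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""

# Same cleanup as above, condensed: label each row with its ward,
# then drop the summary rows and the description column.
df = pd.read_csv(StringIO(data)).rename(columns={'Unnamed: 0': 'Desc'})
ws_mask = df['Desc'] == 'Ward Summary'
df['ward_name'] = df['Name'].where(ws_mask).ffill()
df = df.loc[~ws_mask].drop(columns='Desc')

# Hypothetical alias table: maps each variant spelling to the canonical name.
WARD_ALIASES = {'Aloha I': 'Aloha 1'}
df['ward_name'] = df['ward_name'].replace(WARD_ALIASES)

# Assumption: the month is known from outside the file (e.g. its file name).
df.insert(0, 'month', 'Jan-16')

# Put the columns in the order of the desired output.
df = df[['month', 'ward_name', 'Name', 'count1', 'count2', 'count3']]
df = df.reset_index(drop=True)

# Hypothetical existing frame with the same columns, made up for this example.
existing = pd.DataFrame([{
    'month': 'Dec-15', 'ward_name': 'Aloha 1', 'Name': 'John',
    'count1': 30, 'count2': 0, 'count3': 0,
}])
combined = pd.concat([existing, df], ignore_index=True)
```

An explicit alias table is only one way to merge the names. It keeps the mapping visible and easy to audit, but every new variant has to be added by hand.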

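Yes, you are passing over the data more than once, so you could do better with a smarter single-pass algorithm. But if performance isn't your main concern, I think this has benefits in terms of brevity and readability.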
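
For comparison, here is a rough sketch of what a single-pass version might look like. It builds on the `csv.DictReader` loop from the question and hands the cleaned rows straight to pandas, so no intermediate CSV file is written. The helper name `read_ward_rows` is hypothetical, and the sketch assumes the count columns always hold integers.

```python
import csv
from io import StringIO

import pandas as pd

data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""


def read_ward_rows(csv_file):
    """Yield one dict per individual row, tagged with the latest ward name."""
    current_ward = None
    for row in csv.DictReader(csv_file):
        # The first column has an empty header, so DictReader keys it as ''.
        if row[''] == 'Ward Summary':
            current_ward = row['Name']
            continue  # summary rows are not emitted
        yield {
            'ward_name': current_ward,
            'Name': row['Name'],
            # Assumption: counts are always integers in the source file.
            'count1': int(row['count1']),
            'count2': int(row['count2']),
            'count3': int(row['count3']),
        }


# With a real file, pass open(path, newline='') instead of StringIO(data).
df = pd.DataFrame(list(read_ward_rows(StringIO(data))))
```

This reads each line once and never materializes the summary rows in the dataframe, at the cost of being longer than the pandas-only version.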
