处理堆叠的合并表

Dealing with Merged Tables that are Stacked

我通过 Pandas read_table 导入了一个 csv,它本质上是一个堆叠的列,其中每个学生都被命名,然后在下面的行中找到值。

Student-John
01/01/2021 334
01/02/2021 456
Student-Sally
01/01/2021 76
01/04/2021 789

我想旋转这些,以便每个学生都有自己的列,并且日期在左侧。

Date Student-Jon Student-Sally
01/01/2021 334 76
01/02/2021 456
01/04/2021 789

我的方法是通过 pandas 数据框引入 CSV。

import pandas as pd
df = pd.read_table('C:/Users/****data.csv', skiprows=1, header=None)
df[2]=""
df.columns = "Date", "Val"

x="Start"

#Started with this although the Student line doesn't work

for ind, row in df.iterrows():
    if df['Date'][ind] == "Student*":
        x = df['Date'][ind]
        df.drop(ind, inplace=True)
    else:
        df['Val'][ind] = x 

一个朴素的解决方案看起来像:

import pandas as pd

details = {
    'Date' : ['Student-John','01/01/2021','01/02/2021','Student-Sally','01/01/2021','01/02/2021'],
    'val' : ['', 'Y', 'N', '', 'N', 'N'],
}

df = pd.DataFrame(details)
print(df)

            Date val
0   Student-John    
1     01/01/2021   Y
2     01/02/2021   N
3  Student-Sally    
4     01/01/2021   N
5     01/02/2021   N

# Creating a new data frame - df1
# adding new dates to dates list and new column names
# to columns list, temp list contains Y/N values
# Adding dates and Y/N values to dataframe whenever
# we find a new column name

det = {}
df1 = pd.DataFrame(det)
temp_list = []
date_list = []
col_name = 'empty'
for ind in df.index:
  if df['val'][ind] == '':
    df1[col_name] = temp_list
    df1['Date'] = date_list
    temp_list = []
    date_list = []
    col_name = df['Date'][ind]
  else:
    temp_list.append(df['val'][ind])
    date_list.append(df['Date'][ind])
    if ind == len(df)-1:
       df1[col_name] = temp_list

del df1['empty']
print(df1)

         Date Student-John Student-Sally
0  01/01/2021            Y             N
1  01/02/2021            N             N

使用布尔掩码过滤掉您的数据框,然后 pivot 重塑它:

# Rename columns
df.columns = ['Date', 'Value']

# Find Student rows
m = df[0].str.startswith('Student')

# Create the future column
df['Student'] = df[0].mask(~m).ffill()

# Remove Student rows
df = df[~m]

# Reshape your dataframe
df = df.pivot('Date', 'Student', 'Value').rename_axis(columns=None).reset_index()

输出:

>>> df
         Date Student-John Student-Sally
0  01/01/2021          334            76
1  01/02/2021          456           NaN
2  01/04/2021          NaN           789

设置:

import pandas as pd
import numpy as np

data = {0: ['Student-John', '01/01/2021', '01/02/2021',
            'Student-Sally', '01/01/2021', '01/04/2021'],
        1: [np.nan, '334', '456', np.nan, '76', '789']}
df = pd.DataFrame(data)
print(df)

# Output
               0    1
0   Student-John  NaN
1     01/01/2021  334
2     01/02/2021  456
3  Student-Sally  NaN
4     01/01/2021   76
5     01/04/2021  789