从只有几行的 csv 文件填充数据帧的 pythonic 方法是什么？

Question

我有一个只有以下两行的 csv 文件：

ID    CODE       DATE        STARTDATE         AVERAGE1    AGGREGATE    MEDIAN1    MEDIAN2    AVERAGE2    AVERAGE3    POSITION
123    112     2016OCT25     2016OCT25         821          3              1021       4000       172         183          S8
124    111     2016OCT25     2016OCT25         834          3              1025       4210       173         185          ENG

位置列采用 S1-S8 和 T1-T8 的值（总共 16 个）。我想创建一个如下所示的数据框：

 AVERAGE1  MEDIAN1   MEDIAN2   AVERAGE2   AVERAGE3         POSITION                        
                                                                 S1

                                                                 S2

                                                                 S3

                                                                 S4

                                                                 S5

                                                                 S6

                                                                 S7

                                                                 S8

                                                                 T1

                                                                 T2

                                                                 T3

                                                                 T4

                                                                 T5

                                                                 T6

                                                                 T7

                                                                 T8

现在的问题是，如果 csv 文件中的 'POSITION' 列读取 'ENG' 列数据中的相同对应数据必须填充数据帧中的所有位置，从 S1- T8。如果它读取任何特定位置，例如上面第二列中的 S8，则必须将 csv 中该列的数据填充到数据框中的相应位置。为了更清楚，输出应如下所示：

 AVERAGE1     MEDIAN1     MEDIAN2     AVERAGE2      AVERAGE3    POSITION                        
    821          1021        4000       172           183        S1

    821          1021        4000       172           183        S2                                              

    821          1021        4000       172           183        S3                                                     

    821          1021        4000       172           183        S4  

    821          1021        4000       172           183        S5                                         

    821          1021        4000       172           183        S6                                                   

    821          1021        4000       172           183        S7                                                  

    834          1025        4210       173           185        S8                                                    

    821          1021        4000       172           183        T1                                                     

    821          1021        4000       172           183        T2

    821          1021        4000       172           183        T3

    821          1021        4000       172           183        T4

    821          1021        4000       172           183        T5

    821          1021        4000       172           183        T6

    821          1021        4000       172           183        T7

    821          1021        4000       172           183        T8

我通过创建一个空数据框来完成此操作，然后在读取 csv 后用 df.iloc 填充单元格。任何更有效或 pythonic 的方法将不胜感激。提前致谢！

Answer 1

我想你可以使用 reindex by pos, fillna by values from row with ENG in column POSITION, drop unnecessary columns, cast to int and last reset_index:

pos = ['S' + str(x) for x in range(1, 9)] + ['T' + str(x) for x in range(1,9)] 
print (pos)
['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 
 'T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8']

df.set_index('POSITION', inplace=True)

print (df.reindex(pos).fillna(df.ix['ENG']))
             ID   CODE       DATE  STARTDATE  AVERAGE1  AGGREGATE  MEDIAN1  \
POSITION                                                                     
S1        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S2        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S3        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S4        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S5        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S6        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S7        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S8        123.0  112.0  2016OCT25  2016OCT25     821.0        3.0   1021.0   
T1        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T2        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T3        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T4        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T5        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T6        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T7        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T8        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   

          MEDIAN2  AVERAGE2  AVERAGE3  
POSITION                               
S1         4210.0     173.0     185.0  
S2         4210.0     173.0     185.0  
S3         4210.0     173.0     185.0  
S4         4210.0     173.0     185.0  
S5         4210.0     173.0     185.0  
S6         4210.0     173.0     185.0  
S7         4210.0     173.0     185.0  
S8         4000.0     172.0     183.0  
T1         4210.0     173.0     185.0  
T2         4210.0     173.0     185.0  
T3         4210.0     173.0     185.0  
T4         4210.0     173.0     185.0  
T5         4210.0     173.0     185.0  
T6         4210.0     173.0     185.0  
T7         4210.0     173.0     185.0  
T8         4210.0     173.0     185.0

df = df.reindex(pos)
       .fillna(df.ix['ENG'])
       .drop(['ID','CODE','DATE','STARTDATE'],axis=1)
       .astype(int)
       .reset_index()

print (df)
   POSITION  AVERAGE1  AGGREGATE  MEDIAN1  MEDIAN2  AVERAGE2  AVERAGE3
0        S1       834          3     1025     4210       173       185
1        S2       834          3     1025     4210       173       185
2        S3       834          3     1025     4210       173       185
3        S4       834          3     1025     4210       173       185
4        S5       834          3     1025     4210       173       185
5        S6       834          3     1025     4210       173       185
6        S7       834          3     1025     4210       173       185
7        S8       821          3     1021     4000       172       183
8        T1       834          3     1025     4210       173       185
9        T2       834          3     1025     4210       173       185
10       T3       834          3     1025     4210       173       185
11       T4       834          3     1025     4210       173       185
12       T5       834          3     1025     4210       173       185
13       T6       834          3     1025     4210       173       185
14       T7       834          3     1025     4210       173       185
15       T8       834          3     1025     4210       173       185

另一个解决方案：

df.set_index('POSITION', inplace=True)
#remove unnecessary columns
df = df.drop(['ID','CODE','DATE','STARTDATE'],axis=1)

#index values
pos = ['S' + str(x) for x in range(1, 9)] + ['T' + str(x) for x in range(1,9)] 
#DataFrame constructor
df1 = pd.DataFrame(data=[df.ix['ENG']], index=pos)
df1.index.name = 'POSITION'

#get all not ENG index values
not_eng_idx = df.drop('ENG').index
print (not_eng_idx)
Index(['S8'], dtype='object', name='POSITION')

#overwrite DataFrame by values from df
df1.ix[not_eng_idx] = df.ix[not_eng_idx]
df1.reset_index(inplace=True)

print (df1)
   POSITION  AVERAGE1  AGGREGATE  MEDIAN1  MEDIAN2  AVERAGE2  AVERAGE3
0        S1       834          3     1025     4210       173       185
1        S2       834          3     1025     4210       173       185
2        S3       834          3     1025     4210       173       185
3        S4       834          3     1025     4210       173       185
4        S5       834          3     1025     4210       173       185
5        S6       834          3     1025     4210       173       185
6        S7       834          3     1025     4210       173       185
7        S8       821          3     1021     4000       172       183
8        T1       834          3     1025     4210       173       185
9        T2       834          3     1025     4210       173       185
10       T3       834          3     1025     4210       173       185
11       T4       834          3     1025     4210       173       185
12       T5       834          3     1025     4210       173       185
13       T6       834          3     1025     4210       173       185
14       T7       834          3     1025     4210       173       185
15       T8       834          3     1025     4210       173       185

从只有几行的 csv 文件填充数据帧的 pythonic 方法是什么？

What is a pythonic way of populating in a dataframe from a csv file that has only few rows?

python

dataframe

python-2.7

pandas

reindex