Creating new columns that contain the value of a specific index

Question

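I have tried several approaches that get me close to, but not exactly, the final output I want. First I am trying to create a few columns in the original dataframe that hold specific cell contents, picked by position. After that I am trying to make a particular row the header row and skip all the rows above it.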

Original input:

    |           NA            |  NA_1 |  NA_2  |  NA_3 |
0   | 12-Month Percent Change |  NaN  |  NaN   |  NaN  |
1   | Series Id: CUUR0000SAF1 |  NaN  |  NaN   |  NaN  |
2   |       Item: Food        |  NaN  |  NaN   |  NaN  |
3   |           Year          |  Jan  |  Feb   |  Mar  |
4   |           2010          | -0.4  | -0.2   |  0.2  |
5   |           2011          |  1.8  |  2.3   |  2.9  |

Code used:

df1['View Description'] = df1.iat[0,0]
df1['Series ID'] = df1.iat[1,1]
df1['Series Name'] = df1.iat[2,1]
df1

Result:

                        NA          NA.1  NA.2  NA.3  NA.4  NA.5  NA.6  NA.7         View Description     Series ID  Series Name
0  12-Month Percent Change           NaN   NaN   NaN   NaN   NaN   NaN   NaN  12-Month Percent Change  CUUR0000SAF1         Food
1               Series Id:  CUUR0000SAF1   NaN   NaN   NaN   NaN   NaN   NaN  12-Month Percent Change  CUUR0000SAF1         Food
2                    Item:          Food   NaN   NaN   NaN   NaN   NaN   NaN  12-Month Percent Change  CUUR0000SAF1         Food
3                     Year           Jan   Feb   Mar   Apr   May   Jun   Jul  12-Month Percent Change  CUUR0000SAF1         Food
4                     2010          -0.4  -0.2   0.2   0.5   0.7   0.7   0.9  12-Month Percent Change  CUUR0000SAF1         Food
5                     2011           1.8   2.3   2.9   3.2   3.5   3.7   4.2  12-Month Percent Change  CUUR0000SAF1         Food
6                     2012           4.4   3.9   3.3   3.1   2.8   2.7   2.3  12-Month Percent Change  CUUR0000SAF1         Food
7                     2013           1.6   1.6   1.5   1.5   1.4   1.4   1.4  12-Month Percent Change  CUUR0000SAF1         Food


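The last thing I want is to make row 3 the header and delete all the rows above it, while still keeping the three columns at the end: View Description, Series ID, and Series Name.

Any suggestions for an efficient way to do this next step would be appreciated. I would like to extend it with a for loop, or some other method, so that the same process can be applied to 10 files.

Thanks in advance!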

Answer 1

I think this answers your question:

# Parse and store the first 3 values in column 0 so that we can use them 
# as values for 3 new columns later.
new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()]

# Transpose so that we can use set_index() to replace the index 
# (the column labels of the original df1) with the values in row 3, 
# ['Year', 'Jan', 'Feb', 'Mar'], then transpose back so that the 
# new index becomes the columns.
df1 = df1.T.set_index(3).T

# Use reset_index() to replace the index with a fresh range 
# index (0, 1, 2, ...) so we can use iloc[] to discard the 
# first 3 unwanted rows, then call reset_index() again.
df1 = df1.reset_index(drop=True).iloc[3:].reset_index(drop=True)

# Get rid of vestigial name for columns.
df1.columns.names = [None]

# Add the three new columns set to the values saved earlier.
df1[['View Description', 'Series ID', 'Series Name']] = new_columns
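
As a side note on the first step, `split(':')[-1].strip()` keeps only the text after the last colon and leaves strings without a colon unchanged. A quick check on the three cell values from the sample input (the `cells` and `parsed` names are just for illustration):

```python
# The three cell values from the question's sample input.
cells = ['12-Month Percent Change', 'Series Id: CUUR0000SAF1', 'Item: Food']

# Keep only the text after the last colon; strings without a colon
# pass through unchanged.
parsed = [c.split(':')[-1].strip() for c in cells]
print(parsed)
# ['12-Month Percent Change', 'CUUR0000SAF1', 'Food']
```

If a value could itself contain a colon, `c.split(':', 1)[-1]` would keep everything after the first colon instead. The expression also assumes the cells are strings; a real NaN (a float) has no `split` method.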

Here is the full test case (the commented code above has been condensed into fewer lines):

import pandas as pd
s = [
'    |           NA            |  NA_1 |  NA_2  |  NA_3 |',
'0   | 12-Month Percent Change |  NaN  |  NaN   |  NaN  |',
'1   | Series Id: CUUR0000SAF1 |  NaN  |  NaN   |  NaN  |',
'2   |       Item: Food        |  NaN  |  NaN   |  NaN  |',
'3   |           Year          |  Jan  |  Feb   |  Mar  |',
'4   |           2010          | -0.4  | -0.2   |  0.2  |',
'5   |           2011          |  1.8  |  2.3   |  2.9  |']

df1 = pd.DataFrame(
    [[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]],
    columns = [x.strip() for x in s[0].split('|')[1:-1]],
)
print(df1)
new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()]
df1 = df1.T.set_index(3).T.reset_index(drop=True).iloc[3:].reset_index(drop=True)
df1.columns.names = [None]
df1[['View Description', 'Series ID', 'Series Name']] = new_columns
print(df1)

Output:

                        NA  NA_1  NA_2 NA_3
0  12-Month Percent Change   NaN   NaN  NaN
1  Series Id: CUUR0000SAF1   NaN   NaN  NaN
2               Item: Food   NaN   NaN  NaN
3                     Year   Jan   Feb  Mar
4                     2010  -0.4  -0.2  0.2
5                     2011   1.8   2.3  2.9
   Year   Jan   Feb  Mar         View Description     Series ID Series Name
0  2010  -0.4  -0.2  0.2  12-Month Percent Change  CUUR0000SAF1        Food
1  2011   1.8   2.3  2.9  12-Month Percent Change  CUUR0000SAF1        Food
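
For comparison, the double transpose is not strictly needed. A minimal sketch of an alternative, assuming the same sample layout as above: read the labels from the header row, keep only the rows below it, and assign the labels to `columns`. The names `labels` and `out` are my own and are not part of the code above.

```python
import pandas as pd

df1 = pd.DataFrame(
    [
        ['12-Month Percent Change', 'NaN', 'NaN', 'NaN'],
        ['Series Id: CUUR0000SAF1', 'NaN', 'NaN', 'NaN'],
        ['Item: Food', 'NaN', 'NaN', 'NaN'],
        ['Year', 'Jan', 'Feb', 'Mar'],
        ['2010', '-0.4', '-0.2', '0.2'],
        ['2011', '1.8', '2.3', '2.9'],
    ],
    columns=['NA', 'NA_1', 'NA_2', 'NA_3'],
)

header_row = 3
new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3, 0].to_list()]

# Take the header labels from row 3 as a plain list, then keep only
# the rows below it and renumber them from 0.
labels = df1.iloc[header_row].to_list()
out = df1.iloc[header_row + 1:].reset_index(drop=True)
out.columns = labels

# Add the three constant-valued columns one at a time.
for name, value in zip(['View Description', 'Series ID', 'Series Name'], new_columns):
    out[name] = value

print(out)
```

Because `labels` is a plain list, the new column index has no leftover name, so the `df1.columns.names = [None]` step is not needed in this variant.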

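UPDATE: This version lets us configure (1) the coordinates of each of the 3 cells whose contents become the new column values (new_col_coords) and (2) header_row, the row that becomes the header and above which all rows are discarded: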

import pandas as pd
s = [
'    |           NA            |  NA_1 |  NA_2  |  NA_3 |',
'0   | 12-Month Percent Change |  NaN  |  NaN   |  NaN  |',
'91  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'1   | Series Id: CUUR0000SAF1 |  Abc  |  NaN   |  NaN  |',
'92  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'93  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'94  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'2   |       Item: Food        |  Xyz  |  NaN   |  NaN  |',
'95  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'96  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'97  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'98  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'3   |           Year          |  Jan  |  Feb   |  Mar  |',
'4   |           2010          | -0.4  | -0.2   |  0.2  |',
'5   |           2011          |  1.8  |  2.3   |  2.9  |']

df1 = pd.DataFrame(
    [[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]],
    columns = [x.strip() for x in s[0].split('|')[1:-1]],
)
print(df1)

# parse and store the 3 values at specified coordinates so that we can use them as values for 3 new columns later
new_col_coords = [[0,0], [2,1], [6,1]]
new_columns = [x.split(':')[-1].strip() for x in [df1.iloc[i, j] for i, j in new_col_coords]]

header_row = 11

# Here's how to do everything that follows in one line of code:
#df1 = df1.T.set_index(header_row).T.reset_index(drop=True).iloc[header_row:].reset_index(drop=True)

# Transpose so that we can use set_index() to change the index to the values in header_row (['Year', 'Jan', 'Feb', 'Mar']), then transpose back so that index becomes the columns
df1 = df1.T.set_index(header_row).T

# Use reset_index() to replace the index with a fresh range index (0, 1, 2, ...) so we can use iloc[] to discard the unwanted rows above header_row, then call reset_index() again
df1 = df1.reset_index(drop=True).iloc[header_row:].reset_index(drop=True)

# Get rid of vestigial name for columns
df1.columns.names = [None]

# Add the three new columns set to the values saved earlier
df1[['View Description', 'Series ID', 'Series Name']] = new_columns

print(df1)

Output:

                         NA  NA_1  NA_2 NA_3
0   12-Month Percent Change   NaN   NaN  NaN
1           To be discarded   NaN   NaN  NaN
2   Series Id: CUUR0000SAF1   Abc   NaN  NaN
3           To be discarded   NaN   NaN  NaN
4           To be discarded   NaN   NaN  NaN
5           To be discarded   NaN   NaN  NaN
6                Item: Food   Xyz   NaN  NaN
7           To be discarded   NaN   NaN  NaN
8           To be discarded   NaN   NaN  NaN
9           To be discarded   NaN   NaN  NaN
10          To be discarded   NaN   NaN  NaN
11                     Year   Jan   Feb  Mar
12                     2010  -0.4  -0.2  0.2
13                     2011   1.8   2.3  2.9
   Year   Jan   Feb  Mar         View Description Series ID Series Name
0  2010  -0.4  -0.2  0.2  12-Month Percent Change       Abc         Xyz
1  2011   1.8   2.3  2.9  12-Month Percent Change       Abc         Xyz
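
To cover the "10 files" part of the question, the same steps could be wrapped in a function and applied in a loop. The following is only a sketch: `tidy_frame` and `make_raw` are hypothetical helpers, `EXAMPLE_ID_2` and `Other item` are placeholder values, and the in-memory frames stand in for whatever you actually read from disk (for example with `pd.read_excel(path, header=None)`, which is an assumption about your files). The coordinates follow the layout in the question's result, where the series ID and name sit in column 1.

```python
import pandas as pd

def tidy_frame(df, header_row, new_col_coords, new_col_names):
    """Hypothetical helper: promote header_row to the header, drop the
    rows above it, and append constant columns read from new_col_coords."""
    values = [str(df.iat[i, j]).split(':')[-1].strip() for i, j in new_col_coords]
    out = df.iloc[header_row + 1:].reset_index(drop=True)
    out.columns = df.iloc[header_row].to_list()
    for name, value in zip(new_col_names, values):
        out[name] = value
    return out

def make_raw(series_id, item, rows):
    # Stand-in for reading one file; mimics the layout in the question.
    top = [
        ['12-Month Percent Change', None, None, None],
        ['Series Id:', series_id, None, None],
        ['Item:', item, None, None],
        ['Year', 'Jan', 'Feb', 'Mar'],
    ]
    return pd.DataFrame(top + rows)

# Placeholder inputs; the second ID and item are made up for illustration.
raw_frames = [
    make_raw('CUUR0000SAF1', 'Food', [[2010, -0.4, -0.2, 0.2], [2011, 1.8, 2.3, 2.9]]),
    make_raw('EXAMPLE_ID_2', 'Other item', [[2010, 1.0, 2.0, 3.0]]),
]

coords = [(0, 0), (1, 1), (2, 1)]
names = ['View Description', 'Series ID', 'Series Name']

# Apply the same clean-up to every frame, then stack the results.
tidy = [tidy_frame(raw, 3, coords, names) for raw in raw_frames]
combined = pd.concat(tidy, ignore_index=True)
print(combined)
```

Keeping `header_row` and `new_col_coords` as parameters means a file with a slightly different layout only needs different arguments, not different code.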
