Creating new columns that contain the value of a specific index
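I have tried several approaches that get me close to, but not quite, the final output I want. First I am trying to create a few columns that hold specific values from the original DataFrame, picked by position; after that I want to make a specific row the header row and skip every row above it.
Original input: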
| NA | NA_1 | NA_2 | NA_3 |
0 | 12-Month Percent Change | NaN | NaN | NaN |
1 | Series Id: CUUR0000SAF1 | NaN | NaN | NaN |
2 | Item: Food | NaN | NaN | NaN |
3 | Year | Jan | Feb | Mar |
4 | 2010 | -0.4 | -0.2 | 0.2 |
5 | 2011 | 1.8 | 2.3 | 2.9 |
Code used:
df1['View Description'] = df1.iat[0,0]
df1['Series ID'] = df1.iat[1,1]
df1['Series Name'] = df1.iat[2,1]
df1
Result:
NA NA.1 NA.2 NA.3 NA.4 NA.5 NA.6 NA.7 View Description Series ID Series Name
0 12-Month Percent Change NaN NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1 Food
1 Series Id: CUUR0000SAF1 NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1 Food
2 Item: Food NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1 Food
3 Year Jan Feb Mar Apr May Jun Jul 12-Month Percent Change CUUR0000SAF1 Food
4 2010 -0.4 -0.2 0.2 0.5 0.7 0.7 0.9 12-Month Percent Change CUUR0000SAF1 Food
5 2011 1.8 2.3 2.9 3.2 3.5 3.7 4.2 12-Month Percent Change CUUR0000SAF1 Food
6 2012 4.4 3.9 3.3 3.1 2.8 2.7 2.3 12-Month Percent Change CUUR0000SAF1 Food
7 2013 1.6 1.6 1.5 1.5 1.4 1.4 1.4 12-Month Percent Change CUUR0000SAF1 Food
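Finally, I want row 3 to become the header and all rows above it to be deleted, while still keeping the three new columns at the end: View Description, Series ID, and Series Name.
Any suggestions for an efficient way to do this would be appreciated. I would like to scale it up with a for loop, or some other method, so the same process can run on 10 files.
Thanks in advance!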
I think this does what you are asking:
# Parse and store the first 3 values in column 0 so that we can use them
# as values for 3 new columns later.
new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()]
# Transpose so that we can use set_index() to replace the index
# (the columns from the original df1) with row 3, ['Year', 'Jan', 'Feb', 'Mar'],
# then transpose back so that the new index becomes the columns.
df1 = df1.T.set_index(3).T
# Use reset_index() to replace the index with a fresh range
# index (0, 1, 2, ...) so we can use iloc[] to discard the
# first 3 unwanted rows, then call reset_index() again.
df1 = df1.reset_index(drop=True).iloc[3:].reset_index(drop=True)
# Get rid of vestigial name for columns.
df1.columns.names = [None]
# Add the three new columns set to the values saved earlier.
df1[['View Description', 'Series ID', 'Series Name']] = new_columns
Here is the full test case (the commented code above is condensed into fewer lines):
import pandas as pd
s = [
' | NA | NA_1 | NA_2 | NA_3 |',
'0 | 12-Month Percent Change | NaN | NaN | NaN |',
'1 | Series Id: CUUR0000SAF1 | NaN | NaN | NaN |',
'2 | Item: Food | NaN | NaN | NaN |',
'3 | Year | Jan | Feb | Mar |',
'4 | 2010 | -0.4 | -0.2 | 0.2 |',
'5 | 2011 | 1.8 | 2.3 | 2.9 |']
df1 = pd.DataFrame(
[[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]],
columns = [x.strip() for x in s[0].split('|')[1:-1]],
)
print(df1)
new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()]
df1 = df1.T.set_index(3).T.reset_index(drop=True).iloc[3:].reset_index(drop=True)
df1.columns.names = [None]
df1[['View Description', 'Series ID', 'Series Name']] = new_columns
print(df1)
Output:
NA NA_1 NA_2 NA_3
0 12-Month Percent Change NaN NaN NaN
1 Series Id: CUUR0000SAF1 NaN NaN NaN
2 Item: Food NaN NaN NaN
3 Year Jan Feb Mar
4 2010 -0.4 -0.2 0.2
5 2011 1.8 2.3 2.9
Year Jan Feb Mar View Description Series ID Series Name
0 2010 -0.4 -0.2 0.2 12-Month Percent Change CUUR0000SAF1 Food
1 2011 1.8 2.3 2.9 12-Month Percent Change CUUR0000SAF1 Food
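As a possible alternative to the double transpose, here is a minimal sketch that assigns the header row's values to `columns` directly and then slices off everything up to and including that row. The names `raw`, `body`, `meta`, `new_names`, and `header_row` are my own, and the frame is rebuilt here with string cells (as in the test case above) so the snippet runs on its own.

```python
import pandas as pd

# Same sample data as the test case above; every cell is a string.
raw = pd.DataFrame(
    [
        ["12-Month Percent Change", "NaN", "NaN", "NaN"],
        ["Series Id: CUUR0000SAF1", "NaN", "NaN", "NaN"],
        ["Item: Food", "NaN", "NaN", "NaN"],
        ["Year", "Jan", "Feb", "Mar"],
        ["2010", "-0.4", "-0.2", "0.2"],
        ["2011", "1.8", "2.3", "2.9"],
    ],
    columns=["NA", "NA_1", "NA_2", "NA_3"],
)

header_row = 3  # positional index of the row that should become the header

# Keep the text after the last ':' from each metadata cell in column 0.
meta = [str(v).split(":")[-1].strip() for v in raw.iloc[:header_row, 0]]

# Everything below the header row becomes the data; renumber the index.
body = raw.iloc[header_row + 1:].reset_index(drop=True)

# Use the header row's values as the column labels.
body.columns = raw.iloc[header_row].to_list()

# Broadcast each saved value down a new column.
new_names = ["View Description", "Series ID", "Series Name"]
for name, value in zip(new_names, meta):
    body[name] = value

print(body)
```

Because the column labels are assigned as a plain list, there is no leftover columns name to clear afterwards.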
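UPDATE: This code lets us configure (1) the cell coordinates of each of the 3 cells used for the new column values (new_col_coords) and (2) header_row, above which all rows are discarded: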
import pandas as pd
s = [
' | NA | NA_1 | NA_2 | NA_3 |',
'0 | 12-Month Percent Change | NaN | NaN | NaN |',
'91 | To be discarded | NaN | NaN | NaN |',
'1 | Series Id: CUUR0000SAF1 | Abc | NaN | NaN |',
'92 | To be discarded | NaN | NaN | NaN |',
'93 | To be discarded | NaN | NaN | NaN |',
'94 | To be discarded | NaN | NaN | NaN |',
'2 | Item: Food | Xyz | NaN | NaN |',
'95 | To be discarded | NaN | NaN | NaN |',
'96 | To be discarded | NaN | NaN | NaN |',
'97 | To be discarded | NaN | NaN | NaN |',
'98 | To be discarded | NaN | NaN | NaN |',
'3 | Year | Jan | Feb | Mar |',
'4 | 2010 | -0.4 | -0.2 | 0.2 |',
'5 | 2011 | 1.8 | 2.3 | 2.9 |']
df1 = pd.DataFrame(
[[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]],
columns = [x.strip() for x in s[0].split('|')[1:-1]],
)
print(df1)
# parse and store the 3 values at specified coordinates so that we can use them as values for 3 new columns later
new_col_coords = [[0,0], [2,1], [6,1]]
new_columns = [x.split(':')[-1].strip() for x in [df1.iloc[i, j] for i, j in new_col_coords]]
header_row = 11
# Here's how to do everything that follows in one line of code:
#df1 = df1.T.set_index(header_row).T.reset_index(drop=True).iloc[header_row:].reset_index(drop=True)
# Transpose so that we can use set_index() to change the index to the values in header_row, ['Year', 'Jan', 'Feb', 'Mar'], then transpose back so that index becomes the columns
df1 = df1.T.set_index(header_row).T
# Use reset_index() to replace the index with a fresh range index (0, 1, 2, ...) so we can use iloc[] to discard the unwanted rows above header_row, then call reset_index() again
df1 = df1.reset_index(drop=True).iloc[header_row:].reset_index(drop=True)
# Get rid of vestigial name for columns
df1.columns.names = [None]
# Add the three new columns set to the values saved earlier
df1[['View Description', 'Series ID', 'Series Name']] = new_columns
print(df1)
Output:
NA NA_1 NA_2 NA_3
0 12-Month Percent Change NaN NaN NaN
1 To be discarded NaN NaN NaN
2 Series Id: CUUR0000SAF1 Abc NaN NaN
3 To be discarded NaN NaN NaN
4 To be discarded NaN NaN NaN
5 To be discarded NaN NaN NaN
6 Item: Food Xyz NaN NaN
7 To be discarded NaN NaN NaN
8 To be discarded NaN NaN NaN
9 To be discarded NaN NaN NaN
10 To be discarded NaN NaN NaN
11 Year Jan Feb Mar
12 2010 -0.4 -0.2 0.2
13 2011 1.8 2.3 2.9
Year Jan Feb Mar View Description Series ID Series Name
0 2010 -0.4 -0.2 0.2 12-Month Percent Change Abc Xyz
1 2011 1.8 2.3 2.9 12-Month Percent Change Abc Xyz
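To scale this to several files with a for loop, as the question asks, the steps can be wrapped in a function and applied to each input in turn. The following is only a sketch under stated assumptions: the function name `tidy_report`, its parameters, and the `sources` dictionary are hypothetical; the inputs are assumed to be CSV files read with `header=None`; the second "file" contains made-up placeholder values; and `io.StringIO` stands in for files on disk so the example is self-contained.

```python
import io
import pandas as pd

def tidy_report(raw, header_row, meta_coords, meta_names):
    """Promote header_row to the header and attach metadata cells as columns."""
    # Read each metadata cell by position and keep the text after the last ':'.
    meta = [str(raw.iat[i, j]).split(":")[-1].strip() for i, j in meta_coords]
    # Rows below the header become the data.
    body = raw.iloc[header_row + 1:].reset_index(drop=True)
    body.columns = raw.iloc[header_row].to_list()
    for name, value in zip(meta_names, meta):
        body[name] = value
    return body

# Stand-ins for files on disk (assumption: CSV layout, placeholder second file).
sources = {
    "food.csv": (
        "12-Month Percent Change,,,\n"
        "Series Id: CUUR0000SAF1,,,\n"
        "Item: Food,,,\n"
        "Year,Jan,Feb,Mar\n"
        "2010,-0.4,-0.2,0.2\n"
        "2011,1.8,2.3,2.9\n"
    ),
    "example.csv": (
        "12-Month Percent Change,,,\n"
        "Series Id: EXAMPLE0001,,,\n"
        "Item: Example,,,\n"
        "Year,Jan,Feb,Mar\n"
        "2010,1.0,1.1,1.2\n"
        "2011,2.0,2.1,2.2\n"
    ),
}

frames = []
for name, text in sources.items():
    # With real files this would be pd.read_csv(path, header=None, dtype=str).
    raw = pd.read_csv(io.StringIO(text), header=None, dtype=str)
    frames.append(
        tidy_report(
            raw,
            header_row=3,
            meta_coords=[(0, 0), (1, 0), (2, 0)],
            meta_names=["View Description", "Series ID", "Series Name"],
        )
    )

# Stack the per-file results into one frame.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

If the files differ in layout, `header_row` and `meta_coords` can be passed per file instead of being fixed inside the loop.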