How to do a MultiIndex pivot when index and values are in the same column?
I have this DataFrame:
regions = pd.read_html('http://www.mapsofworld.com/usa/usa-maps/united-states-regional-maps.html')
messy_regions = regions[8]
This produces the following:
| | 0 | 1 |
|---|---|---|
| 0 | Region 1 (The Northeast) | nan |
| 1 | Division 1 (New England) | Division 2 (Middle Atlantic) |
| 2 | Maine | New York |
| 3 | New Hampshire | Pennsylvania |
| 4 | Vermont | New Jersey |
| 5 | Massachusetts | nan |
| 6 | Rhode Island | nan |
| 7 | Connecticut | nan |
| 8 | Region 2 (The Midwest) | nan |
| 9 | Division 3 (East North Central) | Division 4 (West North Central) |
| 10 | Wisconsin | North Dakota |
| 11 | Michigan | South Dakota |
| 12 | Illinois | Nebraska |
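The goal is to turn this into a tidy DataFrame. I think I need to pivot it so that the regions and divisions become columns, with the states as values under the correct region/division. Once it is in that shape, I can melt it into the form I want. What I can't figure out is how to extract the header rows from the data. Any help is appreciated, even just a good pointer in the right direction.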
You can use:
import pandas as pd
url = 'http://www.mapsofworld.com/usa/usa-maps/united-states-regional-maps.html'
# input DataFrame with columns a, b
df = pd.read_html(url)[8]
df.columns = ['a', 'b']
# extract Region data to a new column
df['Region'] = df['a'].where(df['a'].str.contains('Region', na=False)).ffill()
# reshape, remove rows with NaNs, remove the helper column 'variable'
df = (pd.melt(df, id_vars='Region', value_name='Names')
        .sort_values(['Region', 'variable'])
        .dropna()
        .drop('variable', axis=1))
# extract Division data to a new column
df['Division'] = df['Names'].where(df['Names'].str.contains('Division', na=False)).ffill()
# drop the Region/Division header rows left in Names, then reorder the columns
# (reindex is used here because reindex_axis has been removed from pandas)
df = (df[(df.Division != df.Names) & (df.Region != df.Names)]
        .reset_index(drop=True)
        .reindex(columns=['Region', 'Division', 'Names']))
# temporarily display all columns
with pd.option_context('display.expand_frame_repr', False):
    print(df)
Region Division Names
0 Region 1 (The Northeast) Division 1 (New England) Maine
1 Region 1 (The Northeast) Division 1 (New England) New Hampshire
2 Region 1 (The Northeast) Division 1 (New England) Vermont
3 Region 1 (The Northeast) Division 1 (New England) Massachusetts
4 Region 1 (The Northeast) Division 1 (New England) Rhode Island
5 Region 1 (The Northeast) Division 1 (New England) Connecticut
6 Region 1 (The Northeast) Division 2 (Middle Atlantic) New York
7 Region 1 (The Northeast) Division 2 (Middle Atlantic) Pennsylvania
8 Region 1 (The Northeast) Division 2 (Middle Atlantic) New Jersey
9 Region 2 (The Midwest) Division 3 (East North Central) Wisconsin
10 Region 2 (The Midwest) Division 3 (East North Central) Michigan
11 Region 2 (The Midwest) Division 3 (East North Central) Illinois
12 Region 2 (The Midwest) Division 3 (East North Central) Indiana
13 Region 2 (The Midwest) Division 3 (East North Central) Ohio
...
...
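If the page is unavailable or its table index changes, the same steps can be tried offline. The following is a minimal sketch, assuming a hand-typed copy of the thirteen rows shown in the question instead of the scraped table; the names `messy` and `tidy_regions` are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the rows shown in the question (assumption: not fetched from the URL).
messy = pd.DataFrame({
    'a': ['Region 1 (The Northeast)', 'Division 1 (New England)', 'Maine',
          'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut',
          'Region 2 (The Midwest)', 'Division 3 (East North Central)',
          'Wisconsin', 'Michigan', 'Illinois'],
    'b': [np.nan, 'Division 2 (Middle Atlantic)', 'New York', 'Pennsylvania',
          'New Jersey', np.nan, np.nan, np.nan,
          np.nan, 'Division 4 (West North Central)',
          'North Dakota', 'South Dakota', 'Nebraska'],
})


def tidy_regions(df):
    """Hypothetical helper: one row per state, with Region and Division columns."""
    out = df.copy()
    # Rows in column 'a' that contain 'Region' are headers; fill them downwards.
    out['Region'] = out['a'].where(out['a'].str.contains('Region', na=False)).ffill()
    # Stack columns 'a' and 'b' into a single 'Names' column, one Region at a time.
    out = (out.melt(id_vars='Region', value_name='Names')
              .sort_values(['Region', 'variable'])
              .dropna(subset=['Names'])
              .drop(columns='variable'))
    # Rows that contain 'Division' are headers for the states that follow them.
    is_division = out['Names'].str.contains('Division', na=False)
    out['Division'] = out['Names'].where(is_division).ffill()
    # Drop the header rows themselves so only states remain.
    keep = (out['Names'] != out['Division']) & (out['Names'] != out['Region'])
    return out.loc[keep, ['Region', 'Division', 'Names']].reset_index(drop=True)


tidy = tidy_regions(messy)
print(tidy)
```

Sorting by `Region` and then `variable` is what puts each division header directly above its own states, so the forward fill on `Division` only works after that sort.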