Reshaping csv with Pandas: joining two subsets of df
My .csv looks like this:
Res X XB XC O P
A312 76.55 - - - -
B313 175.4 62.28 32.62 8.189 121.2
J314 176.5 53.34 40.77 8.277 124.6
L315 177.9 55.29 41.44 8.427 125.5
T316 174.7 59.47 63.43 8.264 116.1
...
G378 10.2 58.91 40.13 7.646 126.7
I would like to reshape it like this:
312 A X 76.55
313 B X 175.4
313 B XB 62.28
313 B XC 32.62
...
378 G O 7.646
378 G P 126.7
import pandas as pd
df1 = pd.read_csv("my_file.csv", delim_whitespace = True, index_col = False, na_values = "-")
df2 = pd.read_csv("my_file.csv", delim_whitespace = True, index_col = False, na_values = "-")
df1['Pos'] = df1['Res'].str[1:].astype(int)
df1['AA'] = df1['Res'].str[0]
df2.drop('Res', axis = 1, inplace = True)
a = df2.stack(level = -1)
b = df1[["Pos", "AA"]]
print(a)
print(b)
This produces:
Output of print(a):
0 X 76.550
1 X 175.400
XB 62.280
XC 32.620
O 8.189
P 121.200
...
62 X 10.200
XB 58.910
XC 40.130
O 7.646
P 126.700
Output of print(b):
0 312 A
1 313 B
2 314 J
3 315 L
...
62 378 G
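Any ideas on how to accomplish the last step, i.e. joining these two DataFrames, a and b, so that I end up with the format I want? I have tried several pandas functions, such as pd.merge, pd.join and pd.concat. None of them seemed to work...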
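For reference, the join the question is after can be sketched as follows. This is a minimal sketch, not the asker's code: the inline sample is a hypothetical stand-in for my_file.csv (first two rows only), the names "row", "cat", "value" and "joined" are made up for illustration, and sep=r"\s+" is used in place of delim_whitespace, which recent pandas releases flag as deprecated. The idea is that the first index level of a is the original row number, which is exactly the index of b.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for my_file.csv.
sample = io.StringIO(
    "Res X XB XC O P\n"
    "A312 76.55 - - - -\n"
    "B313 175.4 62.28 32.62 8.189 121.2\n"
)
df1 = pd.read_csv(sample, sep=r"\s+", index_col=False, na_values="-")

# b: residue number and letter, indexed by the original row number.
b = pd.DataFrame({"Pos": df1["Res"].str[1:].astype(int),
                  "AA": df1["Res"].str[0]})

# a: one value per (row number, column name); dropna() removes the
# "-" placeholders that were read as NaN.
a = df1.drop(columns="Res").stack().dropna()

# Turn the two index levels into ordinary columns named "row" and "cat".
a = a.rename_axis(["row", "cat"]).reset_index(name="value")

# Look up Pos and AA for every stacked value via its original row number.
joined = a.join(b, on="row")[["Pos", "AA", "cat", "value"]]
print(joined)
```

The answers below reach the same shape more directly, without building a and b separately.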
You want melt:
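import pandas as pd
# na_values="-" turns the "-" placeholders into NaN so that dropna() can remove them
df = pd.read_csv("my_file.csv", delim_whitespace=True, index_col=False, na_values="-")
# keep only the leading residue letter
df['Res'] = df['Res'].str[0]
# wide to long: one row per (Res, variable) pair
reshaped = df.melt('Res', ['X', 'XB', 'XC', 'O', 'P'])
# sort by both keys so the row order is deterministic
print(reshaped.dropna().sort_values(['Res', 'variable']).reset_index(drop=True))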
Output:
Res variable value
0 A X 76.55
1 B O 8.189
2 B P 121.2
3 B X 175.4
4 B XB 62.28
5 B XC 32.62
6 J O 8.277
7 J P 124.6
8 J X 176.5
9 J XB 53.34
10 J XC 40.77
11 L O 8.427
12 L P 125.5
13 L X 177.9
14 L XB 55.29
15 L XC 41.44
16 T O 8.264
17 T P 116.1
18 T X 174.7
19 T XB 59.47
20 T XC 63.43
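Note that the output above keeps only the residue letter, while the requested format also has the residue number. A minimal sketch of a melt variant that keeps both is shown here; the inline sample is a hypothetical stand-in for my_file.csv, and the column names "cat" and "value" as well as the variable names are chosen for illustration only.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for my_file.csv.
sample = io.StringIO(
    "Res X XB XC O P\n"
    "A312 76.55 - - - -\n"
    "B313 175.4 62.28 32.62 8.189 121.2\n"
    "J314 176.5 53.34 40.77 8.277 124.6\n"
)
df = pd.read_csv(sample, sep=r"\s+", index_col=False, na_values="-")

# Split e.g. "A312" into the letter (AA) and the number (Pos).
parts = df.pop("Res").str.extract(r"^(?P<AA>[A-Za-z])(?P<Pos>\d+)$")
df.insert(0, "Pos", parts["Pos"].astype(int))
df.insert(1, "AA", parts["AA"])

long_df = (
    df.melt(id_vars=["Pos", "AA"], var_name="cat", value_name="value")
      .dropna(subset=["value"])          # drop the "-" placeholders
      .sort_values("Pos", kind="stable") # keep X, XB, XC, O, P order per residue
      .reset_index(drop=True)
)
print(long_df)
```

The stable sort matters here: melt emits all X rows first, then all XB rows, and so on, so a stable sort on Pos regroups the rows by residue without disturbing the original column order within each residue.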
I changed your solution a bit: first use DataFrame.pop to extract the column, so that df1.drop('Res', axis = 1, inplace = True) is not necessary; then create a MultiIndex with DataFrame.set_index and call DataFrame.stack; finally, clean up the data with reset_index and rename:
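import pandas as pd
df1 = pd.read_csv("my_file.csv", delim_whitespace = True, index_col = False, na_values = "-")
# residue number, e.g. 312 from "A312"
df1['Pos'] = df1['Res'].str[1:].astype(int)
# pop removes 'Res' from df1 and returns it, so no separate drop is needed
df1['AA'] = df1.pop('Res').str[0]
# move Pos and AA into a MultiIndex, stack the remaining columns into rows,
# then turn the index levels back into ordinary columns
df = (df1.set_index(['Pos', 'AA'])
         .stack()
         .reset_index(name='new')
         .rename(columns={'level_2':'cat'}))
print(df)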
Pos AA cat new
0 312 A X 76.550
1 313 B X 175.400
2 313 B XB 62.280
3 313 B XC 32.620
4 313 B O 8.189
5 313 B P 121.200
6 314 J X 176.500
7 314 J XB 53.340
8 314 J XC 40.770
9 314 J O 8.277
10 314 J P 124.600
11 315 L X 177.900
12 315 L XB 55.290
13 315 L XC 41.440
14 315 L O 8.427
15 315 L P 125.500
16 316 T X 174.700
17 316 T XB 59.470
18 316 T XC 63.430
19 316 T O 8.264
20 316 T P 116.100
21 378 G X 10.200
22 378 G XB 58.910
23 378 G XC 40.130
24 378 G O 7.646
25 378 G P 126.700
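The single row for residue 312 in the output above relies on stack() discarding missing values. Depending on the pandas version, stack() may instead keep those rows, so a slightly more defensive sketch drops them explicitly and names the new index level up front rather than renaming 'level_2' afterwards. The inline sample is a hypothetical stand-in for my_file.csv, and the name "tidy" is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for my_file.csv.
sample = io.StringIO(
    "Res X XB XC O P\n"
    "A312 76.55 - - - -\n"
    "B313 175.4 62.28 32.62 8.189 121.2\n"
)
df1 = pd.read_csv(sample, sep=r"\s+", index_col=False, na_values="-")
df1["Pos"] = df1["Res"].str[1:].astype(int)
df1["AA"] = df1.pop("Res").str[0]

tidy = (
    df1.set_index(["Pos", "AA"])
       .stack()
       .dropna()                           # explicit, whatever stack() does with NaN
       .rename_axis(["Pos", "AA", "cat"])  # name the stacked level directly
       .reset_index(name="new")
)
print(tidy)
```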