pandas wide_to_long vs (stack and melt) for data-frame transformation
I have a dataframe like the one shown below
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_ID': [1, 2, 3, 4, 5],
    'date_visit': ['1/1/2020', '3/3/2200', '13/11/2100', '24/05/2198', '30/03/2071'],
    'a11fever': ['Yes', 'No', 'Yes', 'Yes', 'No'],
    'a12diagage': [36, 34, 42, 40, np.nan],
    'a12diagyr': [2021, 3213, 2091, 4567, 8901],
    'a12diagyrago': [6, np.nan, 9, np.nan, np.nan]})
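I would like to transform the dataframe so that the output for one subject looks like the sample shown below
Though I am able to do this successfully using pd.melt and stack, I am unable to do the same using wide_to_long.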
pd.melt(df, id_vars=['subject_ID', 'date_visit'], value_vars=['a11fever', 'a12diagage', 'a12diagyr', 'a12diagyrago'])  # works fine
pd.wide_to_long(df, stubnames=['measurement', 'val'], i=['subject_ID', 'date_visit'], j='grp').sort_index(level=0)  # returns 0 records
df.set_index(['subject_ID', 'date_visit']).stack().reset_index()  # works fine
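My other question is:
a) Do we always have to mention all the column names that we would like to transform under the value_vars section of pd.melt? My real data will have more than 120 columns. So do I have to list all of them here one by one?
Could you help me with how to use wide_to_long?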
Do we always have to mention all the column names that we would like to transform under the value_vars section of pd.melt? My real data will have more than 120 columns. So do I have to mention all of them here?
No, that is not necessary. If the value_vars parameter is omitted, all columns not set as id_vars are used:
# Assign to a new name so the original wide df stays available for later calls.
melted = pd.melt(df, id_vars=['subject_ID', 'date_visit'])
print(melted)
    subject_ID  date_visit      variable  value
0            1    1/1/2020      a11fever    Yes
1            2    3/3/2200      a11fever     No
2            3  13/11/2100      a11fever    Yes
3            4  24/05/2198      a11fever    Yes
4            5  30/03/2071      a11fever     No
5            1    1/1/2020    a12diagage     36
6            2    3/3/2200    a12diagage     34
7            3  13/11/2100    a12diagage     42
8            4  24/05/2198    a12diagage     40
9            5  30/03/2071    a12diagage    NaN
10           1    1/1/2020     a12diagyr   2021
11           2    3/3/2200     a12diagyr   3213
12           3  13/11/2100     a12diagyr   2091
13           4  24/05/2198     a12diagyr   4567
14           5  30/03/2071     a12diagyr   8901
15           1    1/1/2020  a12diagyrago      6
16           2    3/3/2200  a12diagyrago    NaN
17           3  13/11/2100  a12diagyrago      9
18           4  24/05/2198  a12diagyrago    NaN
19           5  30/03/2071  a12diagyrago    NaN
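If only some of the 120+ columns should be melted, the list for value_vars can be built programmatically instead of typed by hand. A minimal sketch, assuming the wanted columns share the a12 prefix (that prefix and the names id_cols, value_cols, subset and everything are only illustrative, not part of the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_ID': [1, 2, 3, 4, 5],
    'date_visit': ['1/1/2020', '3/3/2200', '13/11/2100', '24/05/2198', '30/03/2071'],
    'a11fever': ['Yes', 'No', 'Yes', 'Yes', 'No'],
    'a12diagage': [36, 34, 42, 40, np.nan],
    'a12diagyr': [2021, 3213, 2091, 4567, 8901],
    'a12diagyrago': [6, np.nan, 9, np.nan, np.nan],
})

id_cols = ['subject_ID', 'date_visit']

# Collect the columns to melt by prefix instead of listing them one by one.
# The 'a12' prefix is an assumption made for this illustration.
value_cols = [c for c in df.columns if c.startswith('a12')]

# Melt only the selected columns.
subset = pd.melt(df, id_vars=id_cols, value_vars=value_cols)

# Omitting value_vars melts every column that is not listed in id_vars.
everything = pd.melt(df, id_vars=id_cols)
```

Here subset holds one row per subject for each of the three a12 columns, while everything also includes the a11fever rows.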
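This is not a use case for pd.wide_to_long, because it produces a different output from the one you want. It requires stubnames, and each stub becomes a column of its own (a11 & a12). See the example: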
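# df is the original wide dataframe from the question.
melt = pd.wide_to_long(df,
                       i=['subject_ID', 'date_visit'],
                       stubnames=['a11', 'a12'],
                       suffix=r'\D+',  # raw string: the suffix is one or more non-digit characters
                       j='fever_diag').reset_index()
print(melt)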
    subject_ID  date_visit  fever_diag  a11     a12
0            1    1/1/2020     diagage  NaN    36.0
1            1    1/1/2020      diagyr  NaN  2021.0
2            1    1/1/2020   diagyrago  NaN     6.0
3            1    1/1/2020       fever  Yes     NaN
4            2    3/3/2200     diagage  NaN    34.0
5            2    3/3/2200      diagyr  NaN  3213.0
6            2    3/3/2200   diagyrago  NaN     NaN
7            2    3/3/2200       fever   No     NaN
8            3  13/11/2100     diagage  NaN    42.0
9            3  13/11/2100      diagyr  NaN  2091.0
10           3  13/11/2100   diagyrago  NaN     9.0
11           3  13/11/2100       fever  Yes     NaN
12           4  24/05/2198     diagage  NaN    40.0
13           4  24/05/2198      diagyr  NaN  4567.0
14           4  24/05/2198   diagyrago  NaN     NaN
15           4  24/05/2198       fever  Yes     NaN
16           5  30/03/2071     diagage  NaN     NaN
17           5  30/03/2071      diagyr  NaN  8901.0
18           5  30/03/2071   diagyrago  NaN     NaN
19           5  30/03/2071       fever   No     NaN
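The attempt in the question found nothing to reshape, most likely because no column name begins with measurement or val. As far as I understand it, pd.wide_to_long picks columns whose names consist of a stub, then sep, then something matching the suffix regex (digits by default). The sketch below only approximates that selection with the standard re module; matching_columns is a hypothetical helper written for this illustration, not a pandas function:

```python
import re

columns = ['subject_ID', 'date_visit', 'a11fever',
           'a12diagage', 'a12diagyr', 'a12diagyrago']


def matching_columns(columns, stub, sep='', suffix=r'\d+'):
    """Approximate the column selection: stub, then sep, then suffix."""
    pattern = re.compile(rf'^{re.escape(stub)}{re.escape(sep)}{suffix}$')
    return [c for c in columns if pattern.match(c)]


# Stubs from the question with the default numeric suffix: nothing matches.
no_match = {s: matching_columns(columns, s) for s in ['measurement', 'val']}

# Stubs from the answer with a non-digit suffix: every measurement column matches.
match = {s: matching_columns(columns, s, suffix=r'\D+') for s in ['a11', 'a12']}

print(no_match)
print(match)
```

With the a11 and a12 stubs, the part after each stub (fever, diagage, diagyr, diagyrago) becomes the value of the j column, which is why the result has one column per stub rather than a single value column.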