Denormalize/RestructureCDISC补充资料
Denormalize/Restructure CDISC supplement data
我有一个如下所示的数据框:
USUBJID IDVAR IDVARVAL QNAM QVAL
0 Dummy-01-0001 AESEQ 1.0 AEdummy1 2012-02-15
1 Dummy-01-0002 AESEQ 1.0 AEdummy1 2012-02-23
2 Dummy-01-0004 AESEQ 1.0 AEdummy1 2012-02-06
3 Dummy-01-0004 AESEQ 2.0 AEdummy1 2012-03-10
4 Dummy-01-0005 AESEQ 1.0 AEdummy1 2012-03-10
5 Dummy-01-0001 AESPID 1 dummy2AE Gastrointestinal disorders
6 Dummy-01-0002 AESPID 1 dummy2AE Nervous system disorder
7 Dummy-01-0004 AESPID 2 dummy2AE Gastrointestinal disorders
8 Dummy-01-0004 AESPID 1 dummy2AE Nervous system disorder
9 Dummy-01-0005 AESPID 1 dummy2AE Gastrointestinal disorders
以上数据框只是一个示例,请使用以下代码将更多数据放入数据框。
df = pd.DataFrame({'USUBJID': {0: 'Dummy-01-0001', 1: 'Dummy-01-0002', 2: 'Dummy-01-0004', 3: 'Dummy-01-0004', 4: 'Dummy-01-0005', 5: 'Dummy-01-0007', 6: 'Dummy-01-0008', 7: 'Dummy-01-0008', 8: 'Dummy-01-0010', 9: 'Dummy-01-0010', 10: 'Dummy-01-0011', 11: 'Dummy-01-0011', 12: 'Dummy-01-0013', 13: 'Dummy-01-0013', 14: 'Dummy-01-0014', 15: 'Dummy-01-0016', 16: 'Dummy-01-0016', 17: 'Dummy-01-0019', 18: 'Dummy-01-0001', 19: 'Dummy-01-0002', 20: 'Dummy-01-0004', 21: 'Dummy-01-0004', 22: 'Dummy-01-0005', 23: 'Dummy-01-0007', 24: 'Dummy-01-0008', 25: 'Dummy-01-0008', 26: 'Dummy-01-0010', 27: 'Dummy-01-0010', 28: 'Dummy-01-0011', 29: 'Dummy-01-0011', 30: 'Dummy-01-0013', 31: 'Dummy-01-0013', 32: 'Dummy-01-0014', 33: 'Dummy-01-0016', 34: 'Dummy-01-0016', 35: 'Dummy-01-0017', 36: 'Dummy-01-0017', 37: 'Dummy-01-0019'}, 'IDVAR': {0: 'AESEQ', 1: 'AESEQ', 2: 'AESEQ', 3: 'AESEQ', 4: 'AESEQ', 5: 'AESEQ', 6: 'AESEQ', 7: 'AESEQ', 8: 'AESEQ', 9: 'AESEQ', 10: 'AESEQ', 11: 'AESEQ', 12: 'AESEQ', 13: 'AESEQ', 14: 'AESEQ', 15: 'AESEQ', 16: 'AESEQ', 17: 'AESEQ', 18: 'AESPID', 19: 'AESPID', 20: 'AESPID', 21: 'AESPID', 22: 'AESPID', 23: 'AESPID', 24: 'AESPID', 25: 'AESPID', 26: 'AESPID', 27: 'AESPID', 28: 'AESPID', 29: 'AESPID', 30: 'AESPID', 31: 'AESPID', 32: 'AESPID', 33: 'AESPID', 34: 'AESPID', 35: 'AESPID', 36: 'AESPID', 37: 'AESPID'}, 'IDVARVAL': {0: '1.0', 1: '1.0', 2: '1.0', 3: '2.0', 4: '1.0', 5: '1.0', 6: '1.0', 7: '2.0', 8: '1.0', 9: '2.0', 10: '1.0', 11: '2.0', 12: '1.0', 13: '2.0', 14: '1.0', 15: '1.0', 16: '2.0', 17: '1.0', 18: '1', 19: '1', 20: '2', 21: '1', 22: '1', 23: '1', 24: '1', 25: '2', 26: '2', 27: '1', 28: '1', 29: '2', 30: '1', 31: '2', 32: '1', 33: '1', 34: '2', 35: '1', 36: '2', 37: '1'}, 'QNAM': {0: 'AEdummy1', 1: 'AEdummy1', 2: 'AEdummy1', 3: 'AEdummy1', 4: 'AEdummy1', 5: 'AEdummy1', 6: 'AEdummy1', 7: 'AEdummy1', 8: 'AEdummy1', 9: 'AEdummy1', 10: 'AEdummy1', 11: 'AEdummy1', 12: 'AEdummy1', 13: 'AEdummy1', 14: 'AEdummy1', 15: 'AEdummy1', 16: 'AEdummy1', 17: 'AEdummy1', 18: 'dummy2AE', 19: 'dummy2AE', 20: 'dummy2AE', 21: 'dummy2AE', 22: 'dummy2AE', 23: 'dummy2AE', 24: 'dummy2AE', 25: 'dummy2AE', 26: 'dummy2AE', 27: 'dummy2AE', 28: 'dummy2AE', 29: 'dummy2AE', 30: 'dummy2AE', 31: 'dummy2AE', 32: 'dummy2AE', 33: 'dummy2AE', 34: 'dummy2AE', 35: 'dummy2AE', 36: 'dummy2AE', 37: 'dummy2AE'}, 'QVAL': {0: '2012-02-15', 1: '2012-02-23', 2: '2012-02-06', 3: '2012-03-10', 4: '2012-03-10', 5: '2012-02-08', 6: '2012-03-18', 7: '2012-03-07', 8: '2012-02-01', 9: '2012-01-10', 10: '2012-01-19', 11: '2012-03-28', 12: '2012-02-19', 13: '2012-02-14', 14: '2012-03-13', 15: '2012-03-08', 16: '2012-02-05', 17: '2012-03-18', 18: 'Gastrointestinal disorders', 19: 'Nervous system disorder', 20: 'Gastrointestinal disorders', 21: 'Nervous system disorder', 22: 'Gastrointestinal disorders', 23: 'Vascular disorders', 24: 'Gastrointestinal disorders', 25: 'Vascular disorders', 26: 'Nervous system disorder', 27: 'Gastrointestinal disorders', 28: 'Nervous system disorder', 29: 'Nervous system disorder', 30: 'Nervous system disorder', 31: 'Gastrointestinal disorders', 32: 'Gastrointestinal disorders', 33: 'Vascular disorders', 34: 'Gastrointestinal disorders', 35: 'Nervous system disorder', 36: 'Nervous system disorder', 37: 'Vascular disorders'}})
让我们稍微了解一下结构。 IDVAR
保存变量名,IDVARVAL
保存它的值,QNAM
保存另一个变量名,QVAL
保存相应的值。
这种结构,在我工作的领域被称为规范化结构,虽然我知道它很奇怪。
我想获取以下形式的数据:
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1 2012-03-10 Gastrointestinal disorders
如果我只需要为 QNAM
和 QVAL
做这件事,没有问题,我可以使用 pandas' pivot 轻松完成,如下所示:
df.pivot(
index=['USUBJID', 'IDVARVAL', 'IDVAR'],
columns='QNAM',
values='QVAL'
).reset_index()
即使对于这个问题,我也可以使用某种掩码,但我知道这不会是最有效的方法,因为该数据将包含数千条记录,有时甚至是数百万条记录。
请注意:IDVAR
的 USUBJID
和 IDVARVAL
的组合应映射到 QNAM
的 QVAL
的值.也就是说,AESEQ
、AESPID
、AEdummy1
和 dummy2AE
的一些记录对于输出 DataFrame 可能是 NaN 或空的。也就是说,对于上面的样本,Dummy-01-0001
和AESEQ=1.0
唯一标识了AEdummy1 = 2012-02-15
.
让我们尝试使用 unstacking
然后是 concat
:
s = df.set_index([df.groupby('IDVAR').cumcount(), 'USUBJID'])
s1 = s.set_index('IDVAR', append=True)['IDVARVAL'].unstack()
s2 = s.set_index('QNAM', append=True)['QVAL'].unstack()
out = pd.concat([s1, s2], axis=1).reset_index(level=1)
解释
在 IDVAR
上对数据帧进行分组并使用 cumcount
创建一个顺序计数器以唯一标识每个 IDVAR
组的行,然后将此计数器与列一起设置 USUBJID
作为数据帧的索引:
>>> s
IDVAR IDVARVAL QNAM QVAL
USUBJID
0 Dummy-01-0001 AESEQ 1.0 AEdummy1 2012-02-15
1 Dummy-01-0002 AESEQ 1.0 AEdummy1 2012-02-23
2 Dummy-01-0004 AESEQ 1.0 AEdummy1 2012-02-06
3 Dummy-01-0004 AESEQ 2.0 AEdummy1 2012-03-10
4 Dummy-01-0005 AESEQ 1.0 AEdummy1 2012-03-10
0 Dummy-01-0001 AESPID 1.0 dummy2AE Gastrointestinal disorders
1 Dummy-01-0002 AESPID 1.0 dummy2AE Nervous system disorder
2 Dummy-01-0004 AESPID 2.0 dummy2AE Gastrointestinal disorders
3 Dummy-01-0004 AESPID 1.0 dummy2AE Nervous system disorder
4 Dummy-01-0005 AESPID 1.0 dummy2AE Gastrointestinal disorders
现在对于每个列 IDVARVAL
和 QVAL
通过附加相应的列更新索引 IDVAR
和 QNAM
然后 unstack
重塑:
>>> s1
IDVAR AESEQ AESPID
USUBJID
0 Dummy-01-0001 1.0 1.0
1 Dummy-01-0002 1.0 1.0
2 Dummy-01-0004 1.0 2.0
3 Dummy-01-0004 2.0 1.0
4 Dummy-01-0005 1.0 1.0
>>> s2
QNAM AEdummy1 dummy2AE
USUBJID
0 Dummy-01-0001 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 2012-02-23 Nervous system disorder
2 Dummy-01-0004 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2012-03-10 Nervous system disorder
4 Dummy-01-0005 2012-03-10 Gastrointestinal disorders
最后,concat
上面的未堆叠帧 s1
和 s2
沿着 axis=1
得到想要的结果:
>>> out
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1.0 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1.0 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2.0 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1.0 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1.0 2012-03-10 Gastrointestinal disorders
您可以使用 pivot_longer from pyjanitor, create a counter with cumcount to create a unique index, pivot 数据重塑数据,并进行一些最终清理以获得与预期输出类似的结果:
(
df.pivot_longer(["USUBJID"],
names_pattern=["IDVAR$|QNAM$", "IDVARVAL$|QVAL$"],
names_to=["variables", "values"])
.assign(counter=lambda df: df.groupby(["USUBJID", "variables"]).cumcount())
.pivot(["USUBJID", "counter"], "variables", "values")
.droplevel("counter")
.rename_axis(columns=None)
.reset_index()
)
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1.0 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1.0 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2.0 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1.0 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1.0 2012-03-10 Gastrointestinal disorders
我有一个如下所示的数据框:
USUBJID IDVAR IDVARVAL QNAM QVAL
0 Dummy-01-0001 AESEQ 1.0 AEdummy1 2012-02-15
1 Dummy-01-0002 AESEQ 1.0 AEdummy1 2012-02-23
2 Dummy-01-0004 AESEQ 1.0 AEdummy1 2012-02-06
3 Dummy-01-0004 AESEQ 2.0 AEdummy1 2012-03-10
4 Dummy-01-0005 AESEQ 1.0 AEdummy1 2012-03-10
5 Dummy-01-0001 AESPID 1 dummy2AE Gastrointestinal disorders
6 Dummy-01-0002 AESPID 1 dummy2AE Nervous system disorder
7 Dummy-01-0004 AESPID 2 dummy2AE Gastrointestinal disorders
8 Dummy-01-0004 AESPID 1 dummy2AE Nervous system disorder
9 Dummy-01-0005 AESPID 1 dummy2AE Gastrointestinal disorders
以上数据框只是一个示例,请使用以下代码将更多数据放入数据框。
df = pd.DataFrame({'USUBJID': {0: 'Dummy-01-0001', 1: 'Dummy-01-0002', 2: 'Dummy-01-0004', 3: 'Dummy-01-0004', 4: 'Dummy-01-0005', 5: 'Dummy-01-0007', 6: 'Dummy-01-0008', 7: 'Dummy-01-0008', 8: 'Dummy-01-0010', 9: 'Dummy-01-0010', 10: 'Dummy-01-0011', 11: 'Dummy-01-0011', 12: 'Dummy-01-0013', 13: 'Dummy-01-0013', 14: 'Dummy-01-0014', 15: 'Dummy-01-0016', 16: 'Dummy-01-0016', 17: 'Dummy-01-0019', 18: 'Dummy-01-0001', 19: 'Dummy-01-0002', 20: 'Dummy-01-0004', 21: 'Dummy-01-0004', 22: 'Dummy-01-0005', 23: 'Dummy-01-0007', 24: 'Dummy-01-0008', 25: 'Dummy-01-0008', 26: 'Dummy-01-0010', 27: 'Dummy-01-0010', 28: 'Dummy-01-0011', 29: 'Dummy-01-0011', 30: 'Dummy-01-0013', 31: 'Dummy-01-0013', 32: 'Dummy-01-0014', 33: 'Dummy-01-0016', 34: 'Dummy-01-0016', 35: 'Dummy-01-0017', 36: 'Dummy-01-0017', 37: 'Dummy-01-0019'}, 'IDVAR': {0: 'AESEQ', 1: 'AESEQ', 2: 'AESEQ', 3: 'AESEQ', 4: 'AESEQ', 5: 'AESEQ', 6: 'AESEQ', 7: 'AESEQ', 8: 'AESEQ', 9: 'AESEQ', 10: 'AESEQ', 11: 'AESEQ', 12: 'AESEQ', 13: 'AESEQ', 14: 'AESEQ', 15: 'AESEQ', 16: 'AESEQ', 17: 'AESEQ', 18: 'AESPID', 19: 'AESPID', 20: 'AESPID', 21: 'AESPID', 22: 'AESPID', 23: 'AESPID', 24: 'AESPID', 25: 'AESPID', 26: 'AESPID', 27: 'AESPID', 28: 'AESPID', 29: 'AESPID', 30: 'AESPID', 31: 'AESPID', 32: 'AESPID', 33: 'AESPID', 34: 'AESPID', 35: 'AESPID', 36: 'AESPID', 37: 'AESPID'}, 'IDVARVAL': {0: '1.0', 1: '1.0', 2: '1.0', 3: '2.0', 4: '1.0', 5: '1.0', 6: '1.0', 7: '2.0', 8: '1.0', 9: '2.0', 10: '1.0', 11: '2.0', 12: '1.0', 13: '2.0', 14: '1.0', 15: '1.0', 16: '2.0', 17: '1.0', 18: '1', 19: '1', 20: '2', 21: '1', 22: '1', 23: '1', 24: '1', 25: '2', 26: '2', 27: '1', 28: '1', 29: '2', 30: '1', 31: '2', 32: '1', 33: '1', 34: '2', 35: '1', 36: '2', 37: '1'}, 'QNAM': {0: 'AEdummy1', 1: 'AEdummy1', 2: 'AEdummy1', 3: 'AEdummy1', 4: 'AEdummy1', 5: 'AEdummy1', 6: 'AEdummy1', 7: 'AEdummy1', 8: 'AEdummy1', 9: 'AEdummy1', 10: 'AEdummy1', 11: 'AEdummy1', 12: 'AEdummy1', 13: 'AEdummy1', 14: 'AEdummy1', 15: 'AEdummy1', 16: 'AEdummy1', 17: 'AEdummy1', 18: 'dummy2AE', 19: 'dummy2AE', 20: 'dummy2AE', 21: 'dummy2AE', 22: 'dummy2AE', 23: 'dummy2AE', 24: 'dummy2AE', 25: 'dummy2AE', 26: 'dummy2AE', 27: 'dummy2AE', 28: 'dummy2AE', 29: 'dummy2AE', 30: 'dummy2AE', 31: 'dummy2AE', 32: 'dummy2AE', 33: 'dummy2AE', 34: 'dummy2AE', 35: 'dummy2AE', 36: 'dummy2AE', 37: 'dummy2AE'}, 'QVAL': {0: '2012-02-15', 1: '2012-02-23', 2: '2012-02-06', 3: '2012-03-10', 4: '2012-03-10', 5: '2012-02-08', 6: '2012-03-18', 7: '2012-03-07', 8: '2012-02-01', 9: '2012-01-10', 10: '2012-01-19', 11: '2012-03-28', 12: '2012-02-19', 13: '2012-02-14', 14: '2012-03-13', 15: '2012-03-08', 16: '2012-02-05', 17: '2012-03-18', 18: 'Gastrointestinal disorders', 19: 'Nervous system disorder', 20: 'Gastrointestinal disorders', 21: 'Nervous system disorder', 22: 'Gastrointestinal disorders', 23: 'Vascular disorders', 24: 'Gastrointestinal disorders', 25: 'Vascular disorders', 26: 'Nervous system disorder', 27: 'Gastrointestinal disorders', 28: 'Nervous system disorder', 29: 'Nervous system disorder', 30: 'Nervous system disorder', 31: 'Gastrointestinal disorders', 32: 'Gastrointestinal disorders', 33: 'Vascular disorders', 34: 'Gastrointestinal disorders', 35: 'Nervous system disorder', 36: 'Nervous system disorder', 37: 'Vascular disorders'}})
让我们稍微了解一下结构。 IDVAR
保存变量名,IDVARVAL
保存它的值,QNAM
保存另一个变量名,QVAL
保存相应的值。
这种结构,在我工作的领域被称为规范化结构,虽然我知道它很奇怪。
我想获取以下形式的数据:
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1 2012-03-10 Gastrointestinal disorders
如果我只需要为 QNAM
和 QVAL
做这件事,没有问题,我可以使用 pandas' pivot 轻松完成,如下所示:
df.pivot(
index=['USUBJID', 'IDVARVAL', 'IDVAR'],
columns='QNAM',
values='QVAL'
).reset_index()
即使对于这个问题,我也可以使用某种掩码,但我知道这不会是最有效的方法,因为该数据将包含数千条记录,有时甚至是数百万条记录。
请注意:IDVAR
的 USUBJID
和 IDVARVAL
的组合应映射到 QNAM
的 QVAL
的值.也就是说,AESEQ
、AESPID
、AEdummy1
和 dummy2AE
的一些记录对于输出 DataFrame 可能是 NaN 或空的。也就是说,对于上面的样本,Dummy-01-0001
和AESEQ=1.0
唯一标识了AEdummy1 = 2012-02-15
.
让我们尝试使用 unstacking
然后是 concat
:
s = df.set_index([df.groupby('IDVAR').cumcount(), 'USUBJID'])
s1 = s.set_index('IDVAR', append=True)['IDVARVAL'].unstack()
s2 = s.set_index('QNAM', append=True)['QVAL'].unstack()
out = pd.concat([s1, s2], axis=1).reset_index(level=1)
解释
在 IDVAR
上对数据帧进行分组并使用 cumcount
创建一个顺序计数器以唯一标识每个 IDVAR
组的行,然后将此计数器与列一起设置 USUBJID
作为数据帧的索引:
>>> s
IDVAR IDVARVAL QNAM QVAL
USUBJID
0 Dummy-01-0001 AESEQ 1.0 AEdummy1 2012-02-15
1 Dummy-01-0002 AESEQ 1.0 AEdummy1 2012-02-23
2 Dummy-01-0004 AESEQ 1.0 AEdummy1 2012-02-06
3 Dummy-01-0004 AESEQ 2.0 AEdummy1 2012-03-10
4 Dummy-01-0005 AESEQ 1.0 AEdummy1 2012-03-10
0 Dummy-01-0001 AESPID 1.0 dummy2AE Gastrointestinal disorders
1 Dummy-01-0002 AESPID 1.0 dummy2AE Nervous system disorder
2 Dummy-01-0004 AESPID 2.0 dummy2AE Gastrointestinal disorders
3 Dummy-01-0004 AESPID 1.0 dummy2AE Nervous system disorder
4 Dummy-01-0005 AESPID 1.0 dummy2AE Gastrointestinal disorders
现在对于每个列 IDVARVAL
和 QVAL
通过附加相应的列更新索引 IDVAR
和 QNAM
然后 unstack
重塑:
>>> s1
IDVAR AESEQ AESPID
USUBJID
0 Dummy-01-0001 1.0 1.0
1 Dummy-01-0002 1.0 1.0
2 Dummy-01-0004 1.0 2.0
3 Dummy-01-0004 2.0 1.0
4 Dummy-01-0005 1.0 1.0
>>> s2
QNAM AEdummy1 dummy2AE
USUBJID
0 Dummy-01-0001 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 2012-02-23 Nervous system disorder
2 Dummy-01-0004 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2012-03-10 Nervous system disorder
4 Dummy-01-0005 2012-03-10 Gastrointestinal disorders
最后,concat
上面的未堆叠帧 s1
和 s2
沿着 axis=1
得到想要的结果:
>>> out
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1.0 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1.0 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2.0 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1.0 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1.0 2012-03-10 Gastrointestinal disorders
您可以使用 pivot_longer from pyjanitor, create a counter with cumcount to create a unique index, pivot 数据重塑数据,并进行一些最终清理以获得与预期输出类似的结果:
(
df.pivot_longer(["USUBJID"],
names_pattern=["IDVAR$|QNAM$", "IDVARVAL$|QVAL$"],
names_to=["variables", "values"])
.assign(counter=lambda df: df.groupby(["USUBJID", "variables"]).cumcount())
.pivot(["USUBJID", "counter"], "variables", "values")
.droplevel("counter")
.rename_axis(columns=None)
.reset_index()
)
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1.0 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1.0 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2.0 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1.0 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1.0 2012-03-10 Gastrointestinal disorders