Pandas: how to create a running count column?
I have a flat text file in the following format (I added the column headers):
CASE  Diagnosis
S1    no diagnosis
S2    fungus
      squamous lesion
S3    fungus
S4    squamous lesion
      glandular lesion
      atypia
I want to stack and unstack the cases that have multiple diagnoses, so I would like
CASE  DxN  Diagnosis
S1    A    no diagnosis
S2    A    fungus
      B    squamous lesion
S3    A    fungus
S4    A    squamous lesion
      B    glandular lesion
      C    atypia
and
CASE  A                B                 C
S1    no diagnosis
S2    fungus           squamous lesion
S3    fungus
S4    squamous lesion  glandular lesion  atypia
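How do I create that sub-series DxN? The count should not go past F. Even though there are 10,000 possible answers, no case will have more than 6 of them, so there will be no more than 6 columns. I only want "What is diagnosis A for case S1, what's diagnosis B for case S1, what's diagnosis 3 for case S1?" I don't want a column for every possible answer.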
You can create a column that holds a running count of the diagnoses within each case. For details, see this post: SQL-like window functions in PANDAS: Row Numbering in Python Pandas Dataframe
Using this sample data:
import pandas as pd

df = pd.DataFrame([
    {'Case': 'S1', 'Diagnosis': 'no diagnosis'},
    {'Case': 'S2', 'Diagnosis': 'fungus'},
    {'Case': 'S2', 'Diagnosis': 'squamous lesion'}
])
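This line gives you the running count:
# cumcount() numbers the rows of each group 0, 1, 2, ... in their existing order,
# so no prior sort is needed; + 1 makes the count start at 1
df['DxN'] = df.groupby('Case').cumcount() + 1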
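Since the question asks for letters A to F rather than numbers, the count can be translated afterwards. The following is only a sketch: the `labels` lookup and the six-per-case check are my own additions, and the sample rows are copied from the question.

```python
import string

import pandas as pd

df = pd.DataFrame({
    'Case': ['S1', 'S2', 'S2', 'S3', 'S4', 'S4', 'S4'],
    'Diagnosis': ['no diagnosis', 'fungus', 'squamous lesion', 'fungus',
                  'squamous lesion', 'glandular lesion', 'atypia'],
})

# 0-based position of each diagnosis within its case
position = df.groupby('Case').cumcount()

# The question says no case has more than 6 diagnoses (A..F); fail loudly otherwise
if position.max() >= 6:
    raise ValueError('a case has more than 6 diagnoses')

# Hypothetical lookup table: 0 -> 'A', 1 -> 'B', ...
labels = dict(enumerate(string.ascii_uppercase))
df['DxN'] = position.map(labels)

print(df)
```

Using a lookup keeps the numeric count available if you later need to sort or filter on it.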
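Is this what you need?
# Blank cells become NaN, then inherit the value from the row above
df = df.replace('', np.nan).ffill()
# Number the diagnoses within each case and spread those numbers into columns
df.assign(DxN=df.groupby('CASE').cumcount()).set_index(['CASE', 'DxN']).Diagnosis.unstack(fill_value='')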
Out[709]:
DxN               0                1
CASE
S1      nodiagnosis
S2           fungus   squamouslesion
S3           fungus
S4   squamouslesion  glandularlesion
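To get the lettered columns from the question instead of 0 and 1, the same forward-fill and unstack steps can be combined with a letter lookup. This is a sketch under assumptions: the frame below is typed in by hand from the question (including the atypia row), `labels` is a hypothetical helper, and `mask` is used in place of `replace` to blank out the empty CASE cells.

```python
import string

import pandas as pd

df = pd.DataFrame({
    'CASE': ['S1', 'S2', '', 'S3', 'S4', '', ''],
    'Diagnosis': ['no diagnosis', 'fungus', 'squamous lesion', 'fungus',
                  'squamous lesion', 'glandular lesion', 'atypia'],
})

# Turn empty CASE cells into NaN and carry the previous case id forward
df['CASE'] = df['CASE'].mask(df['CASE'] == '').ffill()

# Letter each diagnosis within its case: A, B, C, ...
labels = dict(enumerate(string.ascii_uppercase))
df['DxN'] = df.groupby('CASE').cumcount().map(labels)

# Long form: one row per (CASE, DxN) pair
long_form = df.set_index(['CASE', 'DxN'])['Diagnosis']

# Wide form: one column per letter, empty string where a case has no such diagnosis
wide = long_form.unstack(fill_value='')

print(wide)
```

`long_form` corresponds to the stacked layout in the question and `wide` to the unstacked one, so both views come from the same intermediate series.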
Here is one way, starting from the data in the text format you have:
import pandas as pd
import numpy as np
df = pd.DataFrame([['S1', 'no diagnosis'],
                   ['S2', 'fungus'],
                   ['', 'squamous lesion'],
                   ['S3', 'fungus'],
                   ['S4', 'squamous lesion'],
                   ['', 'glandular lesion']],
                  columns=['CASE', 'Diagnosis'])
# front fill CASE series
df.CASE = df.CASE.replace('', np.nan).ffill()
# pivot data
df = pd.pivot_table(df, index=['CASE'], values=['Diagnosis'],
                    aggfunc=lambda x: list(x)).reset_index()
# split columns of lists into separate columns
df = pd.concat([df[['CASE']], pd.DataFrame(df['Diagnosis'].values.tolist())], axis=1)
#   CASE                0                 1
# 0   S1     no diagnosis              None
# 1   S2           fungus   squamous lesion
# 2   S3           fungus              None
# 3   S4  squamous lesion  glandular lesion
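If you would rather load the file than type the rows in, something like the following may work. It is a sketch only: the question does not say how the file is delimited, so the tab separator, the absence of a header row, and the in-memory `raw` string standing in for the file are all assumptions.

```python
import io

import pandas as pd

# Stand-in for the flat file; assumes tab-delimited text with no header row
raw = (
    "S1\tno diagnosis\n"
    "S2\tfungus\n"
    "\tsquamous lesion\n"
    "S3\tfungus\n"
    "S4\tsquamous lesion\n"
    "\tglandular lesion\n"
    "\tatypia\n"
)

# Empty leading fields are read as NaN by default
df = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                 names=['CASE', 'Diagnosis'])

# Carry each case id down to its continuation rows
df['CASE'] = df['CASE'].ffill()

print(df)
```

With a real file, pass its path to `pd.read_csv` instead of the `io.StringIO` object; for a fixed-width layout, `pd.read_fwf` would be the function to look at.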