How to Add Two Columns of a DataFrame and Rename the Result with the Prefix Name
The original data looks like this:
ID, kgp11274425_A, kgp11274425_HET, kgp5732633_C, kgp5732633_HET, rs707_G, rs707_HET, kgp75_T, kgp75_HET
1 C T G T C A 0 0
2 C C T G A A G T
3 A A G G C G A A
4 G G C C A A T A
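Note:
- As shown above, I need to merge 522 rows and 369 columns (separate maternal and paternal SNP values)
- The SNP IDs differ in length (e.g. kgp11274425 and rs707)
I am working with GWAS data. These are the SNP IDs of our cells, each of which carries one maternal and one paternal chromosome. I want to merge the individual M & F values of each SNP into one column and name it after the SNP ID (kgp11274425_A + kgp11274425_HET = kgp11274425)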
Desired Output:
ID, kgp11274425 kgp5732633 rs707 kgp75
1 CT GT CA 00
2 CC TG AA GT
3 AA GG CG AA
4 GG CC AA TA
Can anyone please help me? Any support is appreciated.
Just change unique_cols as follows:
from io import StringIO

import pandas as pd

data = StringIO("""ID, kgp11274425_M, kgp11274425_F, kgp5732633_M, kgp5732633_F
1, C, T, G, T
2, C, C, T, G
3, A, A, G, G
4, G, G, C, C""")
# skipinitialspace drops the blank after each comma in headers and values
df = pd.read_csv(data, sep=",", skipinitialspace=True)
df = df.set_index('ID')
#here######################################################
sep = '_'
# SNP ID = everything before the first separator, de-duplicated in order
unique_cols = pd.Index([x.split(sep, 1)[0] for x in df.columns]).unique()
#here######################################################
results = []
columns = []
for col in unique_cols:
    # match prefix + separator so one SNP ID cannot match a longer one
    my_cols = [x for x in df.columns if x.startswith(col + sep)]
    # summing strings row-wise concatenates them, e.g. 'C' + 'T' -> 'CT'
    results.append(df[my_cols].sum(axis=1).values)
    columns.append(col)
new_df = pd.DataFrame(results).T
new_df.columns = columns
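The loop above can also be wrapped into a reusable helper. What follows is only a minimal sketch, not the answerer's code: `merge_by_prefix` is a hypothetical name, and it assumes every column to merge is named `<SNP ID>_<suffix>` with no underscore inside the SNP ID itself. It concatenates string Series with `+` rather than `sum(axis=1)`, and it keeps ID as the index. The sample frame is built from a few columns of the question's data.

```python
import pandas as pd


def merge_by_prefix(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Concatenate, row by row, all columns that share the text before `sep`."""
    merged = {}
    for col in df.columns:
        prefix = col.split(sep, 1)[0]
        values = df[col].astype(str)
        # The first column of a group starts it; later columns are appended.
        merged[prefix] = merged[prefix] + values if prefix in merged else values
    return pd.DataFrame(merged, index=df.index)


# A few columns from the question, with ID as the index.
df = pd.DataFrame(
    {
        "kgp11274425_A": ["C", "C", "A", "G"],
        "kgp11274425_HET": ["T", "C", "A", "G"],
        "rs707_G": ["C", "A", "C", "A"],
        "rs707_HET": ["A", "A", "G", "A"],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)
new_df = merge_by_prefix(df)
print(new_df)
```

Because the suffix is never inspected, the same helper should work for the `_M`/`_F` sample and for the `_A`/`_HET` style names in the question.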
Another option is pandas' groupby on the column prefixes:
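temp = df.set_index('ID')
# group the columns by the part of the name before the first '_'
# (groupby(..., axis=1) is deprecated since pandas 2.1; on newer versions
# group the transposed frame instead: temp.T.groupby(...).sum().T)
wrapper = temp.groupby(temp.columns.str.split('_').str[0], axis = 1)
wrapper.sum().reset_index()
# if you want to use a delimiter, you can try
# wrapper = wrapper.apply(lambda x: x.iloc[:, 0].str.cat(x.iloc[:, 1:], sep=','))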
ID kgp11274425 kgp5732633 kgp75 rs707
0 1 CT GT 00 CA
1 2 CC TG GT AA
2 3 AA GG AA CG
3 4 GG CC TA AA
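Note that the groupby result above lists the columns alphabetically (kgp75 before rs707), while the desired output keeps the order of the input. Below is a sketch of one way to keep the input order, assuming it is acceptable to group the transposed frame with `sort=False`; this also does not rely on `axis=1`. The frame is rebuilt from the question's sample data, and the variable names are illustrative only.

```python
import pandas as pd

# Sample data from the question; alleles are kept as strings.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "kgp11274425_A": ["C", "C", "A", "G"],
        "kgp11274425_HET": ["T", "C", "A", "G"],
        "kgp5732633_C": ["G", "T", "G", "C"],
        "kgp5732633_HET": ["T", "G", "G", "C"],
        "rs707_G": ["C", "A", "C", "A"],
        "rs707_HET": ["A", "A", "G", "A"],
        "kgp75_T": ["0", "G", "A", "T"],
        "kgp75_HET": ["0", "T", "A", "A"],
    }
)
temp = df.set_index("ID")

# After transposing, the former column names are row labels, so they can be
# grouped by SNP ID. sort=False keeps the SNP IDs in order of first appearance.
grouped = temp.T.groupby(lambda name: name.split("_")[0], sort=False)

# "".join concatenates the allele strings within each group;
# transposing back restores one row per ID.
merged = grouped.agg("".join).T
print(merged.reset_index())
```

Swapping `"".join` for `",".join` would give a delimited variant such as `C,T`, if that is preferred.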