Splitting and pairing the values of two columns
I have a data frame like this:
Entry name Gene names
A1BG_HUMAN A1BG
M0R009_HUMAN A1BG
F8W9F8_HUMAN A1CF
Q5T0W7_HUMAN A1CF
A1CF_HUMAN A1CF ACF ASP
H0YFH1_HUMAN A2M
A2MG_HUMAN A2M CPAMD5 FWP007
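The first column holds protein names and the second column holds the associated genes. For some proteins, several gene names are listed; these are basically aliases of the first gene in that cell (separated by a single space).
I would like to convert this dataset into a form where every protein name is paired with each of the different gene names, so that I end up with something like this: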
Entry name Gene names
A1BG_HUMAN A1BG
M0R009_HUMAN A1BG
F8W9F8_HUMAN A1CF
F8W9F8_HUMAN ACF
F8W9F8_HUMAN ASP
Q5T0W7_HUMAN A1CF
Q5T0W7_HUMAN ACF
Q5T0W7_HUMAN ASP
A1CF_HUMAN A1CF
A1CF_HUMAN ACF
A1CF_HUMAN ASP
H0YFH1_HUMAN A2M
H0YFH1_HUMAN CPAMD5
H0YFH1_HUMAN FWP007
A2MG_HUMAN A2M
A2MG_HUMAN CPAMD5
A2MG_HUMAN FWP007
I know how to split cells with multiple entries into separate rows, but I am not sure how to pair the proteins in the first column with the different aliases of their genes.
Does anyone know how to do this?
EDIT: I don't want to only split the data into different rows, so cSplit alone will not help me here. Here is an example:
Next to A1CF_HUMAN, the aliases of the A1CF gene (ACF and ASP) are listed. I want to pair not only A1CF_HUMAN with ACF and ASP, but also the other proteins associated with the A1CF gene (F8W9F8_HUMAN and Q5T0W7_HUMAN). Please look at the desired output above to see exactly what I am looking for. I don't think it can be done with a single command.
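Assuming the first element is always the 'key' and the remaining elements are aliases, split the gene names, identify the key, then group all aliases by key and normalize each element so that it contains the full set of aliases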
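# Split each cell into its space-separated gene names
elts = strsplit(df$Gene_names, " ")
# The first name in each cell is the key
keys = sapply(elts, "[[", 1)
# Collect every name that appears under each key
values = split(unlist(elts), rep(keys, lengths(elts)))
# Give each row the de-duplicated alias set for its key (list column)
df$Gene_names = lapply(values, unique)[keys]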
Use the length of each normalized set of gene names to replicate the entry names, and pair them with the unlisted, split gene names
data.frame(
Entry_name = rep(df$Entry_name, lengths(df$Gene_names)),
Gene_name = unlist(df$Gene_names))
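For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch in base R. The input is typed in from the table in the question, and `expand_aliases()` is a hypothetical helper name introduced only for this illustration. It relies on the same assumption as above: the first name in each cell is the key.

```r
# Input as shown in the question: aliases appear on only some rows.
df <- data.frame(
  Entry_name = c("A1BG_HUMAN", "M0R009_HUMAN", "F8W9F8_HUMAN", "Q5T0W7_HUMAN",
                 "A1CF_HUMAN", "H0YFH1_HUMAN", "A2MG_HUMAN"),
  Gene_names = c("A1BG", "A1BG", "A1CF", "A1CF", "A1CF ACF ASP",
                 "A2M", "A2M CPAMD5 FWP007"),
  stringsAsFactors = FALSE
)

# Hypothetical helper wrapping the steps above (assumes first name = key).
expand_aliases <- function(df) {
  elts <- strsplit(df$Gene_names, " ", fixed = TRUE)
  keys <- vapply(elts, `[[`, character(1), 1)
  # All names seen under each key, across every row sharing that key
  values <- split(unlist(elts), rep(keys, lengths(elts)))
  # Look up the de-duplicated alias set for each row's key
  per_row <- lapply(values, unique)[keys]
  data.frame(
    Entry_name = rep(df$Entry_name, lengths(per_row)),
    Gene_name = unlist(per_row, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

result <- expand_aliases(df)
print(result)
```

Unlike the snippet above, this version does not overwrite `df$Gene_names` with a list column, so the input stays unchanged.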
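We can use cSplit together with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)), extract the first word from 'Gene_names' with sub (word(Gene_names, 1) from stringr should also work), and use duplicated to get a logical index. The cumulative sum of the negated logical vector serves as the grouping variable, and within each group we assign to 'Gene_names' the longest string. Then we use cSplit to convert to 'long' format.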
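library(splitstackshape)
library(data.table)
# Group rows by their first gene name and give every row in a group
# the longest Gene_names string found in that group
setDT(df)[, Gene_names := Gene_names[which.max(nchar(Gene_names))],
          by = cumsum(!duplicated(sub("\\s+.*", "", Gene_names)))][]
# Split on spaces into one row per (Entry_name, gene name) pair
cSplit(df, "Gene_names", " ", "long")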
# Entry_name Gene_names
# 1: A1BG_HUMAN A1BG
# 2: M0R009_HUMAN A1BG
# 3: F8W9F8_HUMAN A1CF
# 4: F8W9F8_HUMAN ACF
# 5: F8W9F8_HUMAN ASP
# 6: Q5T0W7_HUMAN A1CF
# 7: Q5T0W7_HUMAN ACF
# 8: Q5T0W7_HUMAN ASP
# 9: A1CF_HUMAN A1CF
#10: A1CF_HUMAN ACF
#11: A1CF_HUMAN ASP
#12: H0YFH1_HUMAN A2M
#13: H0YFH1_HUMAN CPAMD5
#14: H0YFH1_HUMAN FWP007
#15: A2MG_HUMAN A2M
#16: A2MG_HUMAN CPAMD5
#17: A2MG_HUMAN FWP007
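To see what the grouping step does on its own, here is a small base R sketch on the gene-name column from the question. The variable names `first_word`, `group_id` and `longest` are illustrative, and `ave()` stands in for the data.table grouped assignment; it is not part of the answer above.

```r
gene_names <- c("A1BG", "A1BG", "A1CF", "A1CF", "A1CF ACF ASP",
                "A2M", "A2M CPAMD5 FWP007")

# Keep only the first word of each cell (the key)
first_word <- sub("\\s+.*", "", gene_names)

# A new group starts at each first occurrence of a key
group_id <- cumsum(!duplicated(first_word))
print(group_id)
# 1 1 2 2 2 3 3

# Within each group, pick the longest string and repeat it for every row
longest <- ave(gene_names, group_id,
               FUN = function(x) x[which.max(nchar(x))])
print(longest)
```

Note that this relies on the longest string in each group being the one that lists all aliases, which holds for the example data.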
Data
# Gene_names matches the input table in the question (aliases on some rows only)
df <- structure(list(Entry_name = c("A1BG_HUMAN", "M0R009_HUMAN",
    "F8W9F8_HUMAN", "Q5T0W7_HUMAN", "A1CF_HUMAN", "H0YFH1_HUMAN",
    "A2MG_HUMAN"),
    Gene_names = c("A1BG", "A1BG", "A1CF", "A1CF",
    "A1CF ACF ASP", "A2M", "A2M CPAMD5 FWP007")),
    .Names = c("Entry_name", "Gene_names"),
    class = "data.frame", row.names = c(NA, -7L))