Convert a data frame into a matrix of 0s and 1s with R
I have a data frame such as:
Cluster sequence_name
1 species1
1 species1
1 species2
1 species3
1 species3
1 gene1
1 gene2
2 species4
2 species5
2 species5
2 species3
2 gene3
2 gene4
I would like to get a matrix such as:
gene1 gene2 gene3 gene4
species5 0 0 1 1
species4 0 0 1 1
species1 1 1 0 0
species2 1 1 0 0
species3 1 1 1 1
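where 1 means that the gene is present for speciesX and 0 means that it is absent. "Present" means that speciesX occurs in the same cluster as geneX. For example, gene1 is present in cluster 1 together with species1, 2 and 3. Conversely, species5 and 4 are not present in cluster 1.
As you can see, there are several duplicates (a species can be represented several times within the same cluster).
Thanks for your help.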
The real data look like this:
cluster_names seq_names
1 AP_000401.1
1 NP_039001.1
1 Canis_lupus
1 Canis_familiaris
2 YP_0090909.1
2 Mustela_putorius
2 Mustela_furo
2 YP_0909200.1
..
...
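Names that start with two letters, such as AP and NP, are genes.
Genus_species names are species.
Reply to denis:
Here is the head of the real data: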
cluster_names seq_names
1 scf7180005155889:2745-3053(-):Drosophia_melanogaster
1 IDBA_scaffold_72878:85-225:292707-293006(+):Orussu_sp
1 scaffold_3615:40850-41320(-):Canis_lupus
1 scaffold_8697:754-1209(-):homo_sapiens
1 scf7180005155889:72-1908(-):homo_sapiens
1 YP_003969716.1
1 NP_003986717.1
2 scaffold_17536:2745-3053(-):Drosophia_melanogaster
2 scf7180005155889:2000-8900(-):Drosophia_melanogaster
2 scaffold_8697:754-1209(-):homo_sapiens
2 YP_003956764.1
2 YP_004894416.1
2 YP_008958968.1
The output I should get is:
Reply to denis:
> df <- read.table(text = "Cluster sequence_name
+ 1 :Drosophia_melanogaster
+ 1 scf7180005155889:2745-3053(-):Drosophila_melanogaster
+ 1 scf7180005155889:2745-3053(-):Orussu_sp
+ 1 scf7180005155889:2745-3053(-):Canis_lupus
+ 1 scf7180005155889:72-1908(-):Homo_sapiens
+ 1 scf7180005155889:2745-3053(-):Homo_sapiens
+ 1 YP_003970075.1
+ 1 YP_005070075.1
+ 2 scf7180005155889:72-1908(-):Drosophila_melanogaster
+ 2 scf7180005155889:72-1908(-):Drosophila_melanogaster
+ 2 scf7180005155889:72-1908(-):Homo_sapiens
+ 2 YP_039970075.1
+ 2 NP_003900075.1",header = T)
> df <- setDT(df)
> species <- df[grep("[0-9]+\\([+-]\\):[A-z ]+",sequence_name)]
> species[,sequence_name := str_extract(sequence_name,"(?<=[0-9]\\([+-]\\):)[A-z ]+")]
> genes <- df[grep("[0-9]+\\.1",sequence_name)]
> genes[,sequence_name :=sequence_name]
> plouf <- merge(genes,species,by = "Cluster",allow.cartesian=TRUE)
> result <- dcast(plouf,sequence_name.y~sequence_name.x,fun.aggregate = length)
Using 'sequence_name.y' as value column. Use 'value.var' to override
> row.names(result)<-result$sequence_name.y
> result$sequence_name.y<- NULL
> result
NP_003900075.1 YP_003970075.1 YP_005070075.1 YP_039970075.1
1: 0 1 1 0
2: 2 1 1 2
3: 1 2 2 1
4: 0 1 1 0
library(data.table)
library(stringr)
df <- setDT(df)
Here I use data.table. The idea is to create two data tables, one for genes and one for species:
species <- df[grep("species",sequence_name)]
species[,sequence_name := str_extract(sequence_name,"(?<=:)[a-z0-9]+$")]
genes <- df[grep("gene",sequence_name)]
> species
Cluster sequence_name
1: 1 species1
2: 1 species2
3: 1 species3
4: 2 species4
5: 2 species5
6: 2 species3
> genes
Cluster sequence_name
1: 1 gene1
2: 1 gene2
3: 2 gene3
4: 2 gene4
You then want to merge them by cluster with allow.cartesian=TRUE, because the merge key is not a unique identifier in either of the two data tables:
plouf <- merge(genes,species,by = "Cluster",allow.cartesian=TRUE)
Cluster sequence_name.x sequence_name.y
1: 1 gene1 species1
2: 1 gene1 species2
3: 1 gene1 species3
4: 1 gene2 species1
5: 1 gene2 species2
6: 1 gene2 species3
7: 2 gene3 species4
8: 2 gene3 species5
9: 2 gene3 species3
10: 2 gene4 species4
11: 2 gene4 species5
12: 2 gene4 species3
Getting the result is then just a matter of reshaping to wide format while counting occurrences, which you can do here with dcast:
result <- dcast(plouf,sequence_name.y~sequence_name.x,fun.aggregate = length)
sequence_name.y gene1 gene2 gene3 gene4
1: species1 1 1 0 0
2: species2 1 1 0 0
3: species3 1 1 1 1
4: species4 0 0 1 1
5: species5 0 0 1 1
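Since a data.table does not keep row names, one way to obtain an actual matrix with species as row names is the `rownames` argument of data.table's `as.matrix` method. The sketch below rebuilds a small long table by hand (its four pairs are illustrative, not the full example) so that it runs on its own.

```r
library(data.table)

# Toy long table with the same column names as the merged table above.
plouf <- data.table(
  sequence_name.x = c("gene1", "gene1", "gene3", "gene3"),
  sequence_name.y = c("species1", "species3", "species3", "species4")
)

# Count pairs in wide format; missing pairs are filled with 0.
result <- dcast(plouf, sequence_name.y ~ sequence_name.x,
                value.var = "sequence_name.x", fun.aggregate = length)

# Move the species column into the row names of a plain matrix.
mat <- as.matrix(result, rownames = "sequence_name.y")

# Collapse counts to presence/absence (1 if the pair occurs at least once).
mat[] <- as.integer(mat > 0)
mat
```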
And so on. I will let experienced dplyr users propose an equivalent or improved solution with dplyr.
Data:
df <- read.table(text = "Cluster sequence_name
1 Scaffold_1:species1
1 Scaffold_2:species2
1 Scaffold_3:species3
1 gene1
1 gene2
2 Scaffold_4:species4
2 Scaffold_5:species5
2 Scaffold_6:species3
2 gene3
2 gene4",header = T)
Based on the real data you showed:
df <- read.table(text ="cluster_names seq_names
1 scf7180005155889:2745-3053(-):Drosophia_melanogaster
1 scaffold_2484:292707-293006(+):Orussu_sp
1 scaffold_3615:40850-41320(-):Canis_lupus
1 scaffold_8697:754-1209(-):homo_sapiens
1 scf7180005155889:72-1908(-):homo_sapiens
1 YP_003969716.1
1 NP_003986717.1
2 scaffold_17536:2745-3053(-):Drosophia_melanogaster
2 scf7180005155889:2000-8900(-):Drosophia_melanogaster
2 scaffold_8697:754-1209(-):homo_sapiens
2 YP_003956764.1
2 YP_004894416.1
2 YP_008958968.1",header = T)
You should change the step that creates the two data tables to:
setDT(df)  # df was re-read above, so convert it to a data.table again
species <- df[grep("[0-9]+\\([+-]\\):[A-z ]+",seq_names)]
species[,sequence_name := str_extract(seq_names,"(?<=[0-9]\\([+-]\\):)[A-z ]+")]
genes <- df[grep("[0-9]+\\.1",seq_names)]
genes[,sequence_name :=seq_names]
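Here "[0-9]+\\.1" assumes that every gene name ends with .1 and that no species description does. To extract the species information, I assumed that the name always follows (+): or (-): after a number.
But this is a regex question, and if you have trouble with it, it should be asked as a separate question. Your question here was how to reshape your data to get the result. I answered by giving you the steps on the example data: use regular expressions to create the gene and species data tables, merge them, and reshape them.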
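The two patterns can be checked on a few identifiers before running them on the whole table. This is only a sketch: the strings are copied from the sample data, and `[A-Za-z_]` is used here as a stricter stand-in for `[A-z ]`, which is an assumption about what species names may contain.

```r
library(stringr)

# Sample identifiers copied from the data shown above.
ids <- c("scf7180005155889:2745-3053(-):Drosophia_melanogaster",
         "scaffold_2484:292707-293006(+):Orussu_sp",
         "YP_003969716.1")

# Species rows: digits, then a strand marker "(+)" or "(-)", then ":" and a name.
is_species <- grepl("[0-9]+\\([+-]\\):[A-Za-z_]+", ids)

# Gene rows: an accession ending in ".1" (assumption: every gene has version 1).
is_gene <- grepl("[0-9]+\\.1$", ids)

# The look-behind keeps only the name that follows the strand marker.
species_names <- str_extract(ids[is_species], "(?<=[0-9]\\([+-]\\):)[A-Za-z_]+")
species_names
```

If some accessions carry another version suffix (for example `.2`), the gene pattern would need to be relaxed accordingly.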
The rest works as before:
plouf <- merge(genes,species,by = "cluster_names",allow.cartesian=TRUE)
result <- dcast(plouf,sequence_name.y~sequence_name.x,fun.aggregate = length)
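With `fun.aggregate = length`, a species that appears several times in a cluster yields counts above 1. A possible way to get a strict 0/1 table, sketched here on hypothetical toy tables (`geneA`, `species1` and so on are made-up names), is to keep each gene/species pair only once before reshaping.

```r
library(data.table)

# Hypothetical toy input: species2 appears twice in cluster 1.
genes <- data.table(cluster_names = c(1, 1, 2),
                    sequence_name = c("geneA", "geneB", "geneC"))
species <- data.table(cluster_names = c(1, 1, 1, 2),
                      sequence_name = c("species1", "species2", "species2", "species2"))

# Pair every gene with every species row of the same cluster.
plouf <- merge(genes, species, by = "cluster_names", allow.cartesian = TRUE)

# Keep each (gene, species) pair once so that counts cannot exceed 1.
pairs <- unique(plouf[, .(sequence_name.x, sequence_name.y)])

# Wide format: one row per species, one column per gene, 0 where the pair is absent.
result <- dcast(pairs, sequence_name.y ~ sequence_name.x,
                value.var = "sequence_name.x", fun.aggregate = length)
result
```

Passing `value.var` explicitly also avoids the "Using ... as value column" message.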
Using the tidyverse:
# data
df1 <- read.table(text = "Cluster sequence_name
1 species1
1 species1
1 species2
1 species3
1 species3
1 gene1
1 gene2
2 species4
2 species5
2 species5
2 species3
2 gene3
2 gene4", header = TRUE, stringsAsFactors = FALSE)
# so that we know which row is species
species <- paste("species", 1:5, sep = "")
#[1] "species1" "species2" "species3" "species4" "species5"
library(tidyverse)
res <- reduce(split(df1, df1$sequence_name %in% species), left_join, by = "Cluster") %>%
  unique() %>%
  spread(key = "sequence_name.x", value = "Cluster") %>%
  mutate_if(is.numeric, ~ as.numeric(!is.na(.)))
res
# sequence_name.y gene1 gene2 gene3 gene4
# 1 species1 1 1 0 0
# 2 species2 1 1 0 0
# 3 species3 1 1 1 1
# 4 species4 0 0 1 1
# 5 species5 0 0 1 1
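For comparison, the same presence/absence matrix can be sketched in base R with `merge()` and `table()`. This assumes that gene rows are exactly those whose name starts with "gene", which holds for the toy data but not for the real identifiers.

```r
# Same toy data as in the question.
df1 <- data.frame(
  Cluster = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  sequence_name = c("species1", "species1", "species2", "species3", "species3",
                    "gene1", "gene2",
                    "species4", "species5", "species5", "species3",
                    "gene3", "gene4"),
  stringsAsFactors = FALSE
)

# Assumption: gene rows are exactly those whose name starts with "gene".
is_gene <- grepl("^gene", df1$sequence_name)

# Pair every gene (.x) with every species (.y) of the same cluster.
pairs <- merge(df1[is_gene, ], df1[!is_gene, ], by = "Cluster")

# Cross-tabulate species against genes, then reduce counts to 0/1.
counts <- table(pairs$sequence_name.y, pairs$sequence_name.x)
mat <- (unclass(counts) > 0) * 1L
mat
```

The result is an integer matrix with species as row names, so duplicated species rows within a cluster do not inflate the values.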