计算小鼠基因的旁系同源物数量给我错误的频率
Counting the number of paralogues for mouse genes gives me the wrong frequency
我正在尝试使用 BioMart 计算人类蛋白质编码基因的小鼠同源物的旁系同源物数量。但是例如在 'PLIN4' 基因中,它计算出 35,000 个旁系同源物而不是 4 个。
我们认为这是因为某些基因具有一对多的旁系同源物而导致重复。当我 运行 单个基因时,它会返回正确数量的旁系同源物。有没有办法从结果中删除这些重复或绕过这个问题,以便 BioMart 不输出这些重复。
我也曾考虑过 运行 一次只包含一个基因,然后通过设置某种循环来计算它,以便它自动处理列表中的所有基因。
目前我写的代码是:
# Load the biomaRt package:
library(biomaRt)
ensembl_hsapiens <- useMart("ensembl",
dataset = "hsapiens_gene_ensembl")
ensembl_mouse <- useMart("ensembl",
dataset = "mmusculus_gene_ensembl")
# Get all human protein coding genes:
hsapien_PC_genes <- getBM(attributes = c("ensembl_gene_id",
"external_gene_name"),
filters = "biotype",
values = "protein_coding",
mart = ensembl_hsapiens)
ensembl_gene_ID <- hsapien_PC_genes$ensembl_gene_id
# Get mouse homologues
mouse_homologues <- getBM(attributes = c("ensembl_gene_id", "external_gene_name",
"mmusculus_homolog_associated_gene_name"),
filters = "ensembl_gene_id",
values = c(ensembl_gene_ID),
mart = ensembl_hsapiens)
# Get mouse external gene name
mouse_homologues_external_gene_names <- mouse_homologues$mmusculus_homolog_associated_gene_name
mouse_paralogues <- getBM(attributes = c("hsapiens_homolog_associated_gene_name",
"external_gene_name",
"mmusculus_paralog_associated_gene_name"),
filters = "external_gene_name",
values = c(mouse_homologues_external_gene_names) , mart = ensembl_mouse)
# Remove genes with no paralogues
mouse_paralogs_data <- mouse_paralogues[!(is.na(mouse_paralogues$mmusculus_paralog_associated_gene_name)
|
mouse_paralogues$mmusculus_paralog_associated_gene_name==""), ]
# Count paralogues per gene
library(plyr)
count_mouse_paralogues <- count(mouse_paralogs_data, "external_gene_name")
View(count_mouse_paralogues)
希望有人能帮忙
谢谢
杰克
除了评论之外,这是我得到的:
# R version 3.4.2 (2017-09-28)
library(biomaRt) # biomaRt_2.32.1
library(dplyr) # dplyr_0.7.4
# test how many paralogues for gene PLIN4
nrow(mouse_paralogs_data[
mouse_paralogs_data$hsapiens_homolog_associated_gene_name == "PLIN4", ])
# [1] 4
# now summarise for all genes
res <- mouse_paralogs_data %>%
group_by(hsapiens_homolog_associated_gene_name) %>%
summarise(DistinctP = n_distinct(mmusculus_paralog_associated_gene_name))
# test again number of paralogues for gene PLIN4
res[ res$hsapiens_homolog_associated_gene_name == "PLIN4", ]
# # A tibble: 1 x 2
# hsapiens_homolog_associated_gene_name DistinctP
# <chr> <int>
# 1 PLIN4 4
我正在尝试使用 BioMart 计算人类蛋白质编码基因的小鼠同源物的旁系同源物数量。但是例如在 'PLIN4' 基因中,它计算出 35,000 个旁系同源物而不是 4 个。
我们认为这是因为某些基因具有一对多的旁系同源物而导致重复。当我 运行 单个基因时,它会返回正确数量的旁系同源物。有没有办法从结果中删除这些重复或绕过这个问题,以便 BioMart 不输出这些重复。
我也曾考虑过 运行 一次只包含一个基因,然后通过设置某种循环来计算它,以便它自动处理列表中的所有基因。
目前我写的代码是:
# Load the biomaRt package:
library(biomaRt)
ensembl_hsapiens <- useMart("ensembl",
dataset = "hsapiens_gene_ensembl")
ensembl_mouse <- useMart("ensembl",
dataset = "mmusculus_gene_ensembl")
# Get all human protein coding genes:
hsapien_PC_genes <- getBM(attributes = c("ensembl_gene_id",
"external_gene_name"),
filters = "biotype",
values = "protein_coding",
mart = ensembl_hsapiens)
ensembl_gene_ID <- hsapien_PC_genes$ensembl_gene_id
# Get mouse homologues
mouse_homologues <- getBM(attributes = c("ensembl_gene_id", "external_gene_name",
"mmusculus_homolog_associated_gene_name"),
filters = "ensembl_gene_id",
values = c(ensembl_gene_ID),
mart = ensembl_hsapiens)
# Get mouse external gene name
mouse_homologues_external_gene_names <- mouse_homologues$mmusculus_homolog_associated_gene_name
mouse_paralogues <- getBM(attributes = c("hsapiens_homolog_associated_gene_name",
"external_gene_name",
"mmusculus_paralog_associated_gene_name"),
filters = "external_gene_name",
values = c(mouse_homologues_external_gene_names) , mart = ensembl_mouse)
# Remove genes with no paralogues
mouse_paralogs_data <- mouse_paralogues[!(is.na(mouse_paralogues$mmusculus_paralog_associated_gene_name)
|
mouse_paralogues$mmusculus_paralog_associated_gene_name==""), ]
# Count paralogues per gene
library(plyr)
count_mouse_paralogues <- count(mouse_paralogs_data, "external_gene_name")
View(count_mouse_paralogues)
希望有人能帮忙
谢谢
杰克
除了评论之外,这是我得到的:
# R version 3.4.2 (2017-09-28)
library(biomaRt) # biomaRt_2.32.1
library(dplyr) # dplyr_0.7.4
# test how many paralogues for gene PLIN4
nrow(mouse_paralogs_data[
mouse_paralogs_data$hsapiens_homolog_associated_gene_name == "PLIN4", ])
# [1] 4
# now summarise for all genes
res <- mouse_paralogs_data %>%
group_by(hsapiens_homolog_associated_gene_name) %>%
summarise(DistinctP = n_distinct(mmusculus_paralog_associated_gene_name))
# test again number of paralogues for gene PLIN4
res[ res$hsapiens_homolog_associated_gene_name == "PLIN4", ]
# # A tibble: 1 x 2
# hsapiens_homolog_associated_gene_name DistinctP
# <chr> <int>
# 1 PLIN4 4