How to subset the first column (rownames) in R

Question

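I have xy data of gene expression across multiple samples. I want to subset the first column so that I can sort the genes alphabetically and do some other filtering.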

> setwd("C:/Users/Will/Desktop/BIOL3063/R code assignment");
> df = read.csv('R-assignments-dataset.csv', stringsAsFactors = FALSE);

Here is a simplified example of the dataset I'm working with. The full dataset has 270 columns (tissue samples) and 7065 rows (gene names).

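The first column is a list of gene names (A2M, AAAS, AACS, etc.), and each of the other columns is a different tissue sample, so the table shows the expression of each gene in each tissue sample.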

The question being asked is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"

My thinking was to subset the first column (the gene names), then use order() to sort it alphabetically, after which I could use head() to print the first 20.

However, when I try

> genes <- df[1]

it just subsets the first column that contains data (TCGA-A6-2672_TissueA), rather than the one to its left.

Also,

> genes <- df[,df$col1];
> genes;
data frame with 0 columns and 7065 rows
> order(genes);
integer(0)

This seems to create a list of gene names in the RStudio viewer, but I can't perform any operations on it.

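I can't properly target the first column of the data.frame because it has no header, and I run into the same problem when I try to do the same thing with row 1 (the sample names).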

I'm a complete novice with R, and this is part of an assignment I'm working on. It seems I'm missing something basic, but I can't figure out what.

Cheers, guys

Answer 1

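If you're asking what I think you're asking, you just need to do the subsetting inside the as.data.frame function, which will automatically generate a "header", as you put it. It will be called V1, the first variable of the new data frame.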

genes <- as.data.frame(df[,1])
genes$V1
1 A
2 C
3 A
4 B
5 C
6 D
7 A
8 B

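As per the comments below, you can avoid the problem if you remove the comma from the subset syntax. When you select columns from a data.frame, you only need to index the columns, not the rows.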

genes <- df[1]
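
To make the difference between the two forms concrete, here is a minimal sketch on a made-up data frame (the object `toy`, the column names `TissueA` and `TissueB`, and all values are assumptions for illustration, not the asker's data). Indexing without a comma keeps the data frame structure, while indexing with a comma drops a single column to a plain vector unless `drop = FALSE` is given. Note that in neither case are the row names counted as column 1.

```r
# Hypothetical toy data; not the asker's dataset
toy <- data.frame(TissueA = c(1.2, 3.4, 5.6),
                  TissueB = c(7.8, 9.0, 2.1),
                  row.names = c("AACS", "A2M", "AAAS"))

col_df  <- toy[1]                  # one-column data frame, header kept
col_vec <- toy[, 1]                # plain numeric vector, header dropped
col_df2 <- toy[, 1, drop = FALSE]  # comma form that stays a data frame

class(col_df)   # inspect the resulting types
class(col_vec)
names(col_df)
```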

Answer 2

Please include the example of your text file as text rather than as an image.

I created a dataset similar to yours:

    X   Y
1   a   b
2   c   d
3   d   g

Note that your tissue columns have a header but your gene names do not. Therefore they will be interpreted as row names; see ?read.table:

If row.names is not specified and the header line has one less entry than the number of columns, the first column is taken to be the row names.

Reading it into R:

df <- read.table(text = '   X   Y
1   a   b
2   c   d
3   d   g')

So your gene names are not in df[1] but in rownames(df). To get them, use genes <- rownames(df); to add them to the existing df, you can use df$gene <- rownames(df).
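
Applied to the assignment question, this could be sketched roughly as follows. The data frame `expr` and its values are made up to stand in for the real dataset (an assumption, not the asker's data); only base R's rownames(), sort(), order() and head() are used.

```r
# Hypothetical stand-in for the real data: gene names live in the row names
expr <- data.frame(TissueA = c(1.2, 3.4, 5.6, 0.4),
                   TissueB = c(7.8, 9.0, 2.1, 6.3),
                   row.names = c("AACS", "A2M", "ABAT", "AAAS"))

genes <- rownames(expr)       # character vector of gene names
sorted_genes <- sort(genes)   # alphabetical order (depends on the locale)
head(sorted_genes, 20)        # first 20 names (only 4 exist in this toy data)

# Reorder the whole data frame by gene name, keeping the expression values
expr_sorted <- expr[order(rownames(expr)), ]
```

sort() is enough when only the names are needed; order() returns the row positions instead, which is what lets the whole data frame be rearranged.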

There are several ways to convert row names into a column.
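
One base R way to do this, shown as a sketch on made-up data (the object `toy`, the column name `gene`, and the values are assumptions for illustration), is to bind the row names in front of the data frame and then reset the row names:

```r
# Hypothetical toy data with gene names as row names
toy <- data.frame(TissueA = c(1.2, 3.4),
                  TissueB = c(7.8, 9.0),
                  row.names = c("A2M", "AAAS"))

# Put the row names in front as an ordinary column called "gene"
with_gene <- cbind(gene = rownames(toy), toy)

# Reset the row names to the default 1, 2, ...
rownames(with_gene) <- NULL
```

Whether `gene` ends up as a character vector or a factor depends on the R version's stringsAsFactors default.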
