How to keep row names for hierarchical clustering on imported csv files

Question

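I want to perform hierarchical clustering on data imported into R from a .csv file. I can't keep the first column of row names, so the tips of my dendrogram end up without names, which makes the tree useless for downstream analysis and for linking to metadata.

When I import the .csv file and pass a data frame that still contains the first column of row names to the dist function, I get a warning: "Warning message: In dist(as.matrix(df)) : NAs introduced by coercion". I found a previous Stack Overflow question that addresses this: "NAs introduced by coercion" during Cluster Analysis in R. The solution given there is to delete the row names. But that also removes the tip labels from the resulting distance matrix, and I need them to interpret the dendrogram and to link to metadata downstream (e.g. to color the dendrogram tips or to build a heatmap based on other variables).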

# Generate dataframe with example numbers
Samples <- c('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D', 'Sample_E')
Variable_A <- c(0, 1, 1, 0, 1)
Variable_B <- c(0, 1, 1, 0, 1)
Variable_C <- c(0, 0, 1, 1, 1)
Variable_D <- c(0, 0, 1, 1, 0)
Variable_E <- c(0, 0, 1, 1, 0)
df <- data.frame(Samples, Variable_A, Variable_B, Variable_C, Variable_D, Variable_E, row.names = 1)
df
# generate distance matrix
d <- dist(as.matrix(df))
# apply hierarchical clustering
hc <- hclust(d)
# plot dendrogram
plot(hc)

Everything works fine. But suppose I want to import my real data from a file...

# writing the example dataframe to file
write.csv(df, file = "mock_df.csv")

# importing a file
df_import <- read.csv('mock_df.csv', header=TRUE)

Using the same code as above, I no longer get the original row names:

# generating distance matrix for imported file
d2 <- dist(as.matrix(df_import))
# apply hierarchical clustering
hc2 <- hclust(d2)
# plot dendrogram
plot(hc2)

Everything works for the df created in R, but I lose the row names for the imported data. How can I fix this?

Answer 1

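# Build the example data frame; the first column (Samples) becomes the row names
Samples <- c('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D', 'Sample_E')
Variable_A <- c(0, 1, 1, 0, 1)
Variable_B <- c(0, 1, 1, 0, 1)
Variable_C <- c(0, 0, 1, 1, 1)
Variable_D <- c(0, 0, 1, 1, 0)
Variable_E <- c(0, 0, 1, 1, 0)
df <- data.frame(Samples, Variable_A, Variable_B, Variable_C, Variable_D, Variable_E, row.names = 1)
df
# Cluster the data frame created in R
d <- dist(as.matrix(df))
hc <- hclust(d)
plot(hc)
# Write the data frame to file, including its row names
write.csv(df, file = "mock_df.csv", row.names = TRUE)
# Read it back, using the first column of the file as the row names
df_import <- read.table('mock_df.csv', header = TRUE, row.names = 1, sep = ",")
# Cluster the imported data; the dendrogram tips keep the sample names
d2 <- dist(as.matrix(df_import))
hc2 <- hclust(d2)
plot(hc2)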
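
To see why the plain import loses the labels, it can help to compare the two data frames directly. The following is a minimal sketch, assuming the same example data; the temporary file path and the names df_default, df_named and m_default are introduced here only for illustration.

```r
# Rebuild the example data with the sample names as row names.
df <- data.frame(
  Variable_A = c(0, 1, 1, 0, 1),
  Variable_B = c(0, 1, 1, 0, 1),
  Variable_C = c(0, 0, 1, 1, 1),
  Variable_D = c(0, 0, 1, 1, 0),
  Variable_E = c(0, 0, 1, 1, 0),
  row.names = c("Sample_A", "Sample_B", "Sample_C", "Sample_D", "Sample_E")
)

# A temporary file is used only to keep this sketch self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(df, file = csv_path)

# Default import: the sample names land in an ordinary column,
# and the row names are replaced by automatic ones.
df_default <- read.csv(csv_path, header = TRUE)
names(df_default)[1]     # name given to the unnamed first column
rownames(df_default)     # automatic row names

# The text column turns the whole matrix into a character matrix,
# which is what dist() then has to coerce back to numbers.
m_default <- as.matrix(df_default)
is.character(m_default)

# Import with row.names = 1: the first column becomes the row names
# and only the numeric variables remain as data.
df_named <- read.csv(csv_path, header = TRUE, row.names = 1)
rownames(df_named)
is.numeric(as.matrix(df_named))
```

In the default import the sample names are just data, so dist() has nothing to use as labels and warns when it converts the text to numbers.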

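In other words, pass row.names = 1 when reading the file, so that the first column becomes the row names instead of an ordinary data column. read.table with sep = "," is used here; read.csv accepts the same row.names argument: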

df_import <- read.table('mock_df.csv', header=TRUE,row.names=1,sep=",")
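
Once the row names survive the import, they travel through dist() into the labels of the hclust object, which is what makes linking to metadata possible. A minimal sketch, assuming the same example data: the metadata table and its Sample and Group columns are hypothetical and stand in for whatever annotation is available, and a temporary file replaces mock_df.csv only to keep the example self-contained.

```r
# Rebuild the example data with the sample names as row names.
df <- data.frame(
  Variable_A = c(0, 1, 1, 0, 1),
  Variable_B = c(0, 1, 1, 0, 1),
  Variable_C = c(0, 0, 1, 1, 1),
  Variable_D = c(0, 0, 1, 1, 0),
  Variable_E = c(0, 0, 1, 1, 0),
  row.names = c("Sample_A", "Sample_B", "Sample_C", "Sample_D", "Sample_E")
)
csv_path <- tempfile(fileext = ".csv")
write.csv(df, file = csv_path, row.names = TRUE)

# Import with the first column as row names, then cluster.
df_import <- read.csv(csv_path, header = TRUE, row.names = 1)
d2 <- dist(as.matrix(df_import))
hc2 <- hclust(d2)
hc2$labels   # sample names carried over from the row names

# Hypothetical metadata, deliberately in a different order than the data.
metadata <- data.frame(
  Sample = c("Sample_C", "Sample_A", "Sample_E", "Sample_B", "Sample_D"),
  Group  = c("treated", "control", "treated", "treated", "control"),
  stringsAsFactors = FALSE
)

# Look up each tip label in the metadata by name rather than by position.
tip_groups <- metadata$Group[match(hc2$labels, metadata$Sample)]
names(tip_groups) <- hc2$labels

# Show the group next to each sample name on the dendrogram tips.
plot(hc2, labels = paste(hc2$labels, tip_groups, sep = " | "))
```

Matching by name with match() avoids relying on the metadata rows being in the same order as the clustered data.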

r

hierarchical-clustering

distance-matrix