Create a character vector column of predefined text and bind it to existing dataframe using rbind or bind_rows

Question

Good day,

I am going to pose two [likely] very trivial questions for your esteemed commentary.

Question #1

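I have a relatively tidy df (dat) of dim 10299 x 563. The 563 variables common to both datasets that [created] dat are 'subject' (numeric), 'label' (numeric), and 3:563 (variable names from a text file). Observations 1:2947 are from a 'test' dataset, whereas observations 2948:10299 are from a 'training' dataset.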

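I want to insert a column (header = 'type') into dat that basically consists of the string test for rows 1:2947 and the string train for rows 2948:10299, so that I can later group on dataset, or apply other similar aggregate functions, in dplyr/tidyr.

I created a test df (testdf = 1:10299: dim(testdf) = 10299 x 1) and then: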

testdf[1:2947, "type"] <- c("test")
testdf[2948:10299, "type"] <- c("train")

> head(testdf, 2); tail(testdf, 2)
  X1.10299 type
1        1 test
2        2 test
      X1.10299  type
10298    10298 train
10299    10299 train

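So I don't really like that there is now a column X1.10299.

Questions:

Is there a better and more expedient way to create a column that holds what I'm looking for in my use case above?

What is a good way to actually insert that column into 'dat' so that I can use it later for grouping with dplyr?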

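Question #2

The way I got my [nearly] tidy df (dat) from above was by taking two dfs (test and train) of dim 2947 x 563 and 7352 x 563, respectively, and rbinding them together.

I confirmed that all of my variable names are present after the binding like so: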

test.names <- names(test)
train.names <- names(train)
identical(test.names, train.names)
> TRUE

What is interesting, and of main concern, is that if I try to use the bind_rows function from 'dplyr' to perform the same binding exercise:

dat <- bind_rows(test, train)

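it returns a data frame that apparently keeps all my observations (x: 10299), but now my variable count has dropped from 563 to 470!

Questions:

Does anyone know why my variables are being chopped?

Is this the best way to combine two dfs of the same structure for later slicing/dicing with dplyr/tidyr?

Thank you for your time and consideration of these matters.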

Sample test/train dfs for review (the leftmost numbers are df indices):

test df: test[1:10, 1:5]

   subject labels tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1        2      5         0.2571778       -0.02328523       -0.01465376
2        2      5         0.2860267       -0.01316336       -0.11908252
3        2      5         0.2754848       -0.02605042       -0.11815167
4        2      5         0.2702982       -0.03261387       -0.11752018
5        2      5         0.2748330       -0.02784779       -0.12952716
6        2      5         0.2792199       -0.01862040       -0.11390197
7        2      5         0.2797459       -0.01827103       -0.10399988
8        2      5         0.2746005       -0.02503513       -0.11683085
9        2      5         0.2725287       -0.02095401       -0.11447249
10       2      5         0.2757457       -0.01037199       -0.09977589

train df: train[1:10, 1:5]

   subject label tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1        1     5         0.2885845      -0.020294171        -0.1329051
2        1     5         0.2784188      -0.016410568        -0.1235202
3        1     5         0.2796531      -0.019467156        -0.1134617
4        1     5         0.2791739      -0.026200646        -0.1232826
5        1     5         0.2766288      -0.016569655        -0.1153619
6        1     5         0.2771988      -0.010097850        -0.1051373
7        1     5         0.2794539      -0.019640776        -0.1100221
8        1     5         0.2774325      -0.030488303        -0.1253604
9        1     5         0.2772934      -0.021750698        -0.1207508
10       1     5         0.2805857      -0.009960298        -0.1060652

Actual code (ignore the function calls; I am doing most of the testing via the console).

The data set I'm using with this code: http://archive.ics.uci.edu/ml/machine-learning-databases/00240/

run_analysis <- function () {
  #Vars available for use throughout the function that should be preserved
  vars <- read.table("features.txt", header = FALSE, sep = "")
  lookup_table <- data.frame(activitynum = c(1,2,3,4,5,6),
                             activity_label = c("walking", "walking_up",
                                                "walking_down", "sitting",
                                                "standing", "laying"))

  test <- test_read_process(vars, lookup_table)
  train <- train_read_process(vars, lookup_table)
}

test_read_process <- function(vars, lookup_table) {
  #read in the three documents for cbinding later
  test.sub <- read.table("test/subject_test.txt", header = FALSE)
  test.labels <- read.table("test/y_test.txt", header = FALSE)
  test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")

  #cbind the cols together and set remaining colNames to var names in vars
  test.dat <- cbind(test.sub, test.labels, test.obs)
  colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))

  #Use lookup_table to set the "test_labels" string values that correspond
  #to their integer IDs
  #test.lookup <- merge(test, lookup_table, by.x = "labels",
  #                     by.y ="activitynum", all.x = T)

  #Remove temporary symbols from globalEnv/memory
  rm(test.sub, test.labels, test.obs)

  #return
  return(test.dat)
}

train_read_process <- function(vars, lookup_table) {
  #read in the three documents for cbinding
  train.sub <- read.table("train/subject_train.txt", header = FALSE)
  train.labels <- read.table("train/y_train.txt", header = FALSE)
  train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")

  #cbind the cols together and set remaining colNames to var names in vars
  train.dat <- cbind(train.sub, train.labels, train.obs)
  colnames(train.dat) <- c("subject", "label", as.character(vars[,2]))

  #Clean up temporary symbols from globalEnv/memory
  rm(train.sub, train.labels, train.obs, vars)

  return(train.dat)
}

Answer 1

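The problem you're facing stems from the fact that you have duplicated names in the list of variables you use to create your data frame objects. If you make sure the column names are unique and shared between the objects, the code will run. I've included a fully working example based on the code you used above (fixes and various edits are noted in the comments):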

vars <- read.table(file="features.txt", header=F, stringsAsFactors=F)

##  FRS: This is the source of original problem:
duplicated(vars[,2])
vars[317:340,2]
duplicated(vars[317:340,2])
vars[396:419,2]

##  FRS: I edited the following to both account for your data and variable
##    issues:
test_read_process <- function() {
  #read in the three documents for cbinding later
  test.sub <- read.table("test/subject_test.txt", header = FALSE)
  test.labels <- read.table("test/y_test.txt", header = FALSE)
  test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")

  #cbind the cols together and set remaining colNames to var names in vars
  test.dat <- cbind(test.sub, test.labels, test.obs)  
  #colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
  colnames(test.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))

  return(test.dat)
}

train_read_process <- function() {
  #read in the three documents for cbinding
  train.sub <- read.table("train/subject_train.txt", header = FALSE)
  train.labels <- read.table("train/y_train.txt", header = FALSE)
  train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")

  #cbind the cols together and set remaining colNames to var names in vars
  train.dat <- cbind(train.sub, train.labels, train.obs)    
  #colnames(train.dat) <- c("subject", "labels", as.character(vars[,2]))
  colnames(train.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))

  return(train.dat)
}


test_df <- test_read_process()
train_df <- train_read_process()

identical(names(test_df), names(train_df))


library("dplyr")

## FRS: These could be piped together but I've kept them separate for clarity:
train_df %>%
  mutate(test="train") -> 
  train_df

test_df %>%
  mutate(test="test") -> 
  test_df

test_df %>% 
  bind_rows(train_df) -> 
  out_df

head(out_df)
out_df

##  FRS: You can set your column names to those of the original 
##    variable list but you still have duplicates to deal with:
names(out_df) <- c("subject", "labels", as.character(vars[,2]), "test")

duplicated(names(out_df))
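
For what it's worth, here is a minimal sketch of the same idea on toy data rather than the real files. The data frames test_toy and train_toy and their values are made up for illustration. make.unique() is one base R option for de-duplicating names (the code above sidesteps the issue with V1, V2, ... instead), and passing named data frames to bind_rows() with .id = "type" is one possible way to get the test/train label column asked about in Question #1. Treat it as a sketch under those assumptions, not the only way to do it.

```r
library(dplyr)

# Hypothetical stand-ins for the real test/train data; note the repeated
# column name "x", which mimics the duplicated entries in features.txt.
test_toy <- setNames(
  data.frame(c(2, 2), c(0.25, 0.28), c(-0.02, -0.01)),
  c("subject", "x", "x")
)
train_toy <- setNames(
  data.frame(c(1, 1, 1), c(0.28, 0.27, 0.29), c(-0.02, -0.01, -0.03)),
  c("subject", "x", "x")
)

# Check for duplicated column names before binding
has_dups <- any(duplicated(names(test_toy)))

# make.unique() appends .1, .2, ... to repeated names so every column is distinct
names(test_toy)  <- make.unique(names(test_toy))
names(train_toy) <- make.unique(names(train_toy))

# Naming the inputs and supplying .id adds a column recording each row's source
out_toy <- bind_rows(test = test_toy, train = train_toy, .id = "type")

# The new column can then be used for grouping or counting
count(out_toy, type)

# Alternative for a data frame that is already combined: build the labels
# with rep(), repeating each string once per row of its source
dat_toy <- rbind(test_toy, train_toy)
dat_toy$type <- rep(c("test", "train"),
                    times = c(nrow(test_toy), nrow(train_toy)))
```

With the real data, the rep() variant would use times = c(2947, 7352), matching the row counts given in the question.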

r

dataframe

rbind

cbind

dplyr