Is there an idiomatic R way to normalize a data frame?

Question

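The problem is as follows: we have a CSV file in a somewhat unusual format. R is huge, and I am surely missing some short solution.

Given a file, we read it and get a data frame of the following form: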

# id, file, topic, proportion, [topic, proportion]*
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03

Is there a short way to convert it into this data frame:

id      file topic proportion
 0 file1.txt     0       0.01
 1 file2.txt     0       0.01
 1 file2.txt     1       0.03

where we have a fixed number of columns? The number of topic-proportion pairs is not defined in advance and can be very large. Thanks!

Answer 1

Here is one way to proceed. I assume data holds the path to the file saved as .csv:

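library(plyr)

# Read the ragged CSV; read.csv pads short rows with NA (fill = TRUE by default)
df        = read.csv(data)
names     = c("id","file","topic","proportion")
# Take id, file and the topic/proportion pair that starts at column u
extractDF = function(u) setNames(df[,c(1,2,u,u+1)], names)

# Stack the pairs starting at columns 3, 5, ... into one data frame
newDF = ldply(seq(3,length(df)-1,by=2), extractDF)

# Drop the rows that only hold NA padding
newDF[complete.cases(newDF),]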

#  id      file topic proportion
#1  0 file1.txt     0       0.01
#2  1 file2.txt     0       0.01
#4  1 file2.txt     1       0.03

The data is as follows, saved in CSV format:

# id, file, topic, proportion, [topic, proportion]* 
0,file1.txt,0,0.01 
1,file2.txt,0,0.01,1,0.03
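
One caveat when reading such a file: read.csv decides how many columns to create from the first five lines of input, so a wider row further down may not be read as intended. A minimal base R sketch, assuming the file has no header line (the temporary file and its contents are made up for illustration), sizes the table from the widest row via count.fields and then stacks the pairs without plyr:

```r
# Hypothetical sample file, written to a temporary path for illustration only
path <- tempfile(fileext = ".csv")
writeLines(c(
  "0,file1.txt,0,0.01",
  "1,file2.txt,0,0.01,1,0.03"
), path)

# The widest row determines how many columns to read
n_cols <- max(count.fields(path, sep = ","), na.rm = TRUE)
wide <- read.csv(path, header = FALSE, fill = TRUE,
                 col.names = paste0("V", seq_len(n_cols)))

# Stack each (topic, proportion) pair under the fixed id/file columns
pair_starts <- seq(3, n_cols - 1, by = 2)
long <- do.call(rbind, lapply(pair_starts, function(u) {
  setNames(wide[, c(1, 2, u, u + 1)],
           c("id", "file", "topic", "proportion"))
}))

# Drop the NA padding rows and sort by id, then topic
long <- long[complete.cases(long), ]
long <- long[order(long$id, long$topic), ]
rownames(long) <- NULL
long
```

Passing col.names explicitly is what makes read.csv use the full width; the rest mirrors the column-pair stacking shown above.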

Answer 2

You can try merged.stack from my "splitstackshape" package.

Assuming this is your starting data...

mydf <- read.table(
  text = "id, file, topic, proportion, topic, proportion
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03", 
  header = TRUE, sep = ",", fill = TRUE) 
mydf
#   id      file topic proportion topic.1 proportion.1
# 1  0 file1.txt     0       0.01      NA           NA
# 2  1 file2.txt     0       0.01       1         0.03

You just need to do...

library(splitstackshape)
merged.stack(mydf, var.stubs = c("topic", "proportion"), 
             sep = "var.stubs")[, .time_1 := NULL][]
#    id      file topic proportion
# 1:  0 file1.txt     0       0.01
# 2:  0 file1.txt    NA         NA
# 3:  1 file2.txt     0       0.01
# 4:  1 file2.txt     1       0.03

If you don't want the rows that contain NA values, wrap the whole thing in na.omit.

na.omit(
  merged.stack(mydf, var.stubs = c("topic", "proportion"), 
               sep = "var.stubs")[, .time_1 := NULL])
#    id      file topic proportion
# 1:  0 file1.txt     0       0.01
# 2:  1 file2.txt     0       0.01
# 3:  1 file2.txt     1       0.03
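
For comparison, the same reshaping can be sketched in base R with reshape(), without any extra package. This assumes the column names that read.table produces for the duplicated header (topic, topic.1, proportion, proportion.1); it is an illustration, not part of the splitstackshape approach:

```r
# Same starting data; read.table makes the duplicate header names unique
mydf <- read.table(
  text = "id,file,topic,proportion,topic,proportion
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03",
  header = TRUE, sep = ",", fill = TRUE)

# Stack the two (topic, proportion) column pairs into long format
long <- reshape(
  mydf, direction = "long",
  varying = list(c("topic", "topic.1"), c("proportion", "proportion.1")),
  v.names = c("topic", "proportion"),
  idvar   = c("id", "file"))

# Keep the four target columns, drop NA padding, sort by id then topic
long <- na.omit(long[, c("id", "file", "topic", "proportion")])
long <- long[order(long$id, long$topic), ]
rownames(long) <- NULL
long
```

With many pairs, the varying lists would have to be built programmatically (for example with grep on the column names), which is where a helper such as merged.stack saves typing.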

