Is there an idiomatic R way to normalize a dataframe?
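The problem is as follows: we have a CSV file whose data is in a somewhat unusual format. R is a big language, and I am surely missing some short solution.
Given a file, we read it and get a data frame of the following form: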
# id, file, topic, proportion, [topic, proportion]*
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03
Is there a short way to convert it into this data frame:
id file topic proportion
0 file1.txt 0 0.01
1 file2.txt 0 0.01
1 file2.txt 1 0.03
where we have a fixed number of columns? The number of topic-proportion pairs is not defined in advance and can be very large. Thanks!
Here is one way to proceed. I assume data holds the path to the file, saved as .csv:
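library(plyr)
df = read.csv(data)
names = c("id","file","topic","proportion")
# take id, file and the (topic, proportion) pair that starts at column u
extractDF = function(u) setNames(df[,c(1,2,u,u+1)], names)
# stack one slice per pair; pairs start at columns 3, 5, ...
newDF = ldply(seq(3,length(df)-1,by=2), extractDF)
# drop the rows that were padded with NA
newDF[complete.cases(newDF),]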
# id file topic proportion
#1 0 file1.txt 0 0.01
#2 1 file2.txt 0 0.01
#4 1 file2.txt 1 0.03
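For readers who would rather not load plyr, the same pair-by-pair stacking can probably be done with base R alone. The following is only a sketch: the data frame wide and the names col_names, pair_starts, pieces and long are hypothetical, and wide is built by hand here to stand in for whatever reading the file produces.

```r
# Hypothetical stand-in for the wide data frame read from the file.
wide <- data.frame(
  id = c(0, 1),
  file = c("file1.txt", "file2.txt"),
  topic = c(0, 0),
  proportion = c(0.01, 0.01),
  topic.1 = c(NA, 1),
  proportion.1 = c(NA, 0.03)
)

col_names <- c("id", "file", "topic", "proportion")

# Column index at which each (topic, proportion) pair starts: 3, 5, ...
pair_starts <- seq(3, ncol(wide) - 1, by = 2)

# Build one four-column data frame per pair, then bind them row-wise.
pieces <- lapply(pair_starts, function(u) {
  setNames(wide[, c(1, 2, u, u + 1)], col_names)
})
long <- do.call(rbind, pieces)

# Drop the NA padding rows, then sort by id and topic.
long <- long[complete.cases(long), ]
long <- long[order(long$id, long$topic), ]
rownames(long) <- NULL
long
```

Sorting at the end is optional; it only groups the rows of each file together, which the plyr version above does not do.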
The data is as follows, saved in csv format:
# id, file, topic, proportion, [topic, proportion]*
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03
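Since the rows have different numbers of fields and the first line is a comment rather than a real header, it may help to work out the column count before reading. The sketch below is one possible way to do that, assuming the first line really starts with #; the temporary file and the names path, n_cols, n_pairs, col_names and wide are made up for illustration.

```r
# Write the sample data to a temporary file (stand-in for the real CSV).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# id, file, topic, proportion, [topic, proportion]*",
  "0,file1.txt,0,0.01",
  "1,file2.txt,0,0.01,1,0.03"
), path)

# Field count of the widest row; '#' is the default comment character,
# so the first line is not counted as data.
n_cols <- max(count.fields(path, sep = ","), na.rm = TRUE)

# Two leading columns, then numbered (topic, proportion) pairs.
n_pairs <- (n_cols - 2) / 2
col_names <- c(
  "id", "file",
  paste0(rep(c("topic", "proportion"), times = n_pairs),
         ".", rep(seq_len(n_pairs), each = 2))
)

# fill = TRUE pads the shorter rows with NA up to n_cols columns.
wide <- read.table(path, sep = ",", header = FALSE, col.names = col_names,
                   fill = TRUE, comment.char = "#", stringsAsFactors = FALSE)
wide
```

Passing explicit column names matters because read.table otherwise guesses the number of columns from the first few lines only, which can miss a longer row further down a large file.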
You can try merged.stack from my "splitstackshape" package.
Assuming this is your starting data....
mydf <- read.table(
text = "id, file, topic, proportion, topic, proportion
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03",
header = TRUE, sep = ",", fill = TRUE)
mydf
# id file topic proportion topic.1 proportion.1
# 1 0 file1.txt 0 0.01 NA NA
# 2 1 file2.txt 0 0.01 1 0.03
You just need to do....
library(splitstackshape)
merged.stack(mydf, var.stubs = c("topic", "proportion"),
sep = "var.stubs")[, .time_1 := NULL][]
# id file topic proportion
# 1: 0 file1.txt 0 0.01
# 2: 0 file1.txt NA NA
# 3: 1 file2.txt 0 0.01
# 4: 1 file2.txt 1 0.03
If you don't want rows with NA values in them, wrap the whole thing in na.omit.
na.omit(
merged.stack(mydf, var.stubs = c("topic", "proportion"),
sep = "var.stubs")[, .time_1 := NULL])
# id file topic proportion
# 1: 0 file1.txt 0 0.01
# 2: 1 file2.txt 0 0.01
# 3: 1 file2.txt 1 0.03