R: Split space-delimited string data into columns
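I have a large data frame with a single column. Each value holds several space-delimited numeric fields, which I need to extract and organize into columns.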
<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 36621.09375 36621.09375 36621.09375 36621.09375 Intensity=-14.902734633213136 Periodicity=0.853448275862069 Shape=- CallType=cf-n Species=Pipistrellus kuhlii (77%), Pipistrellus nathusii (77%) Custom=false />
Here is some more information about my data:
'data.frame':39 obs. of 1 variable $ x1: Factor w/ 120 levels "
<double>25.318181818181806</double>",..: 66 67 68 69 70 71 72 73 74 75...
I need something like this:
  call_begin          call_end            maxfrec         minfrec         peakfrec
1 0.59170816000000048 0.60006400000000049 531.005.859.375 433.349.609.375 482.177.734.375
2 0.7636582400000006  0.77135872000000061 531.005.859.375 42.724.609.375  469.970.703.125
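I have some ideas for how to do this: first split the string into columns with strsplit, then extract the numbers with substr, and finally build a table with rbind. I found some threads on related topics, but I could not reproduce them on my data.
Any help would be appreciated. Please let me know if anything is unclear.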
gsub is my favorite.
strList = list("<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375", "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125")
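dataExtract <- function(str){
  # Keep only the five captured numbers, separated by spaces.
  # Backreferences must be written as "\\1" inside an R string.
  str = gsub("^<Call Begin=([0-9.]+) End=([0-9.]+) MaxFreq=([0-9.]+) MinFreq=([0-9.]+) PeakFreq=([0-9.]+)", "\\1 \\2 \\3 \\4 \\5", str)
  str = unlist(strsplit(str, " "))
  return(sapply(str, FUN=as.numeric, USE.NAMES=FALSE))
}
#dataExtract(strList[[1]])
# Each record contributes five consecutive values, so fill the matrix row by row
res = matrix(unlist(lapply(strList, FUN=dataExtract)), ncol=5, byrow=TRUE)
colnames(res) = c("Call Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")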
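It all depends on how strictly your data follows the pattern. For the data you provided, you can split on both " " and "=" in a single pass and pick out the relevant columns in one go.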
result <- do.call(rbind,lapply(strList,function(s) {strsplit(s,split = "[ =]")[[1]][c(3,5,7,9,11)]}))
You can then name the columns however you like. Since the result is a character matrix, use colnames() rather than names().
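As a possible follow-up to the one-liner above, here is a minimal sketch that converts the character matrix to numeric and names the columns. The column names are taken from the desired output in the question, res_df is a name introduced here for illustration, and strList is repeated so the block runs on its own.

```r
# Two sample records, trimmed to the five leading fields
strList <- list(
  "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375",
  "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
)

# Split on spaces and "=", then keep the value positions (character matrix)
result <- do.call(rbind, lapply(strList, function(s) {
  strsplit(s, split = "[ =]")[[1]][c(3, 5, 7, 9, 11)]
}))

# mode<- converts the values to numeric while keeping the matrix dimensions
mode(result) <- "numeric"
colnames(result) <- c("call_begin", "call_end", "maxfrec", "minfrec", "peakfrec")

# Data frame with one numeric column per field
res_df <- as.data.frame(result)
```

The fixed positions c(3, 5, 7, 9, 11) only hold when every record starts with the same five fields in the same order.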
A solution similar to the one you describe. It is somewhat more general and does not depend on the number of columns:
text <- '<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375
<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125'
process_line <- function(line) {
  # Split on spaces and drop the leading "<Call" token
  sp <- strsplit(line, ' ')[[1]][-1]
  # Left side of each "=" is the column name, right side is the value
  cn <- sapply(sp, function(x) strsplit(x, "=")[[1]][1])
  data <- sapply(sp, function(x) as.numeric(strsplit(x, "=")[[1]][2]))
  names(data) <- cn
  data
}
t(sapply(strsplit(text, "\n")[[1]], process_line, USE.NAMES = FALSE))
Begin End MaxFreq MinFreq PeakFreq
[1,] 0.5917082 0.6000640 53100.59 43334.96 48217.77
[2,] 0.7636582 0.7713587 53100.59 42724.61 46997.07
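The call strsplit(text, "\n")[[1]] assumes that text is a single string that has not yet been split into lines. If your data is already split into one element per line, pass text directly instead.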
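To illustrate the case where the records are already separate elements, here is a minimal sketch. The vector name records is an assumption for illustration, and process_line is restated in a slightly more compact form (using vapply) so the block is self-contained.

```r
# Same idea as process_line above: split on spaces, then on "="
process_line <- function(line) {
  sp <- strsplit(line, " ")[[1]][-1]           # drop the leading "<Call" token
  kv <- strsplit(sp, "=")                      # list of c(name, value) pairs
  vals <- as.numeric(vapply(kv, `[`, character(1), 2))
  names(vals) <- vapply(kv, `[`, character(1), 1)
  vals
}

# Hypothetical input: one record per element, so no split on "\n" is needed
records <- c(
  "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375",
  "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
)

# sapply returns one column per record, so transpose to get one row per record
res <- t(sapply(records, process_line, USE.NAMES = FALSE))
```

For a factor column in a data frame, as.character() would be needed first, because strsplit() only accepts character input.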
There is no need for regular expressions, because the data can be obtained by splitting the smaller chunks on =.
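The examples above use records trimmed to five fields. In the full record from the question, PeakFreqs= is followed by many space-separated numbers and the column is a factor, so a plain split on spaces would no longer line up. The following is a minimal sketch of one way around that, assuming only the five scalar fields are wanted. It does use a regular expression (regexec with regmatches) in place of splitting; extract_fields and the calls data frame are hypothetical names, and the first sample record is a shortened copy of the one in the question.

```r
# Pull named scalar fields out of each record, ignoring everything else
extract_fields <- function(x, fields = c("Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")) {
  x <- as.character(x)  # factor columns must be converted before matching
  cols <- lapply(fields, function(f) {
    # The leading space and trailing "=" keep "PeakFreq" from matching "PeakFreqs"
    hits <- regmatches(x, regexec(paste0(" ", f, "=(-?[0-9.]+)"), x))
    # Each hit is c(full match, captured number); use NA when a field is absent
    as.numeric(vapply(hits, function(h) if (length(h) == 2) h[2] else NA_character_,
                      character(1)))
  })
  names(cols) <- fields
  as.data.frame(cols)
}

# Illustrative input: a one-column data frame holding the records as a factor
calls <- data.frame(x1 = factor(c(
  "<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39062.5 38452.1484375 Intensity=-14.902734633213136 Periodicity=0.853448275862069 Shape=- CallType=cf-n Custom=false />",
  "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
)))

res <- extract_fields(calls$x1)
```

Matching each field by name means the result does not depend on field order or on how many values follow PeakFreqs=.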