In R, how to split a dataset into columns based on variable start/stop positions?
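I would like to split the rows of a dataset based on multiple pairs of start/stop positions within each row.
I could do this field by field in the ordinary way with the substr command, but that seems like a poor choice.
I actually have 7 datasets that need this, and I was hoping there is a way to define an array/vector of start/stop pairs and then feed it to substr.
Any help or guidance would be great.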
# I have a dataset which contains records like this
string1 <- "08103128827DP 11 513452 131 Markett Hills Rd Coolingford XYZ 9876 14602012476 000000000000000000010784Y00000000000053815"
string2 <- "08203143982DP 12 611218 12 Magicra Waters Rd Huntley XXX 9081 14602012476 000000000000000000010784Y00000000000038443"
# Make a dummy dataset (keep V1 as character, not factor, so substr works)
V1 <- c(string1, string2)
myData <- data.frame(V1, stringsAsFactors = FALSE)
head(myData)
# I would like to split each row of my (typically large) dataset into distinct fields
# substr needs the character column, not the data frame, and positions start at 1
fld_1 <- substr(myData$V1, 1, 3)
fld_2 <- substr(myData$V1, 4, 11)
fld_3 <- substr(myData$V1, 12, 16)
fld_4 <- as.numeric(substr(myData$V1, 187, 198))/100
# The field widths vary, as do the data types
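The fields appear to be separated by 2 or more spaces, while single spaces occur within fields, so replace any run of 2 or more spaces with a tab and then read the result in using tab as the separator:
# " {2,}" matches runs of two or more spaces only, leaving single spaces inside fields alone
read.delim(text = gsub(" {2,}", "\t", as.character(myData$V1)),
as.is = TRUE, header = FALSE)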
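As a minimal sketch of this approach, here it is applied to two made-up records. The strings in `recs` are hypothetical and are not the asker's data; they simply assume that fields are separated by two or more spaces and that single spaces only occur inside a field.

```r
# Hypothetical records: 2+ spaces between fields, single spaces inside a field
recs <- c("081DP  11  131 Markett Hills Rd  Coolingford",
          "082DP  12  12 Magicra Waters Rd  Huntley")

# Replace every run of two or more spaces with a single tab
tabbed <- gsub(" {2,}", "\t", recs)

# Parse the tab-delimited text; read.delim guesses the column types
out <- read.delim(text = tabbed, header = FALSE, as.is = TRUE)
str(out)
```

Because the column types are guessed, a purely numeric field with leading zeros would lose them; passing `colClasses = "character"` keeps every column as text if that matters.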
Assuming you have the exact start and end positions:
# (0) Make a dummy dataset
string1 <- "08103128827DP 11 513452 131 Markett Hills Rd Coolingford XYZ 9876 14602012476 000000000000000000010784Y00000000000053815"
string2 <- "08203143982DP 12 611218 12 Magicra Waters Rd Huntley XXX 9081 14602012476 000000000000000000010784Y00000000000038443"
V1 <- c(string1, string2)
# (1) Define positions (start, stop; 1-based) and variable names
pos <- list("Var 1" = c(1, 13),
"Var 2" = c(22, 23),
"Var 3" = c(32, 37))
# (2) Extract variables as text (substr is vectorised over V1)
vars <- lapply(pos, function(x) {
substr(V1, x[1], x[2])
})
# (3) Convert types; text that is not a number becomes NA with a warning,
# so the positions must match the real layout of the file
vars[["Var 2"]] <- as.numeric(vars[["Var 2"]])
vars[["Var 3"]] <- as.numeric(vars[["Var 3"]])
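Since the question mentions seven datasets, the same idea can be driven from a small specification table, one per dataset. The following is only a sketch under stated assumptions: the records in `recs`, the field names, the positions, and the `spec` and `conv` objects are all hypothetical and chosen for illustration.

```r
# Hypothetical fixed-width records (not the asker's data)
recs <- c("081DP0011000012345",
          "082DP0012000067890")

# Hypothetical field specification: name, start, stop, and target type
spec <- data.frame(
  name  = c("id", "code", "amount"),
  start = c(1, 6, 10),
  stop  = c(5, 9, 18),
  type  = c("character", "integer", "numeric"),
  stringsAsFactors = FALSE
)

# Cut the same start/stop range out of every record, one field at a time
fields <- Map(function(s, e) substr(recs, s, e), spec$start, spec$stop)
names(fields) <- spec$name

# Convert each field to its declared type
conv <- list(character = as.character,
             integer   = as.integer,
             numeric   = as.numeric)
fields <- Map(function(x, t) conv[[t]](trimws(x)), fields, spec$type)

out <- as.data.frame(fields, stringsAsFactors = FALSE)
str(out)
```

Keeping the positions and types in a table means each additional dataset only needs its own `spec`. When reading directly from a file, base R's `read.fwf` does a similar job, although it takes field widths rather than start/stop pairs.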