How to split a single column into multiple columns using range (Start-End) values in R?
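I have a text file containing multiple values, but when it is loaded into R there is no delimiter to separate them. A secondary file defines each column by its start and end position.
I went through the existing solutions but could not find one that covers splitting into multiple columns based on ranges.
The data looks like: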
Column1
--------------------------------------------------------
00000000000102019000000000000000000049491000000000004CAD
00000000000102019000000000000000000049491000000000005CAP
00000000000102019000000000000000000049491000000000023GSP
00000000000102019000000000000000000049491000000000030MUD
The field ranges are defined as:
Field Name | Start | End
--------------------------
COL1 | 1 | 2
COL2 | 13 | 17
COL3 | 18 | 22
....
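I have about 200,000 rows, and each row contains 55 columns defined by ranges like the ones above.
I am not sure how to supply multiple ranges to create a new data frame in which all 55 columns are split out according to their start and end values.
Can anyone help me with this?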
Try the code below. Note that I wrote it based on the data you mentioned.
example <- c("00000000000102019000000000000000000049491000000000004CAD",
             "00000000000102019000000000000000000049491000000000004CAD")
name  <- c("COL1", "COL2", "COL3")
start <- c(1, 13, 18)
end   <- c(2, 17, 22)
# Store the field definitions (the second file) for reference
range_df <- data.frame(Field_name = name, Start = start, End = end)
# Function that splits one string according to the field definitions
split_cols <- function(string) {
  # Number of fields to extract (one per row of `range_df`)
  n <- nrow(range_df)
  # One-row data frame that will hold the extracted fields
  cols <- data.frame(matrix(NA, 1, n))
  for (i in seq_len(n)) {
    # Extract the characters between Start (column 2) and End (column 3)
    # of `range_df` with `substr` and store them in column i of `cols`
    cols[, i] <- substr(string, range_df[i, 2], range_df[i, 3])
  }
  # Return the split string as a one-row data frame
  return(cols)
}
# To apply the function to every row you can use either a `for` loop
# or an apply-style function. Here I used `lapply`
all_data <- lapply(example, split_cols)
# `lapply` returns a list, so bind the one-row data frames together
# with `do.call`
final_df <- do.call("rbind", all_data)
# Finally, take the column names from the field definitions
names(final_df) <- as.character(range_df[, 1])
This code could certainly be improved a lot, but it gets the job done.
Hope it helps.
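With around 200,000 rows, building one data frame per row and then calling `rbind` on all of them is likely to be slow. As a minimal sketch of one possible improvement, the loop can run over the fields instead of the rows, because `substring()` is vectorised over its first argument. The names `lines`, `ranges`, `fields` and `result` are illustrative, and the end of COL3 is assumed to be 22, as in the code above.

```r
# Sample fixed-width records taken from the question
lines <- c(
  "00000000000102019000000000000000000049491000000000004CAD",
  "00000000000102019000000000000000000049491000000000005CAP",
  "00000000000102019000000000000000000049491000000000023GSP",
  "00000000000102019000000000000000000049491000000000030MUD"
)

# Field definitions (assumption: COL3 ends at position 22)
ranges <- data.frame(
  name  = c("COL1", "COL2", "COL3"),
  start = c(1L, 13L, 18L),
  end   = c(2L, 17L, 22L),
  stringsAsFactors = FALSE
)

# One substring() call per field; each call extracts that field
# from every line at once
fields <- Map(function(s, e) substring(lines, s, e),
              ranges$start, ranges$end)
names(fields) <- ranges$name

# Assemble the list of character vectors into a data frame
result <- as.data.frame(fields, stringsAsFactors = FALSE)
print(result)
```

This makes 55 vectorised calls for the full file rather than one function call per row, and the extracted values stay as character strings, so leading zeros are preserved.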
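Here is an approach you can try.
Given:
1) raw_data is your text file
2) mapping is your column-width table
Apply a function row-wise over mapping that extracts the corresponding column from raw_data. The outputs for all rows of mapping together give every column you need.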
raw_data <- data.frame(str_data = c('00000000000102019000000000000000000049491000000000004CAD',
                                    '00000000000102019000000000000000000049491000000000005CAP',
                                    '00000000000102019000000000000000000049491000000000023GSP',
                                    '00000000000102019000000000000000000049491000000000030MUD'))
mapping <- data.frame('columns' = c('COL1', 'COL2', 'COL3'),
                      'start' = c(1, 13, 18),
                      'end' = c(2, 17, 22))
# Function that returns the characters between the start and end indexes
# of one mapping row, for every string in the first column of str_table.
# apply() passes each row as a character vector, so convert to integer.
columns <- function(x, str_table) {
  col <- substr(str_table[, 1], as.integer(x['start']), as.integer(x['end']))
  return(col)
}
# Apply the function `columns` to each row of mapping
tab <- data.frame(apply(mapping, MARGIN = 1, columns, raw_data))
colnames(tab) <- mapping$columns
Here is the output:
COL1 COL2 COL3
1 00 02019 00000
2 00 02019 00000
3 00 02019 00000
4 00 02019 00000
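As an alternative to splitting after loading, base R's `read.fwf()` (in the utils package) reads fixed-width files directly. It expects field widths rather than start and end positions, with negative widths marking characters to skip. The sketch below converts one form to the other. The helper `fwf_widths` is hypothetical, and it assumes the fields are sorted by start position and do not overlap. The temporary file stands in for your real text file.

```r
# Hypothetical helper: turn start/end positions into read.fwf() widths.
# Assumes fields are sorted by start and do not overlap.
fwf_widths <- function(start, end) {
  n <- length(start)
  # Number of characters to skip before each field
  gaps <- start - c(0, end[-n]) - 1
  # Interleave skips (negative) with field lengths (positive)
  widths <- as.vector(rbind(-gaps, end - start + 1))
  # Drop zero-length skips
  widths[widths != 0]
}

mapping <- data.frame(columns = c("COL1", "COL2", "COL3"),
                      start   = c(1, 13, 18),
                      end     = c(2, 17, 22),
                      stringsAsFactors = FALSE)

# Write the sample records to a temporary file (stand-in for the real file)
tmp <- tempfile(fileext = ".txt")
writeLines(c("00000000000102019000000000000000000049491000000000004CAD",
             "00000000000102019000000000000000000049491000000000005CAP",
             "00000000000102019000000000000000000049491000000000023GSP",
             "00000000000102019000000000000000000049491000000000030MUD"),
           tmp)

widths <- fwf_widths(mapping$start, mapping$end)

# colClasses = "character" keeps leading zeros such as "00"
tab_fwf <- read.fwf(tmp, widths = widths,
                    col.names = mapping$columns,
                    colClasses = "character")
print(tab_fwf)
```

Characters beyond the last defined field are ignored, so the mapping does not have to cover the whole line.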