How to split a single column into multiple columns using range (Start-End) values in R?

Question

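I have a text file containing multiple values, but once it is loaded into R there is no delimiter to separate them. A secondary file defines each column by its start and end positions.

I looked through the existing solutions but could not find one that covers splitting into multiple columns based on ranges.

The data looks like this: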

    Column1                                                 
    --------------------------------------------------------
    00000000000102019000000000000000000049491000000000004CAD   
    00000000000102019000000000000000000049491000000000005CAP    
    00000000000102019000000000000000000049491000000000023GSP  
    00000000000102019000000000000000000049491000000000030MUD

The field ranges are defined as:

    Field Name   | Start | End 
    --------------------------
     COL1         | 1     | 2
     COL2         | 13    | 17
     COL3         | 18    | 22
     ....

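I have about 200,000 rows, and each row contains 55 columns defined by ranges like those above.

I am not sure how to supply multiple ranges to create a new data frame in which all 55 columns are split out according to their start and end values.

Can anyone help me with this?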

Answer 1

Try the code below. Note that I wrote it based on the data you described.

    example <- c("00000000000102019000000000000000000049491000000000004CAD","00000000000102019000000000000000000049491000000000004CAD")

    name <- c("COL1","COL2","COL3")
    start <- c(1,13,18)
    end <- c(2,17,22)

    # Store the contents of the second file (the field ranges) for reference
    range_df <- data.frame(Field_name=name,Start=start,End=end)

    # Function that splits one string according to the second file
    split_cols <- function(string){
      # Number of fields (rows of the second file) to iterate over
      n <- nrow(range_df)
      # Create an empty one-row `data.frame` to hold the split values
      cols <- data.frame(matrix(NA,1,n))
      for(i in seq_len(n)){
        # Extract the range of characters with the `substr` function.
        # The start and end positions are defined in `range_df`:
        # column 2 is the `start` character and column 3 the `end`.
        # Each value is saved in its own column of `cols`.
        cols[,i] <- substr(string,range_df[i,2],range_df[i,3])
      }
      # Return the pieces of the split string as a data.frame
      return(cols)
    }

    # To apply the function above to each row you can use either a
    # `for` loop or an apply function. In this case I used `lapply`.
    all_data <- lapply(example,split_cols)


    # `lapply` gets the job done, but the result is a `list`. You can
    # combine it into a single data frame with the do.call function.
    final_df <- do.call("rbind",all_data)

    # Finally, add the column names from the secondary df
    names(final_df) <- as.character(range_df[,1])

This code can certainly be improved a lot, but it gets the job done.
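One possible improvement, sketched here under the assumption that all records fit in memory as a character vector: `substr()` is vectorized over its first argument, so each field can be extracted from every record in a single call instead of building a one-row data frame per record. The names `fields` and `fast_df` are illustrative and not part of the code above.

```r
# Sample records and field ranges copied from the question.
example <- c("00000000000102019000000000000000000049491000000000004CAD",
             "00000000000102019000000000000000000049491000000000005CAP")
range_df <- data.frame(Field_name = c("COL1", "COL2", "COL3"),
                       Start = c(1, 13, 18),
                       End = c(2, 17, 22),
                       stringsAsFactors = FALSE)

# substr() is vectorized over `example`, so each call extracts one field
# from every record at once. Map() pairs each Start with its End.
fields <- Map(function(s, e) substr(example, s, e),
              range_df$Start, range_df$End)

# Name the list elements, then bind them into a data frame of strings.
names(fields) <- range_df$Field_name
fast_df <- as.data.frame(fields, stringsAsFactors = FALSE)
```

This loops over the fields (55 in the question) rather than over the roughly 200,000 rows, which is usually the cheaper direction in R.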

Hope this helps.

Answer 2

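Here is an approach you can try.

Given:
1) raw_data is your text file
2) mapping is your column-width table

Apply a function over the rows of mapping that extracts the corresponding column from raw_data. The outputs for all rows of mapping are the columns you need.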

    raw_data <- data.frame(str_data = c('00000000000102019000000000000000000049491000000000004CAD', 
                                        '00000000000102019000000000000000000049491000000000005CAP', 
                                        '00000000000102019000000000000000000049491000000000023GSP', 
                                        '00000000000102019000000000000000000049491000000000030MUD'))


    mapping = data.frame('columns' = c('COL1', 'COL2', 'COL3'), 
                         'start' = c(1,13,18), 
                         'end' = c(2,17,22))

    # Function that returns the column between the start and end indexes.
    # apply() passes each row of mapping as a character vector, so the
    # positions are converted back to integers before calling substr().
    columns = function(x, str_table) {
      col = substr(str_table[,1], as.integer(x['start']), as.integer(x['end']))
      return(col)
    }

    # Apply the function columns to each row of mapping
    tab = data.frame(apply(mapping, MARGIN = 1,columns, raw_data))
    colnames(tab) <- mapping$columns

Here is the output:

      COL1  COL2  COL3
    1   00 02019 00000
    2   00 02019 00000
    3   00 02019 00000
    4   00 02019 00000
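Since the data comes from a file, a different technique from the one above may also be worth considering: base R's `read.fwf()` reads fixed-width fields directly, and a negative width skips that many characters. The sketch below derives the widths from the start and end positions, assuming the ranges are sorted and do not overlap. The names `lines`, `gaps`, `field_w` and `widths` are illustrative, and a `textConnection()` stands in for the real file so the snippet is self-contained.

```r
# Stand-in for the real text file: the four records from the question.
lines <- c("00000000000102019000000000000000000049491000000000004CAD",
           "00000000000102019000000000000000000049491000000000005CAP",
           "00000000000102019000000000000000000049491000000000023GSP",
           "00000000000102019000000000000000000049491000000000030MUD")

mapping <- data.frame(columns = c("COL1", "COL2", "COL3"),
                      start = c(1, 13, 18),
                      end = c(2, 17, 22),
                      stringsAsFactors = FALSE)

# Characters to skip before each field, and the width of each field.
gaps <- mapping$start - c(0, head(mapping$end, -1)) - 1
field_w <- mapping$end - mapping$start + 1

# Interleave (skip, field) pairs; negative widths tell read.fwf() to skip.
widths <- as.vector(rbind(-gaps, field_w))
widths <- widths[widths != 0]   # drop zero-length skips

# colClasses = "character" keeps leading zeros such as "02019".
tab <- read.fwf(textConnection(lines), widths = widths,
                col.names = mapping$columns, colClasses = "character")
```

With a real file, the `textConnection(lines)` argument would be replaced by the file path. `read.fwf()` can be slow on large inputs, so for 200,000 rows it may be worth timing it against the `substr()` approach.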
