Sequence of numbers by hyphen without hyphenating single occurrences

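I want to generate a human-readable number sequence (e.g. 1, 2, 3, 4 = 1-4), but for a dataset where every number in the sequence must have four digits (e.g. 99 = 0099, 1 = 0001, 1022 = 1022) AND where each number is preceded by a different letter prefix.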

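I have been looking at the answers to this question, which do almost exactly what I want, but with two caveats:

  1. If a single number is not part of a run, it appears twice with a hyphen in between
  2. If several standalone numbers in a row are not part of a run, they are not included in the result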
### Create Data Set ====
## Create the data for different tags. I'm only using two unique levels here, but in my dataset I've got
## 400+ unique levels.
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))

## Combine data
my.seq1 <- c(FM, SC)

## Sort data by number in sequence
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]

### Attempt Number Sequencing ====
## Get the letters
sp.tags <- substr(my.seq1, 1, 2)

## Get the readable number sequence
my.seq2 <- lapply(split(my.seq1, sp.tags), ## Split data by the tag ID; keep the result for the steps below
       function(x){
  
  ## Get the run lengths as per [previous answer][1]
  rl <- rle(c(1, pmin(diff(as.numeric(substr(x, 3, 7))), 2)))
  
  ## Generate number sequence by separator as per [previous answer][1]
  seq2 <- paste0(x[c(1, cumsum(rl$lengths))], c("-", ",")[rl$values], collapse="")
  
  return(substr(seq2, 1, nchar(seq2)-1))
})

## Combine lists and sort elements
my.seq2 <- unlist(strsplit(do.call(c, my.seq2), ","))
my.seq2 <- my.seq2[order(substr(my.seq2, 3, 7))]
names(my.seq2) <- NULL

my.seq2
[1] "FM0001-FM0001" "SC0002-SC0004" "FM0016-FM0019" "FM0028" "SC0039"

my.seq1
[1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021"
[13] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"

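The main problems are:

  1. Some values are missing from the output entirely (e.g. FM0021, FM0024, FM0026)
  2. The first number in the sequence (FM0001) is repeated with a hyphen in between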
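
Tracing the run-length encoding on the FM numbers alone may help show where both problems come from. This is only an illustrative sketch using the same `rle` expression as above, with the numbers typed in by hand:

```r
# FM numbers only, copied from the data above
x <- c(1, 16, 17, 18, 19, 21, 24, 26, 28)

# Gaps capped at 2 (1 = consecutive, 2 = any larger jump), with a 1 prepended
steps <- c(1, pmin(diff(x), 2))
rl <- rle(steps)

rl$values                   # 1 2 1 2
rl$lengths                  # 1 1 3 4
ends <- cumsum(rl$lengths)  # 1 2 5 9
x[c(1, ends)]               # 1 1 16 19 28
```

The prepended 1 forms a run of its own, so index 1 is selected twice, which gives "FM0001-FM0001". The jumps to 21, 24, 26 and 28 collapse into a single run of 2s, so only the last element of that run (28) is kept.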

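I felt I was getting warmer with A5C1D2H2I1M1N2O1R2T1's answer, which uses seqToHumanReadable, because it is very elegant and solves both of these problems. Two problems remain: I cannot attach the ID in front of each number, and I cannot force the numbers to four digits (e.g. 0004 becomes 4).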

library(R.utils)

lapply(split(my.seq1, sp.tags), function(x){
  return(unlist(strsplit(seqToHumanReadable(substr(x, 3, 7)), ',')))
})

$FM
[1] "1"      " 16-19" " 21"    " 24"    " 26"    " 28"   

$SC
[1] "2-4" " 10" " 12" " 14" " 33" " 36" " 39"
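
For the two remaining issues (the lost prefix and the lost zero padding), base R's `sprintf` can rebuild the labels once the numeric endpoints of a run are known. A minimal sketch, where `label` is a hypothetical helper name and the width of four digits is assumed from the data above:

```r
# Hypothetical helper: reattach the tag and zero-pad the number to four digits
label <- function(tag, num) {
  sprintf("%s%04d", tag, as.integer(num))
}

label("SC", c(2, 4))                          # "SC0002" "SC0004"
paste(label("SC", c(2, 4)), collapse = "-")   # "SC0002-SC0004"
```

This only covers the formatting step; the endpoints of each run still have to be extracted from the `seqToHumanReadable` output or computed some other way.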

The ideal result would be:

"FM0001, SC0002-SC0004, SC0010, SC0012, SC0014, FM0016-FM0019, FM0021, FM0024, FM0026, FM0028, SC0033, SC0036, SC0039"

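Any ideas? This is one of those things that is very simple to do by hand but takes far too long. You would think a function for it exists, but either I have not found it or it does not exist :(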

Would this do?

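# get the prefix/tag and number
# (backslashes must be doubled inside R strings; [A-Za-z] avoids the
#  punctuation characters that the range [A-z] would also match)
tag <- gsub("(^[A-Za-z]+)(.+)", "\\1", my.seq1)
num <- gsub("([A-Za-z]+)(\\d+$)", "\\2", my.seq1)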

# get a sequence id
n <- length(tag)
do_match <- c(FALSE, diff(as.numeric(num)) == 1 & tag[-1] == tag[-n])
seq_id <- cumsum(!do_match) # a sequence id

# tapply to combine the result
res <- setNames(tapply(my.seq1, seq_id, function(x)
  if(length(x) < 2)
    return(x)
  else
    paste(x[1], x[length(x)], sep = "-")), NULL)

# show the result
res
#R>  [1] "FM0001"        "SC0002-SC0004" "SC0010"        "SC0012"        "SC0014"        "FM0016-FM0019" "FM0021"       
#R>  [8] "FM0024"        "FM0026"        "FM0028"        "SC0033"        "SC0036"        "SC0039"

# compare with 
my.seq1
#R>  [1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021" "FM0024"
#R> [14] "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
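
To get the single comma-separated string from the question, the same steps could be wrapped in a function and the pieces pasted together. This is only a sketch: `collapse_runs` and its arguments are hypothetical names, not part of any package, and it assumes every element is a letter prefix followed by digits.

```r
# Hypothetical helper: collapse runs of consecutive, same-tag IDs into "first-last"
collapse_runs <- function(x, sep = ", ") {
  tag <- sub("^([A-Za-z]+)\\d+$", "\\1", x)
  num <- as.numeric(sub("^[A-Za-z]+(\\d+)$", "\\1", x))
  n <- length(x)
  # TRUE where an element continues the run of its predecessor
  continues <- c(FALSE, diff(num) == 1 & tag[-1] == tag[-n])
  run_id <- cumsum(!continues)
  pieces <- tapply(x, run_id, function(run) {
    if (length(run) < 2) run else paste(run[1], run[length(run)], sep = "-")
  })
  paste(pieces, collapse = sep)
}

ids <- c("FM0001", "SC0002", "SC0003", "SC0004", "SC0010", "FM0016", "FM0017")
collapse_runs(ids)
# "FM0001, SC0002-SC0004, SC0010, FM0016-FM0017"
```

Like the code above, this only joins neighbours, so the input has to be sorted in such a way that consecutive IDs with the same tag sit next to each other.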

Data

FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
my.seq1 <- c(FM, SC)
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]