
merge data frame rows by string parse


             uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                         "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                         "01/08/2015 2:59:19 pm: Person 1: Same here"))


conversation_errors <- data.frame(
    uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
                "lend me your arms,",
                "fast as thunderbolts,",
                "for a pillow on my journey."))


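The desired function would simply identify rows that lack the expected structure and "merge" each one into the preceding row, so that I would get: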

                    uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                                "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                                "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms, fast as thunderbolts, for a pillow on my journey."))


Assuming you can reliably identify well-formed rows by their timestamp (expressed below as properDataRegex), this will do it:

mydata <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
            "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
            "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
            "lend me your arms,",
            "fast as thunderbolts,",
            "for a pillow on my journey.",
            "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method",
            "but it will get the job done.")

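# A well-formed line starts with a dd/mm/yyyy date followed by whitespace.
# Backslashes must be doubled inside an R string literal.
properDataRegex <- "^\\d{2}/\\d{2}/\\d{4}\\s"
improperDataBool <- !grepl(properDataRegex, mydata)
# Note: this assumes the first element of mydata is well-formed.
while (sum(improperDataBool)) {
    # Indices of malformed lines whose previous line is well-formed
    mergeWPrevIndex <- which(c(FALSE, !improperDataBool[-length(improperDataBool)]) &
                             improperDataBool)
    # Append each of them to the line before it, then drop it
    mydata[mergeWPrevIndex - 1] <-
        paste(mydata[mergeWPrevIndex - 1], mydata[mergeWPrevIndex])
    mydata <- mydata[-mergeWPrevIndex]
    improperDataBool <- !grepl(properDataRegex, mydata)
}
mydata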

## [1] "01/08/2015 2:49:49 pm: Person 1: Hello"                                                                                                    
## [2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you"                                                                                         
## [3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."
## [4] "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method but it will get the job done."

Here mydata is a character vector, but of course it is now easy to turn it into a data.frame as you did in the question, or to read it with read.table() or read.fwf().
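For longer inputs, the same merge can be sketched without a loop by assigning every line to the group opened by the most recent timestamped line. This is an alternative to the loop, not the method given in the answer; `merge_continuations` and `conversation_fixed` are hypothetical names, and the sketch assumes the first line is well-formed.

```r
# Hypothetical helper: collapse continuation lines into the preceding
# timestamped line. Assumes the first element of `lines` is well-formed.
merge_continuations <- function(lines, pattern = "^\\d{2}/\\d{2}/\\d{4}\\s") {
  starts <- grepl(pattern, lines)  # TRUE where a new message begins
  group  <- cumsum(starts)         # running message number for every line
  # Paste the lines of each group together, keeping their original order
  unname(vapply(split(lines, group), paste, character(1), collapse = " "))
}

raw <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
         "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
         "lend me your arms,",
         "fast as thunderbolts,")

merged <- merge_continuations(raw)

# Wrap the result in a data.frame, as in the question
conversation_fixed <- data.frame(uniquerow = merged, stringsAsFactors = FALSE)
```

Because `cumsum()` only increases at a timestamped line, any number of consecutive continuation lines ends up in the same group in a single pass.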


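# Prefix each timestamped line with a newline, glue everything into one
# string, then let read.table() split it back into one row per message.
read.table(text=paste(gsub("(^\\d{2}/\\d{2}/\\d{4}\\s)", "\n\\1", conversation_errors$uniquerow),
                      collapse = " "), sep = "\n", stringsAsFactors = FALSE)[,1]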


[1] "01/08/2015 2:49:49 pm: Person 1: Hello "                                                                                                   
[2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you "                                                                                        
[3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."

(Thanks to Ken for the borrowed regular expression.)
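To see why the one-liner works, the steps can be run separately on a small input. This is a sketch under assumptions: the vector `x` is made up for illustration, and `strsplit()` stands in for `read.table()` in the last step, which sidesteps `read.table()`'s default handling of quote characters and `#` comments in the message text.

```r
# Small made-up input for illustration
x <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
       "01/08/2015 2:59:19 pm: Person 1: Same here:",
       "lend me your arms,")

# Step 1: prefix every timestamped line with a newline
marked <- gsub("(^\\d{2}/\\d{2}/\\d{4}\\s)", "\n\\1", x)

# Step 2: glue everything into one string; continuation lines have no
# leading newline, so they stay attached to the message before them
joined <- paste(marked, collapse = " ")

# Step 3: split on the newlines again
pieces <- strsplit(joined, "\n", fixed = TRUE)[[1]]
pieces <- pieces[nzchar(pieces)]  # drop the empty piece before the first "\n"
```

As in the output above, messages that are followed by another line keep a trailing space from `collapse = " "`; `trimws(pieces)` removes it if that matters.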