Merge data frame rows by string parsing
I am trying to import a conversation with the following structure into a data frame:
conversation<-data.frame(
uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
"01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
"01/08/2015 2:59:19 pm: Person 1: Same here"))
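This structure would make it relatively easy to parse the date, time, person, and message. However, some messages contain line breaks, so the data frame ends up with the wrong structure, like this: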
conversation_errors<-data.frame(
uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
"01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
"01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
"lend me your arms,",
"fast as thunderbolts,",
"for a pillow on my journey."))
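How would you merge these cases? Is there a package I am not aware of?
The desired function would simply recognize the rows that lack the structure and "merge" them with the previous row, so that I would get: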
conversation_fixed<-data.frame(
uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
"01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
"01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms, fast as thunderbolts, for a pillow on my journey."))
Any ideas?
Assuming you can use the timestamp (expressed below as properDataRegex) to reliably identify the correctly structured lines, this will do it:
mydata <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
"01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
"01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
"lend me your arms,",
"fast as thunderbolts,",
"for a pillow on my journey.",
"07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method",
"but it will get the job done.")
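# Lines that start with a dd/dd/dddd date are treated as well-formed.
properDataRegex <- "^\\d{2}/\\d{2}/\\d{4}\\s"
improperDataBool <- !grepl(properDataRegex, mydata)
# Each pass merges the first continuation line of every run into the line before it.
while (sum(improperDataBool)) {
  # continuation lines whose predecessor is a well-formed line
  mergeWPrevIndex <- which(c(FALSE, !improperDataBool[-length(improperDataBool)]) &
                             improperDataBool)
  mydata[mergeWPrevIndex - 1] <-
    paste(mydata[mergeWPrevIndex - 1], mydata[mergeWPrevIndex])
  mydata <- mydata[-mergeWPrevIndex]
  improperDataBool <- !grepl(properDataRegex, mydata)
}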
mydata
## [1] "01/08/2015 2:49:49 pm: Person 1: Hello"
## [2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you"
## [3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms, fast as thunderbolts, for a pillow on my journey."
## [4] "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method but it will get the job done."
Here, mydata is a character vector, but of course you can now turn it into a data.frame as you did in your question, or read it in with read.table() or read.fwf().
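Once the lines are merged, splitting them into columns could look like the sketch below. It assumes every merged line follows the layout "date time am/pm: person: message" seen in the question; the names `fixed`, `pattern`, and `parsed` are illustrative and not part of the original answer. The date is left as a string because the sample data does not show whether it is day/month or month/day.

```r
# Merged lines, copied here so the sketch runs on its own
fixed <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
           "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
           "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms")

# Assumed layout: "<date> <time> <am/pm>: <person>: <message>"
# The person field may not contain a colon; the message may.
pattern <- "^(\\d{2}/\\d{2}/\\d{4}) (\\d{1,2}:\\d{2}:\\d{2} [ap]m): ([^:]+): (.*)$"

# regexec() returns the full match plus one entry per capture group
parts <- regmatches(fixed, regexec(pattern, fixed))

parsed <- data.frame(
  date    = sapply(parts, `[`, 2),
  time    = sapply(parts, `[`, 3),
  person  = sapply(parts, `[`, 4),
  message = sapply(parts, `[`, 5),
  stringsAsFactors = FALSE
)
parsed
```

A line that does not match the pattern yields NA in every column, which makes leftover malformed rows easy to spot.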
Here is another approach:
read.table(text = paste(gsub("(^\\d{2}/\\d{2}/\\d{4}\\s)", "\n\\1", conversation_errors$uniquerow),
                        collapse = " "), sep = "\n", stringsAsFactors = FALSE)[, 1]
Which gives:
[1] "01/08/2015 2:49:49 pm: Person 1: Hello "
[2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you "
[3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms, fast as thunderbolts, for a pillow on my journey."
(Thanks to Ken for the borrowed regular expression)
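A loop-free variant of the same idea is also possible. The sketch below is not from either answer: it flags the lines that start with a timestamp, uses cumsum() to give every continuation line the group number of the preceding timestamped line, and pastes each group together. The names `lines`, `is_start`, `group`, and `merged` are illustrative, and trimws() (R 3.2.0 or later) is applied first so that trailing spaces do not produce double spaces.

```r
lines <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
           "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
           "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
           "lend me your arms,",
           "fast as thunderbolts,",
           "for a pillow on my journey.",
           "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method",
           "but it will get the job done.")

# TRUE for lines that begin a new message (same timestamp test as above)
is_start <- grepl("^\\d{2}/\\d{2}/\\d{4}\\s", lines)

# Running count of message starts: continuation lines inherit the
# group number of the message they belong to
group <- cumsum(is_start)

# Paste each group into a single string
merged <- unname(vapply(split(trimws(lines), group),
                        paste, character(1), collapse = " "))
merged
```

If the very first line is a continuation line, it lands in group 0 and is kept as its own element rather than being dropped.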