Extract words from a factor in R
I have a dataset like this:
df <- data.frame(
  text = c("Update AV Line 204 to Los Angeles will be ...",
           "91 Line 700 to RiversideDowntown is delayed 15 minutes ...",
           "VC Line 102 to Los Angeles is delayed 1520 minutes ...",
           "Update AV Line 227 to Lancaster is terminated Via Princessa ",
           "RIV Line 411 to Los Angeles is delayed 10 minutes ...",
           "SB Line 312 to San Bernardino is delayed up to ...",
           "SB Line 327 to Los Angeles is delayed up to 15..."),
  stringsAsFactors = TRUE)
df
I need to extract the key words into a new column, so that the final result looks like this:
> df
  text                                                         LinesExtracted
1 Update AV Line 204 to Los Angeles will be ...                Line 204 to Los Angeles
2 91 Line 700 to RiversideDowntown is delayed 15 minutes ...   Line 700 to Riverside Downtown
3 VC Line 102 to Los Angeles is delayed 1520 minutes ...       Line 102 to Los Angeles
4 Update AV Line 227 to Lancaster is terminated Via Princessa  Line 227 to Lancaster
5 RIV Line 411 to Los Angeles is delayed 10 minutes ...        Line 411 to Los Angeles
6 SB Line 312 to San Bernardino is delayed up to ...           Line 312 to San Bernardino
7 SB Line 327 to Los Angeles is delayed up to 15...            Line 327 to Los Angeles
Thanks.
Since regular expressions can be hard to read, I split the extraction into several steps:
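# Drop everything before "Line" (gsub coerces the factor to character)
df$LinesExtracted <- gsub("^.*Line", "Line", df$text)
# Drop " will be ..." and everything after it
df$LinesExtracted <- gsub(" will be .*$", "", df$LinesExtracted)
# Drop " is ..." and everything after it
df$LinesExtracted <- gsub(" is .*$", "", df$LinesExtracted)
# Split run-together words such as "RiversideDowntown"; backreferences need doubled backslashes in R strings
df$LinesExtracted <- gsub("([a-z])([A-Z])", "\\1 \\2", df$LinesExtracted, perl = TRUE)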
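If you would rather have fewer passes, the first three steps can probably be folded into one pattern with a capture group. The following is only a sketch under a few assumptions: every row contains "Line <number> to <destination>" followed by " will be " or " is ", and the helper name extract_line is made up for this example (it is not a base R function). Rows that do not match are returned unchanged by sub().

```r
# Sample data copied from the question; the text column is a factor.
df <- data.frame(
  text = c("Update AV Line 204 to Los Angeles will be ...",
           "91 Line 700 to RiversideDowntown is delayed 15 minutes ...",
           "VC Line 102 to Los Angeles is delayed 1520 minutes ...",
           "Update AV Line 227 to Lancaster is terminated Via Princessa ",
           "RIV Line 411 to Los Angeles is delayed 10 minutes ...",
           "SB Line 312 to San Bernardino is delayed up to ...",
           "SB Line 327 to Los Angeles is delayed up to 15..."),
  stringsAsFactors = TRUE)

# Hypothetical helper (name chosen for this example only).
extract_line <- function(x) {
  # Make the factor-to-character conversion explicit.
  x <- as.character(x)
  # Capture "Line <number> to <destination>" and discard the text before it
  # and everything from " will be " or " is " onward.
  out <- sub("^.*(Line \\d+ to .*?)\\s+(will be|is)\\s.*$", "\\1", x, perl = TRUE)
  # Insert a space wherever a lowercase letter is directly followed by an
  # uppercase one, e.g. "RiversideDowntown".
  gsub("([a-z])([A-Z])", "\\1 \\2", out, perl = TRUE)
}

df$LinesExtracted <- extract_line(df$text)
df
```

The single pattern is shorter but harder to read than the step-by-step version, so which one to keep is mostly a matter of taste.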