tm_map: merging lines on a condition
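I extracted text from a PDF file and created a corpus object.
Some lines in the text end with "," or "-", and I want to append the following line to each of them, because it belongs to the same sentence.
For example, I have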
[1566] "this and other southeastern states (Eukerria saltensis,"
[1567] "Sparganophilus helenae, Sp. tennesseensis). In the"
and I would like to get this instead:
[1566] "this and other southeastern states (Eukerria saltensis, Sparganophilus helenae, Sp. tennesseensis). In the"
I tried things like replacing the newline character, but without success:
tm_map(myCorpus, content_transformer(gsub), pattern =",$\n",replacement = "")
Any idea how to do this in R?
Here is one approach, based on your idea of splitting on newlines...
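txt <- c("aaa","bbc,","df","fgh-","jkh-","dfsf","gghf")
# collapse all lines into one string so line breaks become visible to gsub
txt2 <- paste0(txt,collapse="\n")
# a comma at the end of a line: join with the next line, separated by a space
txt2 <- gsub(",\n",", ",txt2)
# a hyphen at the end of a line: join directly ("\-" is not a valid escape in an R string)
txt2 <- gsub("-\n","-",txt2)
# split back into lines on the remaining newlines
txt2 <- unlist(strsplit(txt2,"\n"))
txt2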
[1] "aaa" "bbc, df" "fgh-jkh-dfsf" "gghf"
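As for why the attempt in the question has no effect, a likely explanation is that each line is a separate element of a character vector. gsub then works on one element at a time and never sees a line break. The short sketch below illustrates this with a made-up two-element vector (`x` is only an illustration, not the real corpus content):

```r
# Illustrative input: two lines held as separate vector elements
x <- c("bbc,", "df")

# gsub is applied to each element on its own; neither element contains "\n",
# so the pattern from the question matches nothing
unchanged <- gsub(",$\n", "", x)
print(identical(unchanged, x))

# Once the elements are collapsed into one string, the line break is present
collapsed <- paste0(x, collapse = "\n")
print(grepl(",\n", collapsed, fixed = TRUE))
```

This is why the approach above collapses the lines first and splits them again at the end.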
Thanks, it works!
I did have to wrap it in a function to make it work with tm_map, though:
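clean.X <- function(X){
  # X is the character vector of lines of one document
  X2 <- paste0(X,collapse="\n")
  X2 <- gsub(",\n",", ",X2)
  # "-\n" rather than "\-\n": "\-" is not a valid escape in an R string
  X2 <- gsub("-\n","-",X2)
  X2 <- unlist(strsplit(X2,"\n"))
  return(X2)
}
# txt is assumed to be the corpus object here
txt2 <- tm_map(txt, content_transformer(clean.X))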
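To check the merging logic without a corpus, the same steps can be run on a plain character vector. The following is a minimal sketch, not the poster's exact code: the helper name `merge_continued_lines` is hypothetical, and it uses `fixed = TRUE` so the patterns are treated as literal text rather than regular expressions.

```r
# Hypothetical helper: join every line ending in "," or "-" with the line after it
merge_continued_lines <- function(lines) {
  joined <- paste(lines, collapse = "\n")
  # comma at end of line: keep the comma, replace the line break with a space
  joined <- gsub(",\n", ", ", joined, fixed = TRUE)
  # hyphen at end of line: keep the hyphen, drop the line break
  joined <- gsub("-\n", "-", joined, fixed = TRUE)
  strsplit(joined, "\n", fixed = TRUE)[[1]]
}

# The two lines from the question
lines <- c(
  "this and other southeastern states (Eukerria saltensis,",
  "Sparganophilus helenae, Sp. tennesseensis). In the"
)
merged <- merge_continued_lines(lines)
print(merged)
```

Assuming each document's content is a character vector of lines, a helper like this could be passed to tm_map through content_transformer in the same way as clean.X above.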