gsub function in the tm package to remove URLs does not remove the entire string

Question

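I am using this function in a script built on the R text mining package (tm) to remove URLs from tweets. To my surprise, after cleaning there are leftover "http" words and fragments of the URLs themselves (e.g. t.co). It looks like some URLs are removed completely while others are merely broken into their components. What could be the reason? Note: I took the . out of the t.co URLs below, because Stack Overflow does not allow posting URLs that point to t.co addresses.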

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
removeURL <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)

Text before cleaning

VOTE TODAY! Go to https://tco/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://tco/KPQ5EY9VwQ

Text after cleaning

vote today go https tco mxraxyntjy find polling location going make america great https tco kpqeyvwq

Answer 1

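You are removing the symbols that your removeURL function is looking for. The toSpace call for "/" runs first, so by the time removeURL runs the text no longer contains "://" and the pattern cannot match. Also, you need to wrap the function in content_transformer() so that tm_map receives a proper transformer. Here is a working example that removes the URLs first and uses a different regular expression (it stops at the first whitespace character):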

library(tm)
test <- "VOTE TODAY! Go to https://t.com/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://t.com/KPQ5EY9VwQ"

trumpcorpus1020to1109 <- VCorpus(VectorSource(test))
# Remove URLs first, while "://" is still intact; \\S+ matches up to the next whitespace
removeURL <- content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", x, perl = TRUE))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)
# Only then replace the remaining symbols with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
content(trumpcorpus1020to1109[[1]])
# [1] "VOTE TODAY! Go to  to find your polling location. We are going to Make America Great Again!… "
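
The effect of the ordering can also be seen without tm at all. The following is only a minimal sketch in base R, assuming a shortened version of the test tweet; the variable names (tweet, url_pattern, slashes_first, urls_first) are made up for illustration and are not part of the original code.

```r
# Base R only; the tm package is not needed to see the effect.
# `tweet` is a shortened, hypothetical version of the test string above.
tweet <- "Go to https://t.com/KPQ5EY9VwQ to find your polling location."
url_pattern <- "(f|ht)tp(s?)://\\S+"

# Order used in the question: "/" is replaced first, so "://" no longer
# exists when the URL pattern runs and the pattern finds nothing to remove.
slashes_first <- gsub(url_pattern, "", gsub("/", " ", tweet), perl = TRUE)

# Order used in the answer: the URL is removed while it is still intact,
# and only afterwards are the remaining "/" characters replaced.
urls_first <- gsub("/", " ", gsub(url_pattern, "", tweet, perl = TRUE))

print(slashes_first)  # still contains "https:" and the URL fragments
print(urls_first)     # the URL is gone
```

With the slashes replaced first, the fragments "https:", "t.com" and "KPQ5EY9VwQ" survive as separate words, which matches the leftovers seen in the cleaned tweets once later steps strip the punctuation.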

r

tm