Text mining with the tm package in R: removing words that start with [http] or any other specific word

Question

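I am new to R and text mining. I built a word cloud from a Twitter feed related to a particular term. The problem is that the word cloud shows http:... or htt... How do I deal with this? I tried using the metacharacter *, but I doubt that I am applying it correctly.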

tw.text = removeWords(tw.text,c(stopwords("en"),"rt","http\*"))

Could anyone familiar with text mining help me with this?

Answer 1

If you want to remove URLs from a string, you can use:

gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)

where x would be:

x <- c("some text http://idontwantthis.com", 
         "same problem again http://pleaseremoveme.com")

It would be easier to give a specific answer if you could post a sample of your data, but the following example gives you clean text without URLs:

> clean_x <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
> clean_x
[1] "some text "          "same problem again "
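One caveat with the pattern above: `(.*)` is greedy, so when more text follows the URL in the same string, everything up to the last `.` followed by lowercase letters is removed as well. A minimal sketch of an alternative, assuming URLs never contain whitespace, matches non-space characters instead. The sample strings and the names `y`, `greedy`, and `tight` are made up for illustration:

```r
# Made-up sample strings with text after the URL
y <- c("read http://example.com/page today at example.org",
       "two links http://a.com and https://b.org here")

# Pattern from the answer: (.*) runs on to the last ".letters" in the string
greedy <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", y)

# Alternative (an assumption, not part of the answer): \\S+ stops at the
# first whitespace, so only the URL token itself is removed
tight <- gsub("(f|ht)tps?://\\S+", "", y)

print(greedy)
print(tight)
```

With the greedy pattern the words between and after the URLs are lost, while the `\\S+` variant keeps them. Which one is appropriate depends on what the tweets look like.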

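As a side note, it may be worth searching for existing text-cleaning methods before you start mining. For example, the clean function discussed here would let you do this automatically. Similarly, there are functions that strip tweet-specific text (#, @), punctuation, and other unwanted entries.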
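As a rough illustration of that kind of cleanup, here is a minimal base R sketch. `clean_tweet` is a hypothetical helper written for this example, not a function from tm or any other package, and the sample tweets are made up:

```r
# Hypothetical helper: strips common tweet noise using base R only
clean_tweet <- function(txt) {
  txt <- gsub("(f|ht)tps?://\\S+", " ", txt)  # URLs first, before punctuation is touched
  txt <- gsub("[@#]\\S+", " ", txt)           # @mentions and #hashtags
  txt <- gsub("[[:punct:]]+", " ", txt)       # remaining punctuation
  txt <- gsub("\\s+", " ", txt)               # collapse runs of whitespace
  trimws(txt)                                 # drop leading/trailing spaces
}

tweets <- c("RT @user: Great read! http://example.com/x #rstats",
            "Text mining, in R... https://example.org")
print(clean_tweet(tweets))
```

The order of the steps matters: removing punctuation before URLs would break `http://` apart and leave fragments of the address behind.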

Answer 2

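Apply the following code to the corpus to replace a string pattern with a space. The string pattern can be a URL or any term you want to remove from the word cloud, for example terms starting with https.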

toSpace = content_transformer( function(x, pattern) gsub(pattern," ",x) )

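# Replace "http" or "https" with a space (the rest of the URL is left in place)
tweet_corpus_clean = tm_map( tweet_corpus, toSpace, "https*")

# Alternatively, replace whole URLs with a space
tweet_corpus_clean = tm_map( tweet_corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")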
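It may help to see what each pattern actually matches. The regular expression `"https*"` matches only the literal prefix `http` followed by zero or more `s`, so the remainder of the address stays in the text. A minimal base R sketch of the difference, assuming the goal is to drop the whole whitespace-delimited token that starts with `http`. Here `to_space` mirrors the body of `toSpace` without `content_transformer`, and the sample string is made up:

```r
# Same replacement logic as toSpace, applied to a plain character vector
to_space <- function(x, pattern) gsub(pattern, " ", x)

tweet <- "new post https://example.com/abc worth reading"

# Strips only the "https" prefix; "://example.com/abc" remains
prefix_only <- to_space(tweet, "https*")

# Strips "http" plus every following non-space character
whole_token <- to_space(tweet, "http\\S*")

print(prefix_only)
print(whole_token)
```

If this behaves as intended on your data, the same pattern string could be passed as the third argument of `tm_map(tweet_corpus, toSpace, ...)`.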