R tm package's `removeWords` not removing twitter hashtags from tweets due to #

Question

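I am trying to remove hashtags from tweets using tm's removeWords function. As you know, hashtags start with #, and I want to remove these tags entirely. However, removeWords does not remove them: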

> library(tm)
> removeWords(x = "WOW it is cool! #Ht https://google.com", words = c("#Ht", "https://google.com"))

[1] "WOW it is cool! #Ht "

If I drop the # from the words argument, the tag text is removed:

> removeWords(x = "WOW it is cool! #Ht https://google.com", words = c("Ht", "https://google.com"))
[1] "WOW it is cool! # "

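This leaves an orphaned # behind.

Why does this happen? Shouldn't the function simply remove the words, or am I missing something? The manual is not very helpful here.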

Answer 1

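Unfortunately, I can't think of a great way around it. The reason behind what you're seeing is that removeWords relies on a regular expression with word boundaries. "#" does not count as a word character, so there is no word boundary between the preceding space and the "#", and the pattern for "#Ht" never matches. I'd love to see a better answer with a nice workaround, but you may just have to do something simple: make an initial pass that replaces "#" with some keyword, then use that keyword in place of the hashtag symbol when you build the list of words to remove.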
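As a minimal sketch of that workaround (the placeholder keyword `HASHTAGPLACEHOLDER` and the variable names are assumptions, not part of tm), the following uses base R only and builds the same pattern that Answer 4 quotes from removeWords. With tm loaded, `removeWords(x2, words2)` should behave the same way.

```r
x <- "WOW it is cool! #Ht https://google.com"
words <- c("#Ht", "https://google.com")

# "\b" needs a word character on exactly one side. The space and "#" are
# both non-word characters, so there is no boundary in front of "#Ht".
has_boundary_match <- grepl("\\b#Ht\\b", x, perl = TRUE)

# Workaround: swap "#" for a placeholder made only of word characters,
# in both the text and the removal list.
# Hypothetical keyword: pick one that cannot occur in your data.
placeholder <- "HASHTAGPLACEHOLDER"
x2 <- gsub("#", placeholder, x, fixed = TRUE)
words2 <- gsub("#", placeholder, words, fixed = TRUE)

# Same pattern shape that removeWords() builds internally (see Answer 4)
pattern <- sprintf("(*UCP)\\b(%s)\\b",
                   paste(sort(words2, decreasing = TRUE), collapse = "|"))
cleaned <- gsub(pattern, "", x2, perl = TRUE)

# Restore "#" for hashtags that were not on the removal list
cleaned <- gsub(placeholder, "#", cleaned, fixed = TRUE)
cleaned
```

The removed tokens leave their surrounding spaces behind, so a final whitespace clean-up (for example `trimws()`) may still be wanted.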

Answer 2

This does not use the tm package, but stringr:

library(stringr)

replaceHashtags <- function(str,tags)
{
  repl <- rep("",length(tags))
  names(repl) <- tags
  return(stringr::str_replace_all(str, repl))
}

ExStr <- "WOW it is cool! #Ht #tag2 https://google.com"
Extags <- c("#Ht","#tag2")
replaceHashtags(ExStr,Extags)

[1] "WOW it is cool!   https://google.com"

This removes every hashtag listed in tags from a single string. To apply it to multiple strings, just use sapply or similar.
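For several tweets at once, a short sketch (the `tweets` vector is made-up sample data): `str_replace_all()` is vectorised over its input, so the function can also take a character vector directly without sapply. Note that the tags are interpreted as regular expressions, so `"#Ht"` would also strip the start of a longer tag such as `"#Html"`.

```r
library(stringr)

replaceHashtags <- function(str, tags) {
  # named vector: names are the patterns, values are the replacements
  repl <- rep("", length(tags))
  names(repl) <- tags
  stringr::str_replace_all(str, repl)
}

# hypothetical sample data
tweets <- c("WOW it is cool! #Ht #tag2 https://google.com",
            "no tags here",
            "#Ht at the start")

cleaned <- replaceHashtags(tweets, c("#Ht", "#tag2"))

# optional: collapse the leftover runs of whitespace
tidy <- str_squish(cleaned)
tidy
```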

Answer 3

You can use functions from the textclean package to help with this.

library(textclean)
txt <- "WOW it is cool! #Ht https://google.com"

# remove twitter hashes
txt <- replace_hash(txt)
# remove urls
txt <- replace_url(txt)

txt
[1] "WOW it is cool!  "

To incorporate this into tm, call these functions through tm_map:

...
# after creating corpus
my_corpus <- tm_map(my_corpus, content_transformer(replace_hash))
my_corpus <- tm_map(my_corpus, content_transformer(replace_url))
....
# rest of code
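A self-contained version of the tm_map step might look like the sketch below. The sample `tweets` vector and the way the corpus is built (`VCorpus(VectorSource(...))`) are assumptions for illustration; the answer above does not say how `my_corpus` was created.

```r
library(tm)
library(textclean)

# hypothetical sample data
tweets <- c("WOW it is cool! #Ht https://google.com",
            "Another one #tag2")

# build a corpus from a character vector (assumed setup)
my_corpus <- VCorpus(VectorSource(tweets))

# wrap the textclean functions so they act on each document's content
my_corpus <- tm_map(my_corpus, content_transformer(replace_hash))
my_corpus <- tm_map(my_corpus, content_transformer(replace_url))

# pull the cleaned text back out to inspect it
cleaned <- sapply(my_corpus, content)
cleaned
```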

Answer 4

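What a nice question! It's a bit tricky: when you look at the source code of tm::removeWords(), you can see what it does:

gsub(sprintf("(*UCP)\\b(%s)\\b",
             paste(sort(words, decreasing = TRUE), collapse = "|")),
     "", x, perl = TRUE)

It works with word boundaries, as @Dason mentioned, which is why removing hashtags is so complicated. But you can take this as inspiration to build your own function: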

# some tweets
tweets <- rep("WOW it is cool! #Ht https://google.com", times = 1e5)
remove <- c("#Ht", "https://google.com")

# our new function takes not only word boundary from the left side,
# but also a white space or string beginning
removeWords2 <- function(x, words) {
  gsub(sprintf("(\\b|\\s|^)(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")), "", x)
}

# remove words
data <- removeWords2(tweets, remove)

# check that
head(data)
#> [1] "WOW it is cool!" "WOW it is cool!" "WOW it is cool!" "WOW it is cool!"
#> [5] "WOW it is cool!" "WOW it is cool!"

Created on 2020-07-17 by the reprex package (v0.3.0)

It is quite fast and works as expected, and you can adapt it to your own needs.
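As one possible adaptation (a sketch, not part of tm; the name `removeTokens` is made up here), the variant below treats each entry as a literal, whitespace-delimited token. It quotes the words with `\Q...\E` so that the "." in a URL matches only a literal dot, and uses lookarounds instead of `\b`, so "#" needs no special handling. The trade-off is that a tag followed directly by punctuation, such as "#Ht," is not removed.

```r
removeTokens <- function(x, words) {
  # \Q...\E makes PCRE treat each word literally (no regex metacharacters)
  quoted <- paste0("\\Q", words, "\\E")
  # token must not be preceded or followed by a non-space character
  pattern <- sprintf("(?<!\\S)(%s)(?!\\S)", paste(quoted, collapse = "|"))
  out <- gsub(pattern, "", x, perl = TRUE)
  # collapse leftover whitespace and trim the ends
  trimws(gsub("\\s+", " ", out))
}

tweets <- c("WOW it is cool! #Ht https://google.com",
            "#Html is a different tag")
result <- removeTokens(tweets, c("#Ht", "https://google.com"))
result
```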

R tm package's `removeWords` not removing twitter hashtags from tweets due to #

r

tm