How to remove odd characters in R wordcloud

Question

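I'm trying to build a word cloud in R using a corpus and various tm_map functions. The problem is that I keep getting this odd symbol back, the one with the euro sign and the inverted quotation mark. It is the second most frequent term in my corpus. (There are one or two others, but they are far less frequent, so they are less of a problem.)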

Word cloud with rogue €“

Is there any way to fix this?

Here is a sample of the text, in .txt format, before it is pulled into R:

The move to Virtual Replication 6 added replication in and out of AWS where that had only previously been one-way, into the Amazon cloud storage platform. It had taken longer to develop in AWS, said Zerto technology evangelist Gjisbert Janssen van Doorn. “Bi-directional replication to and from Azure was where we started. We try to develop natively via APIs for the clouds we support but that had taken longer with AWS.” Zerto has also added bi-directional replication with IBM Cloud. van Doorn said the company had no plan to add support for Google Cloud Platform. “It’s something we’re keeping an eye on. It’s on the wishlist rather than the roadmap,” he said.

And here is the result after it has been pulled into R via Corpus():

The move to Virtual Replication 6 added replication in and out of AWS where that had only previously been one-way, into the Amazon cloud storage platform.\n\nIt had taken longer to develop in AWS, said Zerto technology evangelist Gjisbert Janssen van Doorn. â€œBi-directional replication to and from Azure was where we started. We try to develop natively via APIs for the clouds we support but that had taken longer with AWS.â€\u009d\n\nZerto has also added bi-directional replication with IBM Cloud. van Doorn said the company had no plan to add support for Google Cloud Platform. â€œItâ€™s something weâ€™re keeping an eye on. Itâ€™s on the wishlist rather than the roadmap,â€\u009d he said.

Then I run this code:

# Convert the text to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove english common stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Remove punctuations
corpus <- tm_map(corpus, removePunctuation)
# Remove your own stop words, specified as a character vector
corpus <- tm_map(corpus, removeWords, c("new", "products", "way", "back",
                                    "can", "need", "also", "Ã¢", "look", "will", "one", "right",
                                    "move", "gorge", "mathieu", "like",
                                    "said", "€“", "â€“", "â", "data",
                                    "use", "storage"))
# Remove punctuations
corpus <- tm_map(corpus, removePunctuation)
# Eliminate extra white spaces
corpus <- tm_map(corpus, stripWhitespace)

After that, the same body of text looks like this:

virtual replication added replication aws previously oneway amazon cloud platform taken longer develop aws zerto technology evangelist gjisbert janssen van doorn €œbidirectional replication azure started try develop natively via apis clouds support taken longer awsâ€\u009d zerto added bidirectional replication ibm cloud van doorn company plan add support google cloud platform €œitâ€™s something weâ€™re keeping eye itâ€™s wishlist rather roadmap

So those tm_map functions have not got rid of all the junk, and the word cloud I build from this still contains it.

Is there any way around this?

Answer 1

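If you don't mind using an additional package, you can use the textclean package, which works well in combination with the tm functions. It contains a variety of useful functions for cleaning text that has odd characters, URLs, emoji and so on. For the sample text, you need the function replace_curly_quote to get rid of the ” and ’ characters, and replace_contraction to replace "it's" with "it is". See the working example below. After all this, you can use the wordcloud package to create the word cloud.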

txt <- "The move to Virtual Replication 6 added replication in and out of AWS where that had only previously been one-way, into the Amazon cloud storage platform. It had taken longer to develop in AWS, said Zerto technology evangelist Gjisbert Janssen van Doorn. “Bi-directional replication to and from Azure was where we started. We try to develop natively via APIs for the clouds we support but that had taken longer with AWS.” Zerto has also added bi-directional replication with IBM Cloud. van Doorn said the company had no plan to add support for Google Cloud Platform. “It’s something we’re keeping an eye on. It’s on the wishlist rather than the roadmap,” he said."

library(tm)
library(textclean)

corpus <- VCorpus(VectorSource(txt))
corpus <- tm_map(corpus, content_transformer(tolower))

# textclean function that replaces the curly quotes ” and ’ with straight ones
corpus <- tm_map(corpus, replace_curly_quote)
# textclean function that expands contractions, e.g. "it's" to "it is"
corpus <- tm_map(corpus, replace_contraction)

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# Remove numbers
corpus <- tm_map(corpus, removeNumbers)

# Remove english common stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))

my_stopwords <- c("new", "products", "way", "back", "can", "need", "also", 
                  "look", "will", "one", "right","move", "gorge", "mathieu", 
                  "like", "said", "data","use", "storage")

corpus <- tm_map(corpus, removeWords, my_stopwords)

# Remove the extra whitespace created by the steps above
corpus <- tm_map(corpus, stripWhitespace)

content(corpus)
[[1]]
[1] " virtual replication added replication aws previously oneway amazon cloud platform taken longer develop aws zerto technology evangelist gjisbert janssen van doorn bidirectional replication azure started try develop natively via apis clouds support taken longer aws zerto added bidirectional replication ibm cloud van doorn company plan add support google cloud platform something keeping eye wishlist rather roadmap "
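
For the final step mentioned above, here is a minimal sketch of going from cleaned text to a word cloud. It is only an illustration under some assumptions: the `cleaned` string is a short hypothetical excerpt of the output shown above, the word counts are computed in base R rather than with a tm term-document matrix, and the plot is only drawn if the wordcloud package happens to be installed.

```r
# Hypothetical excerpt of the cleaned text; in practice this would be the
# character content of the cleaned corpus.
cleaned <- " virtual replication added replication aws previously oneway amazon cloud platform taken longer develop aws "

# Split on whitespace and count how often each word occurs
words <- strsplit(trimws(cleaned), "\\s+")[[1]]
freq <- sort(table(words), decreasing = TRUE)

# Draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(1234)  # make the random layout reproducible
  wordcloud::wordcloud(words = names(freq), freq = as.numeric(freq),
                       min.freq = 1, random.order = FALSE)
}
```

On a real corpus you would probably raise min.freq or set max.words so that rare terms do not clutter the plot.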

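A note on the likely cause: sequences such as â€œ and â€™ are what the UTF-8 bytes of the curly quotes “ and ’ look like when they are decoded with a single-byte encoding such as Windows-1252. If that is what happened here, it may be simpler to declare the encoding when reading the file than to strip the debris afterwards. The sketch below is an assumption-laden illustration, not the asker's actual setup: it writes a made-up sample to a temporary file as UTF-8, reads it back with the encoding declared, and then swaps the curly quotes for straight ones in base R.

```r
# Made-up sample standing in for the .txt file from the question
sample_text <- "\u201cIt\u2019s on the wishlist,\u201d he said."

# Write the sample to a temporary file as raw UTF-8 bytes
path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(sample_text), path, useBytes = TRUE)

# Read it back, telling R that the bytes are UTF-8
txt <- readLines(path, encoding = "UTF-8")

# Replace curly double and single quotes with straight ones
straight <- gsub("[\u201c\u201d]", "\"", txt)
straight <- gsub("[\u2018\u2019]", "'", straight)

print(straight)
```

The resulting character vector can then be passed to VectorSource() as in the answer above, so that the tm_map steps work on correctly decoded text.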