R Spanish Term Frequency Matrix with TD and Quanteda Spanish Characters
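I'm trying to learn how to do some text analysis with Twitter data, and I've run into a problem when creating a term frequency matrix.
I can create the corpus from Spanish text (with special characters) without any problems.
However, when I create the term frequency matrix (with either the quanteda or the tm library), the Spanish characters are not displayed as expected (instead of canción, I see canciÃ³n).
Any suggestions on how to get the term frequency matrix to store the text with the correct characters?
Thanks for your help.
Note: I'd prefer to use the quanteda library, since I will ultimately be creating a word cloud and I think I understand that library's approach better. I'm also on a Windows machine.
I tried Encoding(tw2) <- "UTF-8", but without success.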
library(dplyr)
library(tm)
library(quanteda)
#' Create a character string containing special Spanish characters:
tw2 <- "RT @None: Enmascarados, si masduro chingán a tarek. Si quieres ahora, la aguantas canción . https://t."
# Clean the tweet: lower-case it, then remove "&", retweet markers, @mentions,
# punctuation, http links, and extra spaces.
clean_tw2 <- tolower(tw2)
clean_tw2 <- gsub("&", "", clean_tw2)
clean_tw2 <- gsub("(rt|via)((?:\\b\\W*@\\w+)+)", "", clean_tw2)
clean_tw2 <- gsub("@\\w+", "", clean_tw2)
clean_tw2 <- gsub("[[:punct:]]", "", clean_tw2)
clean_tw2 <- gsub("http\\w+", "", clean_tw2)
# Collapse runs of spaces/tabs to a single space so adjacent words are not glued together.
clean_tw2 <- gsub("[ \t]{2,}", " ", clean_tw2)
clean_tw2 <- gsub("^\\s+|\\s+$", "", clean_tw2)
# creates a vector with common stopwords, and other words which I want removed.
myStopwords <- c(stopwords("spanish"),"tarek","vez","ser","ahora")
clean_tw2 <- (removeWords(clean_tw2,myStopwords))
# If we print clean_tw2 we see that all the characters are displayed as expected.
clean_tw2
#' Create the corpus using the quanteda library
corp_quan <- corpus(clean_tw2)
# The corpus created via quanteda displays the characters as expected.
corp_quan$documents$texts
#' Create the corpus using the tm library
corp_td <- Corpus(VectorSource(clean_tw2))
#' (Common Spanish stop words were already removed from the text above.)
#' If we inspect corp_td, we see that the characters and words are displayed correctly.
inspect(corp_td)
# Create the DFM with the quanteda library.
tdm_quan <- dfm(corp_quan)
# Here the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
# Create the TDM with the tm library.
tdm_td <- TermDocumentMatrix(corp_td)
# Here the Spanish characters are displayed incorrectly (e.g. canción = canciÃ³n), and "si" is missing.
tdm_td$dimnames$Terms
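When a term prints as canciÃ³n, it can help to check how R has declared the string and which bytes it actually holds. The following is a minimal sketch in base R only; `describe_encoding` is a hypothetical helper introduced for illustration and is not part of quanteda or tm.

```r
# Hypothetical helper (not part of quanteda or tm): report how R has marked
# each string, whether its bytes are valid UTF-8, and the bytes themselves.
describe_encoding <- function(x) {
  data.frame(
    declared   = Encoding(x),
    valid_utf8 = validUTF8(x),
    bytes      = vapply(x, function(s) paste(charToRaw(s), collapse = " "),
                        character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# The term is written with a \u escape so the script file's own encoding
# does not influence the result.
term <- "canci\u00f3n"
info <- describe_encoding(term)
info
```

In a real session the same helper could presumably be pointed at the feature names of the dfm or the Terms of the TDM. If the bytes for ó are `c3 b3` but the declared encoding is not "UTF-8", the data is intact and only the declaration is wrong.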
Let me guess... you're on Windows? It works fine on macOS:
clean_tw2
## [1] "enmascarados si masduro chingán si quieres aguantas canción"
Encoding(clean_tw2)
## [1] "UTF-8"
dfm(clean_tw2)
## Document-feature matrix of: 1 document, 7 features (0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
##        features
## docs    enmascarados si masduro chingán quieres aguantas canción
##   text1            1  2       1       1       1        1       1
My system information:
sessionInfo()
# R version 3.4.4 (2018-03-15)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS High Sierra 10.13.4
#
# Matrix products: default
# BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] tm_0.7-3 NLP_0.1-11 dplyr_0.7.4 quanteda_1.1.6
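The session above runs in a UTF-8 locale (en_GB.UTF-8), whereas R builds of this era on Windows generally use a non-UTF-8 native encoding, which is a plausible reason the two platforms differ. A small sketch, using only base R, for checking which case a given session falls into:

```r
# Report the character-type locale and whether R treats it as UTF-8.
ctype <- Sys.getlocale("LC_CTYPE")
is_utf8_locale <- isTRUE(l10n_info()[["UTF-8"]])

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", is_utf8_locale, "\n")
```

If `is_utf8_locale` is FALSE, strings that lose their UTF-8 mark somewhere in a pipeline will be interpreted in the native code page instead, which is one way mis-displayed accents can arise.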
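It looks like quanteda (and tm) lose the encoding when creating the DFM on Windows. The same problem occurred with unnest_tokens in a tidytext issue. That now works fine, and quanteda's tokens also works fine.
If I force UTF-8 or latin1 encoding on the @Dimnames$features of the dfm, you get the correct results.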
# ... previous code ...
tdm_quan <- dfm(corp_quan)
# Here the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
       features
docs    enmascarados si masduro chingÃ¡n quieres aguantas canciÃ³n t
  text1            1  2       1        1       1        1        1 1
If you do the following:
Encoding(tdm_quan@Dimnames$features) <- "UTF-8"
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
       features
docs    enmascarados si masduro chingán quieres aguantas canción t
  text1            1  2       1       1       1        1       1 1
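To illustrate why re-declaring the encoding is enough, here is a minimal base-R sketch that simulates the symptom on a plain matrix. The toy matrix `m` merely stands in for a dfm and is an assumption for illustration. The sketch does not use quanteda, and it assumes the feature names hold correct UTF-8 bytes with the wrong declaration.

```r
# Simulate the symptom: correct UTF-8 bytes that carry the wrong declaration.
good <- "canci\u00f3n"               # marked as UTF-8
bad  <- rawToChar(charToRaw(good))   # same bytes, UTF-8 mark dropped
Encoding(bad) <- "latin1"            # bytes are now mis-declared as latin1

# A toy matrix standing in for a dfm; only its dimnames matter here.
m <- matrix(1, nrow = 1, dimnames = list(docs = "text1", features = bad))

# What the mis-declared name turns into when converted for display:
# each of the two bytes of "ó" is treated as a separate latin1 character.
shown_before <- enc2utf8(colnames(m))

# Re-declare the bytes as UTF-8. No bytes are changed by this step.
fixed <- colnames(m)
Encoding(fixed) <- "UTF-8"
colnames(m) <- fixed

shown_before
colnames(m)
```

Note that `Encoding<-` only changes the declaration, so it helps when the bytes are right and the mark is wrong, as here. That may also be why `Encoding(tw2) <- "UTF-8"` on the input made no difference: in the question the input text already displays correctly, and the mark appears to be lost later, when the matrix is built.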