How to handle UTF-8 characters in R
I am trying to analyze some tweets and am new to text mining. After some basic preprocessing, my output is:
> `head(tweet_corpus[[1]]$content)`
[1] "user father dysfunct selfish drag kid dysfunct run"
[2] "user user thank lyft credit use caus offer wheelchair van pdx disapoint getthank"
[3] "bihday majesti"
[4] "model love u take u time urã°âÿâ“â± ã°âÿâ˜â™ã°âÿâ˜âžã°âÿâ‘â„ã°âÿâ‘â…ã°âÿâ’â¦ã°âÿâ’â¦ã°âÿâ’â¦"
[5] "factsguid societi now motiv"
[6] "huge fan fare big talk leav chao pay disput get allshowandnogo"
and I noticed these characters:
> ã°âÿâ“â± ã°âÿâ˜â™ã°âÿâ˜âžã°âÿâ‘â„ã°âÿâ‘â…ã°âÿâ’â¦ã°âÿâ’â¦ã°âÿâ’â¦
According to a blog I read, these are UTF-8. I tried to handle them with:
raw_tweets$tweet <- iconv(raw_tweets$tweet, "ASCII", "UTF-8", sub="")
but got this error:
Error in iconv(raw_tweets$tweet, "ASCII", "UTF-8", sub = "") :
embedded nul in string: '#model i love u take with u all the time in urC[=13=]3B0C[=13=]2E8C[=13=]2b[=13=]4C[=13=]2B1!!! C[=13=]3B0C[=13=]2E8C[=13=]2K4C[=13=]2b[=13=]4"C[=13=]3B0C[=13=]2E8C[=13=]2K4C[=13=]2E=C[=13=]3B0C[=13=]2E8C[=13=]2b[=13=]0C[=13=]2b[=13=]6C[=13=]3B0C[=13=]2E8C[=13=]2b[=13=]0C[=13=]2b[=13=]&C[=13=]3B0C[=13=]2E8C[=13=]2b[=13=]1C[=13=]2B&C[=13=]3B0C[=13=]2E8C[=13=]2b[=13=]1C[=13=]2B&C[=13=]3B0C[=13=]2E8C[=13=]2b[=13=]1C[=13=]2B&'
What are these codes, and how should I handle them? Is there a rule of thumb for dealing with this kind of unstructured text?
My tweets contained some non-ASCII characters.
Using this code
tweet_corpus <- tm_map(tweet_corpus, function(x) iconv(x, "latin1", "ASCII", sub = ""))
I was able to solve the problem.
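For anyone who wants to see the idea without the tm package, here is a minimal sketch in base R. It assumes the raw text is really UTF-8 (the garbled runs above appear to be multi-byte sequences such as emoji that were decoded in a single-byte encoding), so the `from` value may need to be changed for other data. The `tweets` vector and the variable names are made up for illustration. Note that the failing call converted from "ASCII" to "UTF-8", while the working one converts to "ASCII".

```r
# Hypothetical sample data: one plain tweet and one ending in an emoji
# (U+1F4F1), written as an escape so this script itself stays ASCII.
tweets <- c("bihday majesti", "model love u take u time ur\U0001F4F1")

# With the default sub = NA, iconv() returns NA for any element that
# cannot be converted, which makes it easy to flag non-ASCII tweets.
has_non_ascii <- is.na(iconv(tweets, from = "UTF-8", to = "ASCII"))

# With sub = "", bytes that cannot be represented in ASCII are dropped
# instead, so every element is kept but stripped of the emoji.
ascii_only <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

print(has_non_ascii)
print(ascii_only)
```

Flagging first and stripping second makes it possible to inspect the affected tweets before anything is discarded, which matters if the emoji carry sentiment you want to keep.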