(R) Save data (vector or data frame) with Chinese characters / UTF-8 on Windows 10
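I am trying to save some data downloaded from a website that contains Chinese characters. I have tried many things without success. The default text encoding in RStudio is set to UTF-8, and the Windows 10 region settings also have "Beta: Use Unicode UTF-8 for worldwide language support" enabled.
Here is the code to reproduce the problem: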
```
## packages used
library(jiebaR)   ## for file_coding()
library(htm2txt)  ## for gettxt()
library(httr)     ## just in case
library(readtext)

## get the original text with Chinese characters
mytxtC <- gettxt("https://archive.li/wip/kRknx")
## print to check that the Chinese characters appear
mytxtC
## try to save in UTF-8
write.csv(mytxtC, "csv_mytxtC.csv", row.names = FALSE, fileEncoding = "UTF-8")
## check whether it is readable
read.csv("csv_mytxtC.csv", encoding = "UTF-8")
## doesn't work; check the file encoding
file_coding("csv_mytxtC.csv")
## answer: "windows-1252"
## try with txt
write(mytxtC, "txt_mytxtC.txt")
toto <- readtext("txt_mytxtC.txt")
toto[1, 2]
## still not readable; try file_coding
file_coding("txt_mytxtC.txt")
## "windows-1252"
```
For information:
```
Sys.getlocale()
[1] "LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"
```
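Because the session locale reported above uses code page 1252, one commonly suggested way to avoid depending on the locale is to convert the text to UTF-8 and write its bytes verbatim through a connection. The following is a minimal sketch using base R only; `write_utf8`, `read_utf8`, and `sample_text` are hypothetical names that do not appear in the original post, and `sample_text` merely stands in for `mytxtC`.

```r
# Hypothetical helpers (not from the original post); base R only.

# Write a character vector as UTF-8 bytes, independent of the session locale.
write_utf8 <- function(x, path) {
  con <- file(path, open = "wb")               # binary mode: no newline translation
  on.exit(close(con), add = TRUE)
  writeLines(enc2utf8(x), con, useBytes = TRUE)  # bytes are written without re-encoding
  invisible(path)
}

# Read the file back and mark non-ASCII strings as UTF-8.
read_utf8 <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

# Stand-in for mytxtC: two Chinese characters plus an ASCII line.
sample_text <- c("\u4e2d\u6587", "plain ASCII")
tmp <- tempfile(fileext = ".txt")
write_utf8(sample_text, tmp)
roundtrip_ok <- identical(read_utf8(tmp), sample_text)
```

Whether this resolves the problem in the question depends on the R version and on how `gettxt()` marks the encoding of its result, so treat it as something to try rather than a confirmed fix.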
I changed the locale with Sys.setlocale, and it seems to work.
I just added this line at the beginning of the code:
Sys.setlocale("LC_CTYPE","chinese")
Just remember to change it back at the end. Still, I find it strange that this line makes saving in UTF-8 possible when it was not possible before...
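To make "changing it back" automatic, the locale switch can be wrapped in a small function that restores the previous value on exit. This is only a sketch: `with_ctype` is a hypothetical helper name, and the locale string "chinese" is the Windows-specific name used in the line above.

```r
# Hypothetical helper (not from the original post): evaluate `code` under a
# temporary LC_CTYPE and restore the previous setting afterwards, even on error.
with_ctype <- function(locale, code) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", locale)
  force(code)  # `code` is a lazy argument, so it is evaluated only here
}

# Usage on Windows (assumes mytxtC from the question exists):
# with_ctype("chinese",
#            write.csv(mytxtC, "csv_mytxtC.csv",
#                      row.names = FALSE, fileEncoding = "UTF-8"))
```

Because R evaluates arguments lazily, the expression passed as `code` runs after the locale has been switched and before `on.exit` restores it.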
This works for me on Windows:
Download the file:
download.file("https://archive.li/wip/kRknx", destfile="external_file", method="libcurl")
Read the text:
my_text <- readLines("external_file") # readLines(url) works as well
Check that it is valid UTF-8:
> sum(validUTF8(my_text)) == length(my_text)
[1] TRUE
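You can also check the file (note that this call validates the string "external_file" itself, not the contents of the file):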
> validUTF8("external_file")
[1] TRUE
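Since `validUTF8()` operates on character strings, checking a file's contents means reading its bytes first. A minimal sketch, assuming the file has no embedded NUL bytes (`rawToChar()` would raise an error otherwise); `file_is_utf8` and the two temporary files are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: TRUE if the bytes stored in `path` form valid UTF-8.
file_is_utf8 <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  validUTF8(rawToChar(bytes))
}

# Two small demonstration files written as raw bytes.
good <- tempfile()
bad <- tempfile()
writeBin(charToRaw(enc2utf8("\u4e2d\u6587")), good)  # UTF-8 encoded Chinese text
writeBin(as.raw(c(0xe9, 0x74, 0xe9)), bad)           # windows-1252 bytes, not valid UTF-8

good_is_utf8 <- file_is_utf8(good)
bad_is_utf8 <- file_is_utf8(bad)
```

For the downloaded page, the equivalent call would be `file_is_utf8("external_file")`.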
This is the only difference I noticed on Windows:
user@somewhere:~/Downloads$ file external_file
external_file: HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators
versus
user@somewhere:~/Downloads$ file external_file
external_file: HTML document, UTF-8 Unicode text, with very long lines
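The line-terminator difference reported by `file` can also be checked from within R by scanning the raw bytes for a carriage return followed by a line feed. This is a sketch in base R; `has_crlf` and the temporary files are hypothetical names used for illustration.

```r
# Hypothetical helper: TRUE if the file contains at least one CRLF pair.
has_crlf <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  n <- length(bytes)
  if (n < 2) return(FALSE)
  # Compare each byte with its successor: 0x0d (CR) followed by 0x0a (LF).
  any(bytes[-n] == as.raw(0x0d) & bytes[-1] == as.raw(0x0a))
}

# Demonstration files with Windows-style and Unix-style line endings.
crlf_file <- tempfile()
lf_file <- tempfile()
writeBin(charToRaw("a\r\nb\r\n"), crlf_file)
writeBin(charToRaw("a\nb\n"), lf_file)

crlf_found <- has_crlf(crlf_file)
lf_found <- has_crlf(lf_file)
```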