How to save non-English characters in a text file in R?

I am trying to save some Hindi text to a text file in R.

data <- c("चौपाई")
write(data, "data.txt")

When opened, the output file data.txt shows the Unicode characters as:

<U+091A><U+094C><U+092A><U+093E><U+0908>

instead of the Hindi text चौपाई.

What am I doing wrong?

RStudio screenshot

Screenshot of data.txt opened in Windows Notepad

Screenshot of data.txt opened in Notepad++

I found a workaround in the article "Write file as UTF-8 encoding in R for Windows" (which has an exhaustive explanation):

BOM <- charToRaw('\xEF\xBB\xBF')

writeUtf8 <- function(x, file, bom=FALSE) {
  Encoding(x) <- "UTF-8"                  # marks x as UTF-8; possibly redundant
  con <- file(file, "wb")
  if(bom) writeBin(BOM, con, endian="little")
  writeBin(charToRaw(x), con, endian="little")
  close(con)
}

data <- c("चौपाई")
writeUtf8(x=data, file="data.txt")

Explanation (copied from the article above, partially truncated):

Difference between Windows and other OSs

I will try to put this as simply as I can. Windows selects one of many language sets, whereas Linux and Mac OS select one language subset of a single UTF-8 set. Because of this difference, Windows forgets characters of unselected languages, while the other OSs remember characters of all languages.

Problem on Windows

When text is written to a file, characters from languages outside the selected locale cannot be handled. Some of them are converted into a similar (but incorrect!) character, and others are written in an escaped format such as <U+222D>.

Note that R is not responsible for this problem; it is caused by the way the OS switches between languages.

… when R writes UTF-8 text to a file on Windows, characters of unsupported languages are modified. In contrast, all characters are written correctly on Mac OS.

Using binary

There is a solution to this problem: writing a binary file instead of a text file. All applications that handle UTF-8 files on Windows use the same trick.
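To make the binary-mode idea concrete, here is a minimal sketch (not taken from the article) that writes a string through a "wb" connection and checks that the file holds exactly the UTF-8 bytes of the string. The scratch path and the use of enc2utf8() are my own assumptions; the \u escapes spell the same word as in the question, so the script does not depend on its own file encoding.

```r
# Assumption: the word from the question, written with \u escapes.
x <- "\u091a\u094c\u092a\u093e\u0908"

path <- tempfile(fileext = ".txt")   # hypothetical scratch file

con <- file(path, "wb")              # "wb": bytes are written exactly as given
writeBin(charToRaw(enc2utf8(x)), con)
close(con)

# Read the raw bytes back and compare them with the UTF-8 bytes of x.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
identical(bytes, charToRaw(enc2utf8(x)))
```

Because only raw bytes pass through the connection, the locale has no opportunity to substitute or escape characters.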

BOM

The BOM should not be used in UTF-8 files, and Linux and Mac OS do not use it. But Windows Notepad and some other applications do, so handling the BOM is necessary even though it is technically incorrect.

The BOM is a 3-byte sequence placed at the beginning of a text file. Because R does not use the BOM, it should be removed when reading.

BOM <- charToRaw('\xEF\xBB\xBF')
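As a small illustration of what "handling the BOM" means at the byte level, the sketch below checks for and strips the three BOM bytes from a raw vector. The names utf8_bom, has_bom and strip_bom are hypothetical helpers of mine, not part of the article or of base R.

```r
# The UTF-8 BOM as a raw vector, written without relying on string escapes.
utf8_bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# has_bom(): hypothetical helper, TRUE if the raw vector starts with the BOM.
has_bom <- function(bytes) {
  length(bytes) >= 3L && identical(bytes[1:3], utf8_bom)
}

# strip_bom(): hypothetical helper, drops the leading BOM if there is one.
strip_bom <- function(bytes) {
  if (has_bom(bytes)) bytes[-(1:3)] else bytes
}

with_bom <- c(utf8_bom, charToRaw("abc"))
has_bom(with_bom)                 # BOM present
rawToChar(strip_bom(with_bom))    # text without the BOM
has_bom(charToRaw("ab"))          # too short to contain a BOM
```

The length check keeps the helper safe for files shorter than three bytes.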

Write UTF-8 file

writeUtf8 <- function(x, file, bom=FALSE) {
  con <- file(file, "wb")
  if(bom) writeBin(BOM, con, endian="little")
  writeBin(charToRaw(x), con, endian="little")
  close(con)
}

Pass a UTF-8 string as x= and the name of the file to write as file=. If you want to read the file only with Windows Notepad, adding a BOM with the bom=TRUE option is a good choice. Note that this is a minimal script, not meant for writing very large files.
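Since charToRaw() works on a single string, the function above effectively writes one string per call. If you need to write a character vector as separate lines, something like the following might work. This is only a sketch: writeUtf8Lines and its sep argument are hypothetical names of mine, and I am assuming that converting with enc2utf8() before joining is enough for your data.

```r
# writeUtf8Lines(): hypothetical variant of writeUtf8() for character vectors.
writeUtf8Lines <- function(x, file, bom = FALSE, sep = "\n") {
  # Convert every element to UTF-8, then join them into one string
  # with a trailing line separator.
  text <- paste0(paste(enc2utf8(x), collapse = sep), sep)
  con <- file(file, "wb")            # binary mode, as in writeUtf8()
  on.exit(close(con))                # close the connection even on error
  if (bom) writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
  writeBin(charToRaw(text), con)
  invisible(file)
}

x <- c("\u091a\u094c\u092a\u093e\u0908", "second line")
path <- tempfile(fileext = ".txt")   # hypothetical scratch file
writeUtf8Lines(x, path)
readLines(path, encoding = "UTF-8")
```

Like the original, this builds the whole text in memory, so it is not meant for very large files either.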

Edit

Note the encoding statement (Encoding(result) <- "UTF-8") added to the readUtf8 and readUtf8Text functions:

Reading a UTF-8 file is easy, because functions like readLines have an encoding= option that accepts UTF-8.

readUtf8Text <- function(file) {
  con <- file(file, 'rt')
  result <- readLines(con, encoding='utf-8')
  close(con)
  Encoding(result) <- "UTF-8"                # important
  result
}

If you want to read a UTF-8 file saved by standard Windows applications such as Notepad, you may run into trouble. Because Windows Notepad prepends a BOM when writing a UTF-8 file, you must remove the BOM in R; otherwise it will appear as a corrupted character at the beginning of the string.

R 3.0.0 now supports the UTF-8-BOM encoding, which removes the BOM. However, if you are still using R 2.15.3, you must remove the BOM manually. The following code reads a UTF-8 file as binary and removes the BOM.

Note that this is a minimal script, not meant for reading very large files.

readUtf8 <- function(file) {
  size <- file.info(file)$size
  con <- file(file, "rb")
  x <- readBin(con, raw(), size, endian="little")
  close(con)
  pstart <- ifelse(all(x[1:3]==BOM), 4, 1)
  pend <- length(x)
  result <- rawToChar(x[pstart:pend])
  Encoding(result) <- "UTF-8"               # important
  result
}
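As an alternative to the byte-level readUtf8 above, the BOM can also be dropped after reading the file as text, because a decoded BOM is just the character U+FEFF at the start of the first line. The sketch below is my own and not from the article; it first creates a small test file with a BOM (hypothetical scratch path), then reads it with readLines and removes the leading U+FEFF with sub().

```r
x <- "\u091a\u094c\u092a\u093e\u0908"
path <- tempfile(fileext = ".txt")   # hypothetical scratch file

# Create a test file that starts with a BOM, as Notepad would write it.
con <- file(path, "wb")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw(x), charToRaw("\n")), con)
close(con)

# Read the lines and mark them as UTF-8 (no re-encoding takes place).
lines <- readLines(path, encoding = "UTF-8")

# Remove a leading U+FEFF (the decoded BOM) from the first line, if present.
lines[1] <- sub("^\ufeff", "", lines[1])

identical(charToRaw(lines[1]), charToRaw(x))
```

This keeps the strings in UTF-8 throughout, which matters on Windows setups where the native encoding cannot represent the text.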

Result

Tested in RStudio 1.3 and RGui 4.0.1 (Windows 10/64-bit, i.e. platform x86_64-w64-mingw32):

> data <- c("चौपाई")
> writeUtf8(x=data, file="data.txt")
> 
> data
[1] "चौपाई"
>
> readUtf8Text(file="data.txt")
[1] "चौपाई"
>
> readUtf8(file="data.txt")
[1] "चौपाई"

Demonstrating the importance of Encoding(result) <- "UTF-8" in both reading functions, to prevent mojibake:

> file <- "data.txt"
> con <- file(file, 'rt')
> result <- readLines(con, encoding='utf-8')
> close(con)
> result                                       # mojibake
[1] "चौपाई"
> Encoding(result) <- "UTF-8"
> result
[1] "चौपाई"
>
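To see why the Encoding() assignment matters, it may help to note that it does not change the bytes of a string, only how R is told to interpret them. The following sketch (mine, using \u escapes for the word from the question) drops and restores the mark and compares the underlying bytes.

```r
x <- "\u091a\u094c\u092a\u093e\u0908"   # \u escapes give a UTF-8 marked string
Encoding(x)

y <- x
Encoding(y) <- "unknown"                # drop the mark, keep the bytes
Encoding(y)

# Both strings hold exactly the same bytes; only the declared encoding differs.
same_bytes <- identical(charToRaw(x), charToRaw(y))
same_bytes

Encoding(y) <- "UTF-8"                  # re-declare, as the read functions do
Encoding(y)
```

With the mark missing, R falls back to the native encoding when printing, which is presumably where the mojibake on a non-UTF-8 Windows locale comes from.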