How to account for single apostrophe/quotation in read.table?

I have the following data frame:

1            1                                        What percent of the world\xd5s population is between 15 and 64 years old?
2            2                                               What percent of the world\xd5s airports are in the United States? 
3            3                                            The area of the USA is what percent of the area of the Pacific Ocean?
4            4                                                      What percent of the earth\xd5s surface is covered by water?
5            5 What percent of the goods exported worldwide are mineral fuels (including oil, coal, gas, and refined products)?
6            6                    What percent of the world\xd5s countries have a higher fertility rate than the United States?
7            7                        What percent of the worldwide gross domestic product (GDP) comes from the service sector?
8            8                                    What percent of the worldwide income does the richest 10% of households earn?
9            9      What percent of the worldwide gross domestic product (GDP) is re-invested (\xd2gross fixed investment\xd3)?
10          10                                      What percent of the worldwide labor force works in the agricultural sector?
11          11                                             What percent of the worldwide land mass is not used for agriculture?
12          12                           What percent of the world\xd5s population speaks Mandarin Chinese as a first language?
13          13                What percentage of the world\xd5s countries have a higher life expectancy than the United States?
14          14                             What percent of the world\xd5s population aged 15 years or older can read and write?
15          15      What percent of the worldwide gross domestic product (GDP) is used for the military (military expenditure)?
16          16                                                    Saudi Arabia consumes what percentage of the oil it produces?
17          17                   What percent of the world\xd5s population lives in either China, India, or the European Union?
18          18                                                          What percent of the world\xd5s population is Christian?
19          19                                                               What percent of the world\xd5s roads are in India?
20          20                         What percent of the world\xd5s telephone lines are in China, USA, or the European Union?

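Every possessive in these questions (e.g. world's, earth's) should contain an apostrophe, but as you can see, the text is not being read in the way I want. I have tried expressions such as DF <- read.table("mydata.csv", header=TRUE, sep="\t", quote="") to no avail. Surprisingly, an answer to this has been extremely hard to find.

If this cannot be solved by choosing a better read-in method, it can be solved with a regular expression; for example: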

x <- "What percent of the world\xd5s population"
gsub("\\xd5", "'", x)
[1] "What percent of the world's population"

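You seem to have other unfortunate quote conversions as well; these can be handled with an alternation pattern (though, interestingly, not with regex shorthand classes such as \d for digits):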

x <- c("What percent of the world\xd5s population", 
       "gross domestic product (GDP) is re-invested (\xd2gross fixed investment\xd3)")
gsub("\\xd5|\\xd2|\\xd3", "'", x)
[1] "What percent of the world's population"                                
[2] "gross domestic product (GDP) is re-invested ('gross fixed investment')"

You can read the table with readLines and exploit the fact that the first two columns always seem to occupy 14 characters.

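# Pass the path directly so no unclosed connection is left behind
r <- trimws(readLines("mydata.csv"))

# The first 14 characters hold the two numeric columns; the rest is the question text
res <- data.frame(do.call(rbind, strsplit(substring(r, 1, 14), "\\s+")), 
                  X3=trimws(substring(r, 15, nchar(r))),
                  stringsAsFactors=FALSE)  # keep character columns in R < 4.0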

Then clean it up.

within(res, {
  X1 <- as.numeric(X1)
  X2 <- as.numeric(X2)
  X3 <- gsub("\\xd5", "'", X3)
  X3 <- gsub("\\xd2|\\xd3", '"', X3)
})
#    X1 X2                                                                                                               X3
# 1   1  1                                           What percent of the world's population is between 15 and 64 years old?
# 2   2  2                                                   What percent of the world's airports are in the United States?
# 3   3  3                                            The area of the USA is what percent of the area of the Pacific Ocean?
# 4   4  4                                                         What percent of the earth's surface is covered by water?
# 5   5  5 What percent of the goods exported worldwide are mineral fuels (including oil, coal, gas, and refined products)?
# 6   6  6                       What percent of the world's countries have a higher fertility rate than the United States?
# 7   7  7                        What percent of the worldwide gross domestic product (GDP) comes from the service sector?
# 8   8  8                                    What percent of the worldwide income does the richest 10% of households earn?
# 9   9  9            What percent of the worldwide gross domestic product (GDP) is re-invested ("gross fixed investment")?
# 10 10 10                                      What percent of the worldwide labor force works in the agricultural sector?
# 11 11 11                                             What percent of the worldwide land mass is not used for agriculture?
# 12 12 12                              What percent of the world's population speaks Mandarin Chinese as a first language?
# 13 13 13                   What percentage of the world's countries have a higher life expectancy than the United States?
# 14 14 14                                What percent of the world's population aged 15 years or older can read and write?
# 15 15 15      What percent of the worldwide gross domestic product (GDP) is used for the military (military expenditure)?
# 16 16 16                                                    Saudi Arabia consumes what percentage of the oil it produces?
# 17 17 17                      What percent of the world's population lives in either China, India, or the European Union?
# 18 18 18                                                             What percent of the world's population is Christian?
# 19 19 19                                                                  What percent of the world's roads are in India?
# 20 20 20                            What percent of the world's telephone lines are in China, USA, or the European Union?
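
If the 14-character assumption turns out not to hold for every row, a regex-based split is one possible alternative. The following is only a sketch: the two sample lines are made up from the question's data and contain ASCII only, so real lines with stray bytes may need converting first.

```r
# Hypothetical sample lines in the same layout as the question's data
r <- c("1            1      The area of the USA is what percent of the area of the Pacific Ocean?",
       "10          10      What percent of the worldwide labor force works in the agricultural sector?")

# Capture: first number, second number, then everything else
m <- regmatches(r, regexec("^(\\d+)\\s+(\\d+)\\s+(.*)$", r))
parts <- do.call(rbind, m)  # column 1 is the full match, columns 2-4 the groups

res <- data.frame(X1 = as.numeric(parts[, 2]),
                  X2 = as.numeric(parts[, 3]),
                  X3 = parts[, 4],
                  stringsAsFactors = FALSE)
res
```

This does not depend on column widths, only on each line starting with two whitespace-separated integers.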

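The string

What percent of the world\xd5s population is between 15 and 64 years old?

is most likely the result of reading a text file that contains non-ASCII characters. Here the sequence \xd5 stands for a single byte, the curly single quote (apostrophe) in whatever encoding the file uses, not the four characters \ x d 5. Likewise, \xd2 and \xd3 stand for the left and right double quotes, respectively. So your file was read correctly; it just isn't printed the way you expect.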
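
If the encoding can be guessed, the bytes can be decoded rather than patched. The byte values here match Mac OS Roman, where 0xD5 is the right single quotation mark and 0xD2/0xD3 are the curly double quotes; that the file really is Mac Roman is an assumption, and iconv support for the name "macintosh" depends on the platform. A minimal sketch:

```r
# Two sample strings holding the raw bytes, as in the question
x <- c("What percent of the world\xd5s population",
       "re-invested (\xd2gross fixed investment\xd3)")

# Assumption: the source encoding is Mac Roman ("macintosh" in iconv)
y <- iconv(x, from = "macintosh", to = "UTF-8")

# The strings now contain real curly quotes; map them to ASCII if desired
y <- gsub("\u2019", "'", y, fixed = TRUE)   # right single quote
y <- gsub("\u201c", "\"", y, fixed = TRUE)  # left double quote
y <- gsub("\u201d", "\"", y, fixed = TRUE)  # right double quote
y
```

If iconv returns NA, the encoding name is not supported on that system or the guess is wrong; iconvlist() shows the names available.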

To convert \xd5 to a regular ASCII quote:

gsub("\xd5", "'", x)  # no extra backslashes needed

Likewise, to convert \xd2 and \xd3 to ASCII double quotes:

gsub("\xd2|\xd3", '"', x)
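
In a UTF-8 locale, R's regex functions may warn or stop with an "input string is invalid" message when they meet these bytes. One possible workaround, sketched here with sample strings from the question, is to match literal bytes with fixed = TRUE and useBytes = TRUE:

```r
x <- c("What percent of the world\xd5s population",
       "re-invested (\xd2gross fixed investment\xd3)")

# fixed = TRUE treats the pattern as a literal string;
# useBytes = TRUE compares byte by byte, so the encoding of x is never validated
x <- gsub("\xd5", "'", x, fixed = TRUE, useBytes = TRUE)
x <- gsub("\xd2", "\"", x, fixed = TRUE, useBytes = TRUE)
x <- gsub("\xd3", "\"", x, fixed = TRUE, useBytes = TRUE)
x
```

This takes one call per byte instead of a single alternation pattern, in exchange for not depending on the locale.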

(If you are using a version of R < 4.0, you should also read the data with read.table(*, stringsAsFactors=FALSE).)
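
Another option may be to declare the encoding at read time through the fileEncoding argument of read.table. The sketch below builds a small temporary file as a stand-in for mydata.csv; the Mac Roman encoding, the column names and the file contents are all assumptions, and it expects R to be running in a UTF-8 locale.

```r
# Write a tiny tab-separated file containing one Mac Roman byte (0xD5)
tmp <- tempfile(fileext = ".csv")
bytes <- c(charToRaw("id\tquestion\n1\tWhat percent of the world"),
           as.raw(0xd5),
           charToRaw("s roads are in India?\n"))
writeBin(bytes, tmp)

# fileEncoding re-encodes the input while reading (assumed: Mac Roman)
DF <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                 fileEncoding = "macintosh", stringsAsFactors = FALSE)

# Optionally replace the curly apostrophe with a plain ASCII one
DF$question <- gsub("\u2019", "'", DF$question, fixed = TRUE)
DF
```

With the encoding declared, no \x escapes should appear in the printed data frame.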

I eventually found the answer: DF1 <- read.csv("mydata.csv", header=TRUE, sep=",", quote="")