Pattern matching is failing due to apparent encoding problems with scraped text

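Summary edit for Google's sake (if that's allowed): grepl and pattern matching fail on apparently identical strings. The suspected problem was irregularities in the encoding of the scraped text. The real problem is something unseen: an invisible extra something in a space that does not show up in nchar. The solution is to use gsub and a regex to remove all whitespace before attempting pattern matching. smingerson found the solution.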

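Original question: I want to do topic modeling on a collection of online sermons scraped with rvest.

I'm using pattern matching for cleaning and organizing, especially grepl.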

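The problem is that grepl fails to match apparently identical strings. The scraped text is a mix of "unknown" and "UTF-8" encodings. Functions such as Encoding, enc2native, enc2utf8, and iconv don't seem to help, nor does adjusting grepl arguments such as perl = TRUE or useBytes = TRUE. (Not that I fully understand what all of these do.)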

There seem to be several posts like this one: (1) Troubles with encoding, pattern matching and noisy texts in R; (2) https://community.rstudio.com/t/enconding-solution-for-linux-and-windows-10/2055; (3) R on Windows: character encoding hell; and others.

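Regarding #1, I'm working in English rather than Swedish, so I don't think changing the locale will help. I also don't understand which part of the code credited to Wiktor, in the answer provided by the original poster, is solving the problem.

Regarding #2, as you'll see below, I have tried making changes with Encoding(), without success.

I include #3 to show that many posts discuss foreign languages, whereas I'm only working in English. They also discuss difficulties with Windows 10 and encoding in RStudio, in case that is relevant.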

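Here is my attempt at reproducible code. Unfortunately, the error seems to come from my original files and cannot be reproduced by copying and pasting the code below; the differing charToRaw results under Edit #1 demonstrate this. Following a comment, I added a file on GitHub that contains the error when loaded in my session. Following another comment, I also added the library calls and removed some whitespace from the middle of "scrapedtitle", because Stack Overflow's formatting was introducing a line break in the middle of the "author" variable. At the end of Edit #2, I also try to create a way to copy and paste the troubled encoding using rawToChar, but I could not coerce the values to "raw". In Edit #3, I discuss the RStudio options for encoding and describe how, unfortunately, I saved different portions of the scrape with different encoding settings without keeping track of which I used where. I had assumed this information would be recoverable and reversible, but it was not.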

#Library calls
library(topicmodels)
library(LDAvis)
library(tm)
library(dplyr)
library(magrittr)
library(stringr)

#The scraped title of a sermon
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

#Extract the author from the title
author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))

#Elsewhere, identify the author from another scraped list of sermons and authors:
scrapedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

#attempted grepl: 
which(grepl(author, scrapedvector)) # only returns 2 when it should return 2 and 5

#Exploring:
typed <- "By Elder Brook P. Hales" # This is typed in from my keyboard

typed == scrapedvector[5] # FALSE unexpectedly

grepl(author, typed) #TRUE as you'd expect
grepl(author, scrapedvector[5]) # FALSE unexpectedly

#Checking encoding
Encoding(scrapedvector) #[1] "unknown" "unknown" "unknown" "unknown" "UTF-8"
Encoding(typed) #[1] "unknown"
Encoding(author) #[1] "unknown"

#Attempting to change the encoding:
Encoding(scrapedvector) <- "UTF-8"
Encoding(scrapedvector) # [1] "unknown" "unknown" "unknown" "unknown" "UTF-8" # No change
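
A likely reason the assignment above appears to do nothing: R never marks pure-ASCII strings with a declared encoding, so Encoding() keeps reporting "unknown" for them even after "UTF-8" is assigned. Only strings that contain non-ASCII characters carry a mark. A minimal sketch of that behaviour (the names ascii_only and with_nbsp are made up for illustration and are not from the scraped data):

```r
# Illustration only: hypothetical strings, not the original scrape.
ascii_only <- "Brook P. Hales"        # every byte is plain ASCII
Encoding(ascii_only) <- "UTF-8"       # silently has no effect on ASCII strings
Encoding(ascii_only)                  # "unknown"

with_nbsp <- "Brook\u00a0P. Hales"    # contains U+00A0, a non-breaking space
Encoding(with_nbsp)                   # "UTF-8"
```

So a mix of "unknown" and "UTF-8" is not in itself a sign of corruption; it may only indicate which elements contain a non-ASCII character.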

Edit #1:

# Adding charToRaw information: 
charToRaw(typed)
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
charToRaw(scrapedvector[5]) 
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73
# The scraped version has "c2 a0" (the UTF-8 bytes of U+00A0, a non-breaking space)
# at the 15th position, where the typed version has "20" (an ordinary space).

# Results from pasting the vector back into R from this Stack Overflow post:
repastedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

charToRaw(repastedvector[5])
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
# The repasted string is identical to what I typed, but not to what I saved after scraping.

# Posting this because it is mentioned in other posts
Sys.getlocale()

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
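
For anyone who wants to locate this kind of invisible character without reading raw bytes, one option is to look at code points with base R's utf8ToInt(). The sketch below rebuilds the problem string with a "\u00a0" escape, which is an assumption based on the "c2 a0" bytes shown above rather than the original scraped object:

```r
# Assumed reconstruction of the scraped value: U+00A0 between "Brook" and "P."
scraped <- "By Elder Brook\u00a0P. Hales"

code_points <- utf8ToInt(scraped)      # one integer per character
non_ascii   <- which(code_points > 127)

non_ascii                # 15: position of the first (and only) non-ASCII character
code_points[non_ascii]   # 160, i.e. U+00A0, the non-breaking space
```

Any position reported here is a character that a keyboard-typed pattern with ordinary spaces will not match.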

Edit #2:

A sample file is available on GitHub: https://github.com/baprisbrey/Whosebug/releases/tag/vA0
The file is scrapedTalk2.rds.

This is what I see when I load this file into my RStudio session:

scrapedTalk <- readRDS("scrapedTalk2.rds")
grepl(author, scrapedTalk) %>% which() # Result is 8.  It should be 8 and 73

scrapedvector2 <- scrapedTalk[c(7,8,18,72,73)] # This is the same as the scrapedvector from above 

Encoding(scrapedTalk)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [12] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [23] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [34] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [45] "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown" "unknown" "unknown"
 [56] "UTF-8"   "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [67] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown" "unknown" "UTF-8"   "UTF-8"  
 [78] "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"   "UTF-8"  
 [89] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"  
[100] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown"
[111] "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"  
[122] "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown" "unknown"
[133] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"

scrapedTalk[73] == "By Elder Brook P. Hales" # FALSE, which is unexpected.

charToRaw(scrapedTalk[73]) # for reference
 [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73

# Can I recreate the troubled encoding by pasting the charToRaw result above?
# Note: There may be an unintentional newline ("\n") character introduced below due to the length of the string and the Stack Overflow formatting. It should be removed.
troubleString <-  "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73" %>%
                   strsplit(. ,split=" ") %>%  # so far so good
                   unlist %>%                  # no troubles
                   as.raw %>%                  # NA's and 0's introduced
                   rawToChar                   # failure!
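
The pipeline above probably fails because as.raw() reads character input as decimal numbers, so hex pairs such as "6c" turn into NA and then 00. A hedged sketch of one way around this is to parse each pair as base 16 with strtoi() first (the names hex_dump and bytes are illustrative):

```r
# Hex dump copied from the charToRaw() output above.
hex_dump <- "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73"

# Parse each two-character token as a base-16 integer, then convert to raw.
bytes <- as.raw(strtoi(strsplit(hex_dump, " ", fixed = TRUE)[[1]], base = 16L))

troubleString <- rawToChar(bytes)
Encoding(troubleString) <- "UTF-8"          # declare how the bytes should be read

troubleString == "By Elder Brook P. Hales"  # FALSE, as with the scraped value
```

This gives a copy-and-paste way to reproduce the mismatch without the .rds file.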

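Edit #3: Because the problem seems to be encoding, I'm including a discussion of RStudio's encoding options. In RStudio, under File >> Save With Encoding, there is the following menu of options: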

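There are many encoding options, and I don't know the differences between all of them. My first question is why Encoding() doesn't show all of these options; the "unknown" bucket must cover most of them. Second, because of my encoding difficulties, I switched encoding options, and it is likely that some of the scraped material was saved with one of these other options. However, I don't remember which options I tried on which parts of the scraped material. I recognize that this introduces ambiguity into the question. I'd like to know why I can't recover the correct encoding or convert to another encoding, but mainly why I can't get grepl to work.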

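There is some sort of space in the values that isn't cooperating. On further inspection, one of them appears to have an extra space, although it is not apparent when printed. The first bit below shows how to replace runs of whitespace with a single space. The second shows how to remove all space-like characters when making the comparison.

Solution 1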

library(tidyverse)
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))
# Replace each run of whitespace with a single space.
# The backslash must be doubled inside an R string literal: "\\s+".
condensedAuthor <- gsub("\\s+", " ", author)

scrapedTalk <- readRDS("scrapedTalk2.rds")
condensedTalk <- gsub("\\s+", " ", scrapedTalk)
indices <- grepl(condensedAuthor, condensedTalk)
scrapedTalk[indices]
# [1] "Brook P. Hales"          "By Elder Brook P. Hales"

Solution 2

library(tidyverse)
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))
condensedAuthor <- gsub("[[:space:]]", "", author)

scrapedTalk <- readRDS("scrapedTalk2.rds")
condensedTalk <- gsub("[[:space:]]", "", scrapedTalk)
indices <- grepl(condensedAuthor, condensedTalk) # TRUE at positions 8 and 73
scrapedTalk[indices] # Get the corresponding values from the original vector.
# [1] "Brook P. Hales"          "By Elder Brook P. Hales"

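Edit: I originally had \\s+, the regex representation of a space, as the replacement, which ended up replacing matches with "s" rather than " ". I have updated the code to use " ".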
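
Whether \\s or [[:space:]] treats a non-breaking space as whitespace may depend on the platform and regex engine, so the solutions above might not behave the same everywhere. A more explicit alternative, sketched here under the assumption that U+00A0 is the only offending character (as the charToRaw output in the question suggests), is to replace it directly. The helper name normalize_spaces is hypothetical, and the vector is a small stand-in for the scraped data:

```r
# Hypothetical helper: turn non-breaking spaces (U+00A0) into ordinary spaces.
normalize_spaces <- function(x) {
  gsub("\u00a0", " ", x, fixed = TRUE)
}

# Stand-in data; the second element contains U+00A0 instead of a space.
scrapedvector <- c("Brook P. Hales",
                   "By Elder Brook\u00a0P. Hales",
                   "Russell M. Nelson")
author <- "Brook P. Hales"

# fixed = TRUE treats the author name literally, so "." is not a wildcard.
matches <- which(grepl(author, normalize_spaces(scrapedvector), fixed = TRUE))
matches  # 1 2
```

Unlike removing all whitespace, this keeps the original word boundaries, so the cleaned text can still be used for tokenising later.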