Translate encoding of android mail in R

Question

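I am using the R package mRpostman to access my mail account from R. When I fetch mails that were sent from my computer via Thunderbird to a dedicated mail address, everything works fine. But when I do the same with a mail sent from my Android phone, the text is strangely encoded and no longer legible. How can I fix this? I tried base64enc::base64decode() but could not get it to work. I also tried changing the encoding via Encoding(), which failed as well.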

Reprex

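I sent two mails. One was sent from my computer using Thunderbird and contains only the text "Sent from thunderbird on computer". The other was sent from my Android phone using the default mail app and contains only the text "This mail is sent from Android".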

library(mRpostman) # for email communication

# Connect to mail server
imap_mail <- 'imaps://imap.gmail.com' # IMAP server URL
user_mail <- keyring::key_get('dataviz-mail')
password_mail <- keyring::key_get('dataviz-mail-password')
# Establish connection to imap server
con <- configure_imap(
  url = imap_mail,
  user = user_mail,
  password = password_mail
)

# Switch to Inbox
con$select_folder('Inbox') 

# Fetch Thunderbird mail
con$fetch_text(11)
#> $text11
#> [1] "Sent from thunderbird on computer\r\n\r\n"

# Fetch Android mail
con$fetch_text(12)
#> $text12
#> [1] "----_com.samsung.android.email_7640956728775490\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Transfer-Encoding: base64\r\n\r\nVGhpcyBtYWlsIGlzIHNlbnQgZnJvbSBBbmRyb2lk\r\n\r\n----_com.samsung.android.email_7640956728775490\r\nContent-Type: text/html; charset=utf-8\r\nContent-Transfer-Encoding: base64\r\n\r\nPGh0bWw+PGhlYWQ+PG1ldGEgaHR0cC1lcXVpdj0iQ29udGVudC1UeXBlIiBjb250ZW50PSJ0ZXh0\r\nL2h0bWw7IGNoYXJzZXQ9VVRGLTgiPjwvaGVhZD48Ym9keSBkaXI9ImF1dG8iPlRoaXMgbWFpbCBp\r\ncyBzZW50IGZyb20gQW5kcm9pZDwvYm9keT48L2h0bWw+\r\n\r\n----_com.samsung.android.email_7640956728775490--\r\n\r\n"

Created on 2022-04-06 by the reprex package (v2.0.0)

Update

Allan Cameron's solution works, but it drops the line breaks.

library(tidyverse)
text_that_should_contain_line_breaks <- "----_com.samsung.android.email_6729645824359240\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Transfer-Encoding: base64\r\n\r\naHR0cHM6Ly90d2l0dGVyLmNvbS9jX2dlYmhhcmQvc3RhdHVzLzE1MTA4NjcwMDkxMTM5MjM1ODg/\r\ncz0yMCZ0PWR0X3dvVkV2a3dPSjBfRGZUc2ttZUFIYW5kZHJhd24gZm9udCBoZWFkaW5nVm9uIG1l\r\naW5lbS9tZWluZXIgR2FsYXh5IGdlc2VuZGV0\r\n\r\n----_com.samsung.android.email_6729645824359240\r\nContent-Type: text/html; charset=utf-8\r\nContent-Transfer-Encoding: base64\r\n\r\nPGh0bWw+PGhlYWQ+PG1ldGEgaHR0cC1lcXVpdj0iQ29udGVudC1UeXBlIiBjb250ZW50PSJ0ZXh0\r\nL2h0bWw7IGNoYXJzZXQ9VVRGLTgiPjwvaGVhZD48Ym9keSBkaXI9ImF1dG8iPmh0dHBzOi8vdHdp\r\ndHRlci5jb20vY19nZWJoYXJkL3N0YXR1cy8xNTEwODY3MDA5MTEzOTIzNTg4P3M9MjAmYW1wO3Q9\r\nZHRfd29WRXZrd09KMF9EZlRza21lQTxkaXYgZGlyPSJhdXRvIj48YnI+PC9kaXY+PGRpdiBkaXI9\r\nImF1dG8iPkhhbmRkcmF3biBmb250IGhlYWRpbmc8L2Rpdj48ZGl2IGRpcj0iYXV0byI+PGJyPjwv\r\nZGl2PjxkaXYgaWQ9ImNvbXBvc2VyX3NpZ25hdHVyZSIgZGlyPSJhdXRvIj48ZGl2IHN0eWxlPSJm\r\nb250LXNpemU6MTJweDtjb2xvcjojNTc1NzU3IiBkaXI9ImF1dG8iPlZvbiBtZWluZW0vbWVpbmVy\r\nIEdhbGF4eSBnZXNlbmRldDwvZGl2PjwvZGl2PjxkaXYgZGlyPSJhdXRvIj48YnI+PC9kaXY+PC9i\r\nb2R5PjwvaHRtbD4=\r\n\r\n----_com.samsung.android.email_6729645824359240--\r\n\r\n"
decoded <- text_that_should_contain_line_breaks %>% 
  str_match('base64\r\n\r\n([[:alpha:][:digit:]/\r\n]*)----') %>% 
  .[, 2] %>% 
  base64enc::base64decode() %>% 
  rawToChar()
decoded
#> [1] "https://twitter.com/c_gebhard/status/1510867009113923588?s=20&t=dt_woVEvkwOJ0_DfTskmeAHanddrawn font headingVon meinem/meiner Galaxy gesendet"

# But should be
cat("https://twitter.com/c_gebhard/status/1510867009113923588?s=20&t=dt_woVEvkwOJ0_DfTskmeA\nHanddrawn font heading\nVon meinem/meiner Galaxy gesendet")
#> https://twitter.com/c_gebhard/status/1510867009113923588?s=20&t=dt_woVEvkwOJ0_DfTskmeA
#> Handdrawn font heading
#> Von meinem/meiner Galaxy gesendet

Created on 2022-04-11 by the reprex package (v2.0.0)

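Answer

The Android string does contain a base64-encoded message, but it is embedded in other text that is not base64 encoded, so you have to extract it first.

If we take the string from your question: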

text12 <-  "----_com.samsung.android.email_7640956728775490\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Transfer-Encoding: base64\r\n\r\nVGhpcyBtYWlsIGlzIHNlbnQgZnJvbSBBbmRyb2lk\r\n\r\n----_com.samsung.android.email_7640956728775490\r\nContent-Type: text/html; charset=utf-8\r\nContent-Transfer-Encoding: base64\r\n\r\nPGh0bWw+PGhlYWQ+PG1ldGEgaHR0cC1lcXVpdj0iQ29udGVudC1UeXBlIiBjb250ZW50PSJ0ZXh0\r\nL2h0bWw7IGNoYXJzZXQ9VVRGLTgiPjwvaGVhZD48Ym9keSBkaXI9ImF1dG8iPlRoaXMgbWFpbCBp\r\ncyBzZW50IGZyb20gQW5kcm9pZDwvYm9keT48L2h0bWw+\r\n\r\n----_com.samsung.android.email_7640956728775490--\r\n\r\n"

Then we can split out the base64 string, decode it to bytes, and convert the bytes to characters like this:

library(dplyr)
library(purrr)
library(base64enc)

text12 %>%
  strsplit("base64\r\n\r\n") %>%
  pluck(1, 2) %>%
  strsplit("----") %>%
  pluck(1, 1) %>%
  gsub(pattern = "[\r\n]+", replacement = "", .) %>%
  base64decode() %>%
  rawToChar()
#> [1] "This mail is sent from Android"

Created on 2022-04-06 by the reprex package (v2.0.1)
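
The same idea can be generalised to decode every part of the body in one go. The following is only a minimal sketch and not part of mRpostman: `decode_mime_parts()` is a hypothetical helper, and it assumes that the body starts with the boundary line and that every part is base64 encoded with a UTF-8 charset, as is the case for `text12`. The sample `body` is rebuilt line by line from `text12` so that the block runs on its own.

```r
library(base64enc)

# Hypothetical helper (not part of mRpostman): decode every base64 part of a
# multipart body and return the results named by content type.
decode_mime_parts <- function(body) {
  # Assumption: the body starts with the boundary line, as in text12
  boundary <- strsplit(body, "\r\n", fixed = TRUE)[[1]][1]

  # Split on the boundary and keep only chunks that carry part headers
  chunks <- strsplit(body, boundary, fixed = TRUE)[[1]]
  chunks <- chunks[grepl("Content-Type:", chunks, fixed = TRUE)]

  decoded <- lapply(chunks, function(chunk) {
    # Headers and payload are separated by the first empty line
    split_at <- regexpr("\r\n\r\n", chunk, fixed = TRUE)
    payload <- substr(chunk, split_at + 4, nchar(chunk))

    # Drop the line breaks that wrap the base64 text, then decode
    txt <- rawToChar(base64decode(gsub("[\r\n]", "", payload)))
    Encoding(txt) <- "UTF-8" # assumption: charset=utf-8, as declared in text12
    txt
  })

  # Name each element by its content type, e.g. "text/plain"
  types <- vapply(chunks, function(chunk) {
    m <- regmatches(chunk, regexpr("Content-Type: *[^;\r\n]+", chunk))
    sub("Content-Type: *", "", m)
  }, character(1), USE.NAMES = FALSE)

  setNames(decoded, types)
}

# Same content as text12 from the question, written one line per element
body <- paste(c(
  "----_com.samsung.android.email_7640956728775490",
  "Content-Type: text/plain; charset=utf-8",
  "Content-Transfer-Encoding: base64",
  "",
  "VGhpcyBtYWlsIGlzIHNlbnQgZnJvbSBBbmRyb2lk",
  "",
  "----_com.samsung.android.email_7640956728775490",
  "Content-Type: text/html; charset=utf-8",
  "Content-Transfer-Encoding: base64",
  "",
  "PGh0bWw+PGhlYWQ+PG1ldGEgaHR0cC1lcXVpdj0iQ29udGVudC1UeXBlIiBjb250ZW50PSJ0ZXh0",
  "L2h0bWw7IGNoYXJzZXQ9VVRGLTgiPjwvaGVhZD48Ym9keSBkaXI9ImF1dG8iPlRoaXMgbWFpbCBp",
  "cyBzZW50IGZyb20gQW5kcm9pZDwvYm9keT48L2h0bWw+",
  "",
  "----_com.samsung.android.email_7640956728775490--",
  "",
  ""
), collapse = "\r\n")

parts <- decode_mime_parts(body)
parts[["text/plain"]]
#> [1] "This mail is sent from Android"
```

Splitting on the boundary instead of on the literal "base64" keeps the headers of each part available, so the plain-text and HTML versions can be told apart by name. A mail with a preamble before the first boundary, or with parts that use another transfer encoding such as quoted-printable, would need extra handling that this sketch does not attempt.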

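Update

The message appears to be stored twice: once as plain text and a second time as HTML-formatted text. The plain text contains no actual line breaks; the HTML version only has line breaks because of its <br> tags. The easiest way to get the text with its line breaks preserved is to parse the HTML.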

parsed_content <- text_that_should_contain_line_breaks %>%
  strsplit("base64\r\n\r\n") %>%
  pluck(1, 3) %>%
  strsplit("----") %>%
  pluck(1, 1) %>%
  base64decode() %>%
  rawToChar() %>%
  rvest::read_html() %>%
  rvest::html_text2()

This gives:

cat(parsed_content)
#> https://twitter.com/c_gebhard/status/1510867009113923588?s=20&t=dt_woVEvkwOJ0_DfTskmeA
#>
#>
#> Handdrawn font heading
#>
#>
#> Von meinem/meiner Galaxy gesendet
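
The parsed text has several blank lines between the paragraphs, whereas the output asked for in the question has single line breaks. If that matters, runs of newlines can be collapsed with base R. This is a small sketch: the value of `parsed_content` below is an assumption reconstructed from the printed output above, and `compact` is just an illustrative name.

```r
# Assumed value of parsed_content, reconstructed from the cat() output above
parsed_content <- paste0(
  "https://twitter.com/c_gebhard/status/1510867009113923588?s=20&t=dt_woVEvkwOJ0_DfTskmeA",
  "\n\n\nHanddrawn font heading",
  "\n\n\nVon meinem/meiner Galaxy gesendet"
)

# Collapse every run of newlines into one and trim leading/trailing whitespace
compact <- trimws(gsub("\n+", "\n", parsed_content))

cat(compact)
#> https://twitter.com/c_gebhard/status/1510867009113923588?s=20&t=dt_woVEvkwOJ0_DfTskmeA
#> Handdrawn font heading
#> Von meinem/meiner Galaxy gesendet
```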