I have a text file that has html code in it. This causes errors while importing

Question

I am trying to import a text file that contains html code. I am using read.table with a tilde (~) as the separator.

The text file looks like this:

id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>

<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>

The code I am using gets me close:

text <- read.table("filepath/text_file.txt",
                    quote = "\"",
                    sep = "~",
                    fill = TRUE,
                    header = TRUE,
                    comment.char = "",
                    stringsAsFactors = TRUE,
                    na.strings = "\n",
                    allowEscapes = FALSE)

I get:

id              title       content
Article-123     Title 1     <h2>Overview of Article 1</h2>
Article-456     Title 2     <h1>Problem:</h1><br>
<br>
Article-789     Title 3     <h1>This is the content of article 789 </h1>

It looks like the html introduces a line break when I import the file into R. Instead, I would like the import to look like this:

id              title       content
Article-123     Title 1     <h2>Overview of Article 1</h2>
Article-456     Title 2     <h1>Problem:</h1><br>
Article-789     Title 3     <h1>This is the content of article 789 </h1>

Answer 1

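Does this work for you? I am not sure how to make read.table treat some newlines differently from others (how would it know whether a newline means a new row?). Instead, we can try the following approach:

Read the data in as lines, so that each line of text is one element of a character vector.
Work out which lines belong to which row by looking for ~ characters, then collapse those lines into one string, re-inserting the newlines between them. This may be fragile if the HTML contains a ~ anywhere.
Use separate to split the newly tidied rows into three columns.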

library(tidyverse)
text <- "id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>

<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>"

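text_in <- read_lines(text) %>%
  tibble(line = .) %>%
  # a line containing "~" starts a new row; cumsum() numbers the rows
  mutate(row = str_detect(line, "~") %>% cumsum) %>%
  group_by(row) %>%
  # glue continuation lines back onto the row they belong to
  summarise(line = str_c(line, collapse = "\n")) %>%
  # extra = "merge" keeps any further "~" inside the content column
  separate(line, into = c("id", "title", "content"), sep = "~", extra = "merge") %>%
  # drop the header line, which was read as data
  slice(-1)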

text_in
#> # A tibble: 3 x 4
#>     row id        title   content                                          
#>   <int> <chr>     <chr>   <chr>                                            
#> 1     2 Article-… Title 1 "<h2>Overview of Article 1</h2>\n\n<p>This is th…
#> 2     3 Article-… Title 2 "<h1>Problem:</h1><br>\n<br>\nThis is the conten…
#> 3     4 Article-… Title 3 <h1>This is the content of article 789 </h1>

Created on 2019-04-17 by the reprex package (v0.2.1)
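
For readers without the tidyverse, the same three steps can be sketched in base R. This is only an illustration under the same assumption as above (a line starts a new row only if it contains a ~); the names raw, record, joined and articles are made up for this example and are not part of the original answer.

```r
# Sample input from the question, held in memory for illustration.
raw <- "id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>

<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>"

# Step 1: one element per physical line (blank lines are kept).
lines <- strsplit(raw, "\n", fixed = TRUE)[[1]]

# Step 2: assume a line starts a new row only if it contains "~",
# then glue each row's lines back together with newlines.
record <- cumsum(grepl("~", lines, fixed = TRUE))
joined <- vapply(split(lines, record), paste, character(1), collapse = "\n")

# Step 3: split on the first two "~" only; (?s) lets "." span newlines.
pattern <- "(?s)^([^~]*)~([^~]*)~(.*)$"
parts <- regmatches(joined, regexec(pattern, joined, perl = TRUE))
fields <- do.call(rbind, lapply(parts, `[`, 2:4))

# The first row is the header; the rest are the articles.
articles <- as.data.frame(fields[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(articles) <- fields[1, ]
rownames(articles) <- NULL

articles$id
```

Because only the first two ~ characters of a row are treated as separators, a ~ later on the first line of a row stays in content. A ~ on a continuation line would still be mistaken for the start of a new row, so the fragility noted above remains.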

Answer 2

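If you are using data.table, you could try this. My approach makes the following assumptions:

If a column ("title" or "content") is NA, then that line is a <br>, a comment, or a <p>
The text file will contain more lines like these

Given these assumptions, if you use library(readr), it creates a tibble, which you can convert back to a data.table with setDT while dropping any rows that contain NA. Note that this keeps only the first line of each content field.

Here is the code: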

text <- "id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>

<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>"

library(readr)
library(data.table)
test <- na.omit(setDT(read_delim(text, delim = "~")))

test


            id   title                                      content
1: Article-123 Title 1               <h2>Overview of Article 1</h2>
2: Article-456 Title 2                        <h1>Problem:</h1><br>
3: Article-789 Title 3 <h1>This is the content of article 789 </h1>
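
Before dropping rows with na.omit, it can help to check which lines of the file are actually incomplete. The following is a rough sketch using count.fields from base R (package utils); the variable names raw and n_fields are made up for this example, and quoting is switched off on the assumption that quote characters inside the HTML should not be interpreted.

```r
# Sample input from the question, held in memory for illustration.
raw <- "id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>

<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>"

# Feed the lines through a text connection so no file is needed.
con <- textConnection(strsplit(raw, "\n", fixed = TRUE)[[1]])

# Count the "~"-separated fields on each line; quote = "" and
# comment.char = "" stop quotes and "#" from being interpreted.
n_fields <- count.fields(con, sep = "~", quote = "", comment.char = "")
close(con)

# Lines with fewer than 3 fields are continuation fragments.
n_fields
```

Lines that report three fields are complete rows (including the header); anything with fewer is a fragment that ends up with NA in "title" or "content" and is therefore removed by na.omit.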

I am adding this because I like using data.table, so with fread you can also do the following:

library(data.table)
test <- na.omit(fread(text,header = TRUE, sep = "~", 
                      na.strings = "", fill = TRUE, 
                      blank.lines.skip = TRUE))


test
            id   title                                      content
1: Article-123 Title 1               <h2>Overview of Article 1</h2>
2: Article-456 Title 2                        <h1>Problem:</h1><br>
3: Article-789 Title 3 <h1>This is the content of article 789 </h1>
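
Both variants above discard the continuation lines. If the full content is needed, one possible data.table sketch is to group the raw lines first and split afterwards. It assumes that only lines starting a new row contain a ~ and that each row has exactly two of them; the names raw, lines, records and articles are hypothetical and chosen for this example.

```r
library(data.table)

# Sample input from the question, held in memory for illustration.
raw <- "id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>

<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>"

# One row per physical line of the input.
lines <- data.table(line = strsplit(raw, "\n", fixed = TRUE)[[1]])

# Assumption: a line containing "~" starts a new record.
lines[, record := cumsum(grepl("~", line, fixed = TRUE))]

# Collapse the lines of each record, keeping the embedded newlines.
records <- lines[, .(line = paste(line, collapse = "\n")), by = record]

# Split each record into three columns (assumes exactly two "~" per record).
records[, c("id", "title", "content") := tstrsplit(line, "~", fixed = TRUE)]

# Drop the header record, which was read as data.
articles <- records[-1, .(id, title, content)]

articles$id
```

Unlike the na.omit approach, this keeps the <br> and <p> fragments inside content, separated by newline characters.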
