I have a text file that has HTML code in it. This causes errors while importing
I am trying to import a text file that contains HTML code. I am using read.table to import it, with a tilde (~) as the delimiter.
The text file looks like this:
id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>
<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>
The code I'm using gets me close:
text <- read.table("filepath/text_file.txt",
                   quote = "\"",
                   sep = "~",
                   fill = TRUE,
                   header = TRUE,
                   comment.char = "",
                   stringsAsFactors = TRUE,
                   na.strings = "\n",
                   allowEscapes = FALSE)
I get:
id title content
Article-123 Title 1 <h2>Overview of Article 1</h2>
Article-456 Title 2 <h1>Problem:</h1><br>
<br>
Article-789 Title 3 <h1>This is the content of article 789 </h1>
When I import into R, the HTML appears to add a line break. Instead, I would like the import to look like this:
id title content
Article-123 Title 1 <h2>Overview of Article 1</h2>
Article-456 Title 2 <h1>Problem:</h1><br>
Article-789 Title 3 <h1>This is the content of article 789 </h1>
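See if this works for you. I'm not sure how to make read.table treat some newlines as row breaks and not others (how would it know whether a newline means a new row?). Instead, we can try the following approach:
- Read in the data as lines (so each line of text is one element of a character vector).
- Figure out which row each line belongs to by looking for ~ characters, then collapse the lines of each row, replacing the line breaks. This may be fragile if the HTML contains a ~ anywhere.
- Use separate to split the newly tidied rows into three columns.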
library(tidyverse)
text <- "id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>
<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>"
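text_in <- read_lines(text) %>%
  tibble(line = .) %>%
  # a line containing "~" starts a new row; cumsum() numbers the rows
  mutate(row = str_detect(line, "~") %>% cumsum) %>%
  group_by(row) %>%
  # glue the lines of each row back together
  summarise(line = str_c(line, collapse = "\n")) %>%
  separate(line, into = c("id", "title", "content"), sep = "~") %>%
  # drop the header row
  slice(-1)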
text_in
#> # A tibble: 3 x 4
#> row id title content
#> <int> <chr> <chr> <chr>
#> 1 2 Article-… Title 1 "<h2>Overview of Article 1</h2>\n\n<p>This is th…
#> 2 3 Article-… Title 2 "<h1>Problem:</h1><br>\n<br>\nThis is the conten…
#> 3 4 Article-… Title 3 <h1>This is the content of article 789 </h1>
Created on 2019-04-17 by the reprex package (v0.2.1)
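The same line-grouping idea can also be sketched in base R in a way that is less sensitive to a stray ~ inside the HTML. This is only a minimal sketch, not the code from the answer: it assumes that every record starts with an id of the form Article-..., and the names is_start, record, parse_record and articles are made up for illustration. Only the first line of each record is split on ~, and any further tildes are put back into content.

```r
# Sample input copied from the question
text <- "id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>
<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>"

# For a file on disk, readLines() on its path would give the same vector
lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
lines <- lines[-1]  # drop the header line

# ASSUMPTION: a new record begins only on lines that start with "Article-"
is_start <- grepl("^Article-[^~]*~", lines)
record <- cumsum(is_start)  # record number for every line

# Hypothetical helper: turn the lines of one record into a one-row data frame
parse_record <- function(rec_lines) {
  p <- strsplit(rec_lines[1], "~", fixed = TRUE)[[1]]
  # anything after the second "~" belongs to content, tildes included
  first_content <- paste(p[-(1:2)], collapse = "~")
  data.frame(
    id = p[1],
    title = p[2],
    content = paste(c(first_content, rec_lines[-1]), collapse = "\n"),
    stringsAsFactors = FALSE
  )
}

articles <- do.call(rbind, lapply(split(lines, record), parse_record))
rownames(articles) <- NULL
articles
```

If the ids in the real file follow a different pattern, the regular expression in is_start is the only part that needs to change.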
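If you are using data.tables, you can try this. My approach makes the following assumptions:
- If a column ("title" or "content") has NA, then the row is a <br>, a comment, or a <p>.
- The text file will contain more of these lines.
Given these assumptions, if you use library(readr), it will create a tibble that you can convert back to a data.table while removing any rows that contain NA. Note that this drops the continuation lines entirely instead of merging them into content, which matches the desired output shown in the question.
Here is the code: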
text <- "id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>
<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>"
library(readr)
library(data.table)
test <- na.omit(setDT(read_delim(text, delim = "~")))
test
id title content
1: Article-123 Title 1 <h2>Overview of Article 1</h2>
2: Article-456 Title 2 <h1>Problem:</h1><br>
3: Article-789 Title 3 <h1>This is the content of article 789 </h1>
I'm adding this because I like using data.tables, so with fread you can also do the following:
library(data.table)
test <- na.omit(fread(text, header = TRUE, sep = "~",
                      na.strings = "", fill = TRUE,
                      blank.lines.skip = TRUE))
test
id title content
1: Article-123 Title 1 <h2>Overview of Article 1</h2>
2: Article-456 Title 2 <h1>Problem:</h1><br>
3: Article-789 Title 3 <h1>This is the content of article 789 </h1>
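For comparison, roughly the same result can be sketched with read.table itself by filtering the lines before parsing, rather than parsing everything and dropping the NA rows afterwards. This is a sketch under the same assumption as above, namely that only lines containing a ~ start a record; the names keep and first_only are made up for illustration.

```r
# Sample input copied from the question
text <- "id~title~content
Article-123~Title 1~<h2>Overview of Article 1</h2>
<p>This is the content of article 123.</p>
Article-456~Title 2~<h1>Problem:</h1><br>
<br>
This is the content of article 456
Article-789~Title 3~<h1>This is the content of article 789 </h1>"

lines <- strsplit(text, "\n", fixed = TRUE)[[1]]

# ASSUMPTION: continuation lines never contain "~", so they can be discarded
keep <- grepl("~", lines, fixed = TRUE)

# quote = "" and comment.char = "" stop quotes and "#" inside the HTML
# from being interpreted by read.table
first_only <- read.table(text = lines[keep],
                         sep = "~",
                         header = TRUE,
                         quote = "",
                         comment.char = "",
                         stringsAsFactors = FALSE)
first_only
```

Like the na.omit versions, this keeps only the first line of each content field, so it suits the desired output in the question but not a case where the full HTML is needed.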