Cleaning web text using readLines and the tm-package in R

Question

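I am trying to remove regex code and numbers from a webpage that I read with the readLines function. I am using the unlist function to take care of some of it, but I am not sure how to remove the numbers. I was thinking of using the tm-package, but I seem to be missing a format conversion. How can I convert my webpage so that I can use tm to remove numbers and so on, or is there an easier way to remove the redundancy from the text? I want to read multiple webpages in tandem, so the text needs to be clean.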

 library(rvest)
 library(tm)
 webpage <- readLines("https://www.sciencedaily.com/releases/2020/02/200219113746.htm", 
             encoding = "UCS-2LE")
 dirtytext <- unlist(strsplit(webpage,"\r|\n|\t"))
 cleantext <- tm_map(dirtytext,removeNumbers)

The last line gives the error message:

'Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"'

Answer 1

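I'm not sure whether you want to include the lede, but the code below returns the story paragraph by paragraph (it drops all the non-story elements included in the text, such as ads).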

library(rvest)

url <- "https://www.sciencedaily.com/releases/2020/02/200219113746.htm"

page <- read_html(url)

story <- page %>%
  html_nodes("div#text p") %>%  # use "div#story_text p" to include lede
  html_text
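
As for the numbers: tm_map() fails in the question because it expects a corpus, not a character vector. The following is a minimal sketch of the conversion, assuming the text is a plain character vector like story above. The sample vector here is made up so that the snippet runs without a network connection, and the variable names are only illustrative. A base R alternative with gsub() is shown as well, in case tm is more than you need.

```r
library(tm)

# Made-up stand-in for `story`; any character vector should work the same way.
text <- c("In 2020, researchers studied 15 samples.",
          "No digits here.")

# tm route: wrap the character vector in a corpus before calling tm_map().
corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, removeNumbers)    # drop digits
corpus <- tm_map(corpus, stripWhitespace)  # collapse the gaps left behind

# Convert the corpus back to a plain character vector.
clean_tm <- unname(sapply(corpus, as.character))

# Base R route: the same idea with regular expressions only.
clean_base <- gsub("[0-9]+", "", text)
clean_base <- gsub("\\s+", " ", clean_base)
```

If removing digits is the only goal, the gsub() version is probably enough; the corpus form becomes more useful once further tm transformations are applied to several pages.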

