Removing dates and all junk from texts using R

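I am using R to clean a huge dataset made up of tens of thousands of texts. I know regular expressions would handle this nicely, but I am not good with them. I searched Stack Overflow but could not find a solution. Here is my dummy data: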

foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", 
              "04/02/2016 Health is a priority: WAI000553",
              "09/ 08/2016 Economy is bad: 2031CE8D", 
              ": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")

I want to remove all dates, punctuation, and IDs, and I would like my result to look like this:

[1] "Education is good"        
[2] "Health is a priority"     
[3] "Economy is bad"           
[4] "Vehicle license is needed"

Any help in R would be much appreciated.

Try this using stringr:

library(stringr)
library(magrittr)

# Remove slashes, all digits, and the literal ": WO" prefix of the IDs
str_remove_all(foo_data, "\\/|\\d+|\\: WO") %>% 
  str_squish()

#> [1] "Education is good"         "Health is a priority"     
#> [3] "Economy is bad"            "Vehicle license is needed"

Created on 2021-04-22 by the reprex package (v2.0.0)

Data

foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
              "09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
              "09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
# Keep only the text between the 4-digit year and the colon
gsub(".*\\d{4}[[:space:]]+(.*):.*", "\\1", foo_data)
#> [1] "Education is good"         "Health is a priority"     
#> [3] "Economy is bad"            "Vehicle license is needed"

I think some specificity is needed here:

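First, let's remove the date-like strings. I'll assume either mm/dd/yyyy or dd/mm/yyyy, where the first two fields can be 1-2 digits and the third is always 4 digits. If that varies, the regex can be made a little more permissive: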

foo_data2 <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}", "", foo_data)
foo_data2
# [1] " Education is good: WO0001982"        " Health is a priority: WO0002021"     " Economy is bad: WO001999"            " Vehicle license is needed: WO001050"
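
As one possible loosening, the question's original data contains a two-digit year ("21 / 05 / 13"). A minimal sketch that accepts either a 4-digit or a 2-digit year; `loose_data`, `date_rx`, and `no_dates` are names made up for this illustration, and the sample mixes one string from each data set:

```r
# Hypothetical sample mixing a 4-digit and a 2-digit year
loose_data <- c("03 / 05 / 2016 Education is good: WO0001982",
                ": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")

# Year is 4 digits or 2 digits; the 4-digit alternative is listed first
# so a year like 2016 is consumed whole
date_rx <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*(\\d{4}|\\d{2})"

no_dates <- gsub(date_rx, "", loose_data)
no_dates
```

The leading ": : " in the second string survives this step; it disappears once the colons are removed and the result is trimmed.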

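From here, the abbreviations (the IDs) seem easy to remove, as the other answers demonstrate. You have not specified whether the abbreviation is hard-coded as anything after the colon, a number prefixed with "WO", or just some word made of letters followed by digits. Those could be: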

gsub(":.*", "", foo_data2)
# [1] " Education is good"         " Health is a priority"      " Economy is bad"            " Vehicle license is needed"
gsub("\\bWO\\S*", "", foo_data2)
# [1] " Education is good: "         " Health is a priority: "      " Economy is bad: "            " Vehicle license is needed: "
gsub("\\b[A-Za-z]+\\d+\\b", "", foo_data2)
# [1] " Education is good: "         " Health is a priority: "      " Economy is bad: "            " Vehicle license is needed: "
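
The IDs in the question's original data are more varied than "WO" plus digits (for example "2031CE8D" starts with digits). A hedged sketch of a fourth option, assuming an ID is any run of uppercase letters and digits that contains at least one of each; `mixed_ids`, `id_rx`, and `no_ids` are illustrative names, and `perl = TRUE` is needed because the lookaheads are PCRE syntax:

```r
# Hypothetical date-free strings carrying the ID shapes from the question
mixed_ids <- c(" Economy is bad: 2031CE8D",
               " Vehicle license is needed: DPH2790 ",
               " Health is a priority: WAI000553")

# A whole word of uppercase letters/digits that has at least one digit
# and at least one uppercase letter (assumption about what an ID looks like)
id_rx <- "\\b(?=\\w*\\d)(?=\\w*[A-Z])[A-Z0-9]+\\b"

no_ids <- gsub(id_rx, "", mixed_ids, perl = TRUE)

# Drop the colons and trim, to show the cleaned text
trimws(gsub(":", "", no_ids, fixed = TRUE))
```

Ordinary words such as "Economy" are left alone because they contain no digit. Real text with tokens like "COVID19" would be removed too, so check the assumption against your data.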

Removing the : should be straightforward, and trimws(.) will remove leading/trailing whitespace.
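
A small sketch of that final tidy-up, assuming every colon in the text is junk; `with_colon` is typed in by hand here (it mirrors the output of the "WO" step) so the snippet runs on its own:

```r
# Strings as they look after the dates and IDs have been removed
with_colon <- c(" Education is good: ", " Health is a priority: ")

# fixed = TRUE treats ":" as a literal string rather than a regex
cleaned <- trimws(gsub(":", "", with_colon, fixed = TRUE))
cleaned
# "Education is good" "Health is a priority"
```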

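These can obviously be combined into a single regex (using the alternation operator | and pattern grouping) or into a single R call (nested gsub) without much complexity; I kept them separate for the sake of discussion.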
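
For illustration, here is roughly what the two combined forms could look like, assuming the "anything after the colon" interpretation of the ID. The names `nested` and `single` are made up for this sketch, and only two of the sample strings are repeated to keep it short:

```r
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
              "04/02/2016 Health is a priority: WO0002021")

date_rx <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}"

# Nested calls: strip the date, then everything from the colon on, then trim
nested <- trimws(gsub(":.*", "", gsub(date_rx, "", foo_data)))

# Single pattern: a date OR a colon followed by the rest of the string
single <- trimws(gsub(paste0("(", date_rx, ")|(:.*)"), "", foo_data))

identical(nested, single)  # TRUE
single
# "Education is good" "Health is a priority"
```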

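I think the linked page is generally a good reference for regex. Note that while that page shows many regex elements with single backslashes, R requires all of them to be doubled (e.g., \d in regex needs to be \\d in R). The exception is R 4's new raw strings, in which case these two are identical:

"\\b[A-Za-z]+\\d+\\b"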
r"(\b[A-Za-z]+\d+\b)"
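
A quick way to convince yourself that the two spellings denote the same pattern (this needs R 4.0.0 or later for the raw-string syntax; `escaped` and `raw_rx` are just illustrative names):

```r
escaped <- "\\b[A-Za-z]+\\d+\\b"   # classic string: every backslash doubled
raw_rx  <- r"(\b[A-Za-z]+\d+\b)"   # raw string: backslashes written once

identical(escaped, raw_rx)  # TRUE
nchar(escaped)              # 17, counting each backslash once
```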