Removing dates and all junk from texts using R
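I am using R to clean a huge dataset made up of tens of thousands of texts. I know regular expressions would do this job conveniently, but I am not good at using them. I have combed through Stack Overflow but could not find a solution. This is my dummy data: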
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
"04/02/2016 Health is a priority: WAI000553",
"09/ 08/2016 Economy is bad: 2031CE8D",
": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")
I want to remove all dates, punctuation, and IDs, and would like my result to look like this:
[1] "Education is good"
[2] "Health is a priority"
[3] "Economy is bad"
[4] "Vehicle license is needed"
Any help in R will be appreciated.
Try this using stringr:
library(stringr)
library(magrittr)
# Remove slashes, runs of digits, and the ": WO" prefix, then collapse whitespace
str_remove_all(foo_data, "/|\\d+|: WO") %>%
  str_squish()
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"
Created on 2021-04-22 by the reprex package (v2.0.0)
Data
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
"09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
"09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
# Keep only the text between the 4-digit year and the colon
gsub(".*\\d{4}[[:space:]]+(.*):.*", "\\1", foo_data)
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"
Created on 2021-04-22 by the reprex package (v2.0.0)
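I think some specificity is needed here:
First, let's remove the date-like strings. I'll assume either mm/dd/yyyy or dd/mm/yyyy, where the first two fields can be 1-2 digits and the third is always 4 digits. If this is variable, the regex can be changed to be a little more permissive: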
foo_data2 <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}", "", foo_data)
foo_data2
# [1] " Education is good: WO0001982" " Health is a priority: WO0002021" " Economy is bad: WO001999" " Vehicle license is needed: WO001050"
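If the year can also be written with two digits, as in the question's fourth string (`21 / 05 / 13`), one more permissive variant might look like the sketch below. The `\\d{2,4}` year quantifier and the names `x`, `date_pattern`, and `no_dates` are my assumptions for illustration, not part of the answer above.

```r
# Hypothetical sample mixing a 4-digit and a 2-digit year
x <- c("03 / 05 / 2016 Education is good: WO0001982",
       ": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")

# 1-2 digits for day and month, 2-4 digits for the year,
# with optional whitespace around each slash
date_pattern <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{2,4}"

no_dates <- gsub(date_pattern, "", x)
no_dates
```

The leftover colons and spaces in the second string are not touched here; they still need the punctuation and whitespace cleanup discussed afterwards.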
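From here, the abbreviations seem rather easy to remove, as the other answers have demonstrated. You have not specified whether the abbreviation is hard-coded to be anything after a colon, numbers prefixed with "WO", or just some one-word combination of letters and numbers. Those could be: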
gsub(":.*", "", foo_data2)
# [1] " Education is good" " Health is a priority" " Economy is bad" " Vehicle license is needed"
gsub("\\bWO\\S*", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
gsub("\\b[A-Za-z]+\\d+\\b", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
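The question's own data also contains IDs such as `WAI000553` and `2031CE8D`, which the letters-then-digits pattern would not fully cover (the second one starts with digits). A rough sketch for that case, assuming an ID is any single word that contains at least one digit and that the dates are already gone; the pattern and the names `x` and `no_ids` are my own:

```r
# Date-stripped strings modelled on the question's original data
x <- c(" Education is good: WO0001982",
       " Health is a priority: WAI000553",
       " Economy is bad: 2031CE8D")

# Remove any whole word that contains at least one digit
no_ids <- gsub("\\b\\w*\\d\\w*\\b", "", x, perl = TRUE)
no_ids
```

This would also delete ordinary words that happen to contain a digit, so it is only appropriate if the remaining text is known to be digit-free.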
Removing the : should be straightforward, and using trimws(.) will remove the leading/trailing spaces.
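As a minimal sketch of that last step, assuming the strings look like the output above once the date and ID are gone (the names `x` and `cleaned` are mine):

```r
# Strings as they look after the date and ID have been removed
x <- c(" Education is good: ", " Health is a priority: ")

# Drop every literal colon, then strip leading/trailing whitespace
cleaned <- trimws(gsub(":", "", x, fixed = TRUE))
cleaned
```

`fixed = TRUE` treats the pattern as a literal string, which is enough here because a single colon needs no regex features.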
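This can obviously be combined into a single regex (using the logical | with pattern grouping) or a single R call (nested gsub) without complication; I kept them broken apart for discussion.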
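Both combinations could be sketched roughly as follows, assuming the 4-digit-year date pattern and the "everything after the colon" rule for the ID. The variable names are mine, and the data is the "WO"-prefixed sample used in the answers.

```r
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
              "04/02/2016 Health is a priority: WO0002021",
              "09/ 08/2016 Economy is bad: WO001999",
              "09/08/ 2016 Vehicle license is needed: WO001050")

date_pattern <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}"

# Option 1: a single regex, joining the two patterns with |
single_regex <- trimws(gsub(paste0(date_pattern, "|:.*"), "", foo_data))

# Option 2: a single expression with nested gsub() calls
nested <- trimws(gsub(":.*", "", gsub(date_pattern, "", foo_data)))

identical(single_regex, nested)
```

Keeping the patterns in separate, named pieces tends to be easier to debug on real data than one long alternation.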
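I think the linked page is a good reference for regex in general. Note that while that page shows many regex things with single backslashes, R requires all of those to be double-backslashed (e.g., \d in regex needs to be \\d in R). The exception is if you use R-4's new raw strings, where these two are identical: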
"\\b[A-Za-z]+\\d+\\b"
r"(\b[A-Za-z]+\d+\b)"
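A quick way to check that equivalence yourself, assuming R 4.0.0 or later for the raw-string syntax (the names `escaped` and `raw_form` are mine):

```r
# Requires R >= 4.0.0 for the r"(...)" raw-string syntax
escaped  <- "\\b[A-Za-z]+\\d+\\b"   # regular string, backslashes doubled
raw_form <- r"(\b[A-Za-z]+\d+\b)"   # raw string, backslashes written once

# Both spellings produce the same character string
identical(escaped, raw_form)

# cat() shows the pattern as the regex engine receives it
cat(escaped, "\n")
```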