Remove Numbers, Punctuation, White Spaces before Tokenization

Question

I have the following data frame:

report <- data.frame(Text = c("unit 1 crosses the street", 
       "driver 2 was speeding and saw driver# 1", 
        "year 2019 was the year before the pandemic",
        "hey saw       hei hei in        the    wood",
        "hello: my kityy! you are the best"), id = 1:5)
report 
                                         Text id
1                   unit 1 crosses the street  1
2     driver 2 was speeding and saw driver# 1  2
3  year 2019 was the year before the pandemic  3
4 hey saw       hei hei in        the    wood  4
5           hello: my kityy! you are the best  5

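Based on earlier coding help, the stop words can be removed with the following code.

# Build one alternation of whole-word patterns: \\bi\\b|\\bme\\b|...
# The backslash must be doubled inside an R string to reach the regex engine.
report$Text <- gsub(paste0('\\b', tm::stopwords("english"), '\\b', 
                          collapse = '|'), '', report$Text)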
report
                                    Text id
1                 unit 1 crosses  street  1
2      driver 2  speeding  saw driver# 1  2
3            year 2019   year   pandemic  3
4 hey saw       hei hei             wood  4
5                 hello:  kityy!    best  5

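The data above still contains noise (numbers, punctuation, extra white space). I need to remove this noise before tokenization to get the data in the format below. In addition, I want to remove selected stop words of my own (for example, saw and kityy).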

                                    Text id
1                   unit crosses  street  1
2                driver speeding  driver  2
3                     year year pandemic  3
4                       hey hei hei wood  4
5                             hello best  5

Answer 1

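We can take the union of tm::stopwords("english") and the new entries, paste them together with collapse = "|", and replace the matches with "" in gsub. The innermost gsub removes punctuation and digits first, and the outermost gsub collapses runs of white space (\s+ matches one or more whitespace characters) into a single space before trimws strips the ends.

trimws(gsub("\\s+", " ", 
 gsub(paste0("\\b(", paste(union(c("saw", "kityy"), 
   tm::stopwords("english")), collapse="|"), ")\\b"), "", 
     gsub("[[:punct:]0-9]+", "", report$Text))
))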

-output

[1] "unit crosses street" 
[2] "driver speeding driver" 
[3] "year year pandemic"   
[4] "hey hei hei wood"   
[5] "hello best"
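
The nested call above can be hard to read, so here is a minimal sketch of the same three steps written one at a time. The helper name clean_text and the short base_stopwords vector are assumptions made for illustration only: the vector is a hand-written stand-in so the sketch runs without the tm package, and in practice it would be replaced by tm::stopwords("english").

```r
# Same data as in the question.
report <- data.frame(
  Text = c("unit 1 crosses the street",
           "driver 2 was speeding and saw driver# 1",
           "year 2019 was the year before the pandemic",
           "hey saw       hei hei in        the    wood",
           "hello: my kityy! you are the best"),
  id = 1:5
)

# ASSUMPTION: a small hand-written stop word list covering only the words
# in this sample; replace it with tm::stopwords("english") for real data.
base_stopwords  <- c("the", "was", "and", "before", "in", "my", "you", "are")
# Custom stop words requested in the question.
extra_stopwords <- c("saw", "kityy")

# Hypothetical helper (not part of any package): applies the three
# cleaning steps in the same order as the nested gsub call.
clean_text <- function(x, stop_words) {
  # Step 1: drop punctuation and digits.
  x <- gsub("[[:punct:]0-9]+", "", x)
  # Step 2: drop whole-word matches of any stop word.
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x)
  # Step 3: collapse runs of white space and trim both ends.
  trimws(gsub("\\s+", " ", x))
}

report$Text <- clean_text(report$Text, union(extra_stopwords, base_stopwords))
report$Text
```

Removing punctuation before the stop words matters for tokens such as "kityy!" only in the sense that it leaves clean word boundaries. Doing the white space collapse last means the gaps left behind by the earlier deletions are cleaned up in a single pass.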

r

text-mining

stop-words

tm

tidytext