Remove Words with Fewer than a Certain Number of Characters, plus Noise Reduction, before Tokenization

I have the following data frame:

report <- data.frame(Text = c("unit 1 crosses the street", 
       "driver 2 was speeding and saw driver# 1", 
        "year 2019 was the year before the pandemic",
        "hey saw       hei hei in        the    wood",
        "hello: my kityy! you are the best"), id = 1:5)
report 
                                         Text id
1                   unit 1 crosses the street  1
2     driver 2 was speeding and saw driver# 1  2
3  year 2019 was the year before the pandemic  3
4 hey saw       hei hei in        the    wood  4
5           hello: my kityy! you are the best  5

Based on earlier coding help, stop words can be removed with the following code:

report$Text <- gsub(paste0('\\b', tm::stopwords("english"), '\\b', 
                          collapse = '|'), '', report$Text)
report
                                    Text id
1                 unit 1 crosses  street  1
2      driver 2  speeding  saw driver# 1  2
3            year 2019   year   pandemic  3
4 hey saw       hei hei             wood  4
5                 hello:  kityy!    best  5

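I want to remove words shorter than a certain character length (for example, words with fewer than 4 characters, such as hei and hey). I also need to remove manual stop words (for example, saw and kityy) and common noise (extra whitespace, numbers, and punctuation) before tokenization. The final result would be: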

                                    Text id
1                   unit crosses  street  1
2                driver speeding  driver  2
3                     year year pandemic  3
4                                   wood  4
5                             hello best  5

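A similar question about noise and manual stop words has already been posted.

Building on the earlier code, it should work if we start by removing the words whose nchar is less than or equal to 3 (with gsubfn), then strip punctuation and digits, then drop the stop words, and finally collapse the leftover whitespace: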

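library(gsubfn)
# manual stop words combined with the standard English list
stop_words <- union(c("saw", "kityy"), tm::stopwords("english"))
trimws(gsub("\\s+", " ", 
  gsub(paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b"), "", 
    gsub("[[:punct:]0-9]+", "", 
      gsubfn("\\w+", function(x) if (nchar(x) > 3) x else "", report$Text)))))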

Output:

[1] "unit crosses street"    "driver speeding driver" 
[3] "year year pandemic"     "wood"                   "hello best"
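
For comparison, the same cleanup can probably be done in base R alone by splitting on whitespace and filtering tokens. This is only a sketch: `clean_text`, `min_chars`, and the short `stop_words` vector are hypothetical names and values chosen for illustration, not part of the answer above. The stop-word list is passed in explicitly so the snippet runs without tm; in practice one would likely pass `union(c("saw", "kityy"), tm::stopwords("english"))` instead.

```r
# Hypothetical helper (illustrative only): strip punctuation and digits,
# split on whitespace, then keep tokens that are long enough and are
# not stop words.
clean_text <- function(x, stop_words, min_chars = 4) {
  x <- gsub("[[:punct:][:digit:]]+", "", x)
  tokens <- strsplit(x, "\\s+")
  vapply(tokens, function(w) {
    keep <- w[nchar(w) >= min_chars & !(w %in% stop_words)]
    paste(keep, collapse = " ")
  }, character(1))
}

texts <- c("unit 1 crosses the street",
           "driver 2 was speeding and saw driver# 1",
           "year 2019 was the year before the pandemic",
           "hey saw       hei hei in        the    wood",
           "hello: my kityy! you are the best")

# Small stand-in list (an assumption) so the example has no package dependency
stop_words <- c("saw", "kityy", "before")

cleaned <- clean_text(texts, stop_words)
print(cleaned)
```

Note that this version removes punctuation before measuring word length, so a token such as "don't" becomes "dont" and is kept, whereas the gsubfn approach treats "don" and "t" as separate short words and drops both. Which behavior is preferable depends on the data.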