Remove Words with less than Certain Character Lengths plus Noise Reduction before Tokenization
I have the following data frame:
report <- data.frame(Text = c("unit 1 crosses the street",
"driver 2 was speeding and saw driver# 1",
"year 2019 was the year before the pandemic",
"hey saw hei hei in the wood",
"hello: my kityy! you are the best"), id = 1:5)
report
Text id
1 unit 1 crosses the street 1
2 driver 2 was speeding and saw driver# 1 2
3 year 2019 was the year before the pandemic 3
4 hey saw hei hei in the wood 4
5 hello: my kityy! you are the best 5
Based on earlier coding help, we can remove the stopwords with the following code.
report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b',
collapse = '|'), '', report$Text)
report
Text id
1 unit 1 crosses street 1
2 driver 2 speeding saw driver# 1 2
3 year 2019 year pandemic 3
4 hey saw hei hei wood 4
5 hello: kityy! best 5
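To make it clearer what that call builds, here is a minimal sketch using a three-word stand-in list instead of tm::stopwords("english"). The stand-in list and the variable names are assumptions for illustration only, and perl = TRUE is my choice rather than part of the original call. Note the double space left where the stopword used to be, which is why whitespace still has to be cleaned up afterwards.

```r
# Stand-in stopword list (an assumption; the real one is tm::stopwords("english")).
stop_words <- c("the", "was", "and")

# Wrap every stopword in word boundaries and join the pieces with "|".
pattern <- paste0("\\b", stop_words, "\\b", collapse = "|")
print(pattern)
# [1] "\\bthe\\b|\\bwas\\b|\\band\\b"

# Replace every whole-word match with the empty string.
result <- gsub(pattern, "", "unit 1 crosses the street", perl = TRUE)
print(result)
# [1] "unit 1 crosses  street"
```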
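I want to remove words shorter than a certain number of characters (for example, words with fewer than 4 characters, such as hei and hey). I also need to remove manual stopwords (for example saw and kityy) and common noise (whitespace, numbers, and punctuation) before tokenization. The final result would be: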
Text id
1 unit crosses street 1
2 driver speeding driver 2
3 year year pandemic 3
4 wood 4
5 hello best 5
A similar question about noise and manual stopwords has already been posted.
Using the earlier code, it should work if we start by removing the words whose nchar is less than or equal to 3 (with gsubfn):
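library(gsubfn)  # provides gsubfn()
# Inner to outer: drop words of 3 or fewer characters, strip punctuation and
# digits, drop manual + English stopwords, collapse whitespace, trim the ends.
trimws(gsub("\\s+", " ", gsub(paste0("\\b(", paste(union(c("saw", "kityy"),
tm::stopwords("english")), collapse="|"), ")\\b"), "",
gsub("[[:punct:]0-9]+", "",gsubfn("\\w+", function(x)
if(nchar(x) > 3) x else '', report$Text)))))
Output: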
[1] "unit crosses street" "driver speeding driver"
[3] "year year pandemic" "wood" "hello best"
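If you would rather not depend on gsubfn, the same steps can probably be done in base R alone. This is only a sketch under stated assumptions: clean_text, its arguments, and the short stop_words vector are hypothetical names, and the vector is a small stand-in for union(c("saw", "kityy"), tm::stopwords("english")). It strips punctuation and digits first, then drops short words with a bounded \w{1,3} pattern instead of a callback.

```r
# Hypothetical helper (not from the original answer): base R only.
clean_text <- function(x, stop_words, min_chars = 4) {
  # Strip punctuation and digits first, so "driver#" becomes "driver".
  x <- gsub("[[:punct:][:digit:]]+", "", x)
  # Drop whole words that have fewer than min_chars characters.
  short_pattern <- sprintf("\\b\\w{1,%d}\\b", min_chars - 1)
  x <- gsub(short_pattern, "", x, perl = TRUE)
  # Drop the stopwords, matched as whole words.
  stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, "", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends.
  trimws(gsub("\\s+", " ", x))
}

report <- data.frame(Text = c("unit 1 crosses the street",
                              "driver 2 was speeding and saw driver# 1",
                              "year 2019 was the year before the pandemic",
                              "hey saw hei hei in the wood",
                              "hello: my kityy! you are the best"), id = 1:5)

# Stand-in stopword list (an assumption); in practice pass
# union(c("saw", "kityy"), tm::stopwords("english")) instead.
stop_words <- c("saw", "kityy", "before")

cleaned <- clean_text(report$Text, stop_words)
print(cleaned)
# [1] "unit crosses street"    "driver speeding driver" "year year pandemic"
# [4] "wood"                   "hello best"
```

Keeping min_chars as an argument makes the length cutoff easy to change; it is assumed to be at least 2, since a value of 1 would produce an invalid {1,0} quantifier.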