Remove Numbers, Punctuation, and White Space before Tokenization
I have the following data frame:
report <- data.frame(Text = c("unit 1 crosses the street",
"driver 2 was speeding and saw driver# 1",
"year 2019 was the year before the pandemic",
"hey saw hei hei in the wood",
"hello: my kityy! you are the best"), id = 1:5)
report
Text id
1 unit 1 crosses the street 1
2 driver 2 was speeding and saw driver# 1 2
3 year 2019 was the year before the pandemic 3
4 hey saw hei hei in the wood 4
5 hello: my kityy! you are the best 5
Based on earlier coding help, I can remove stop words with the following code.
report$Text <- gsub(paste0('\\b', tm::stopwords("english"), '\\b',
                           collapse = '|'), '', report$Text)
report
Text id
1 unit 1 crosses street 1
2 driver 2 speeding saw driver# 1 2
3 year 2019 year pandemic 3
4 hey saw hei hei wood 4
5 hello: kityy! best 5
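The data above still contains noise (numbers, punctuation, extra white space). I need to remove this noise before tokenization to get the data in the following format. I would also like to remove selected additional stop words (e.g., saw and kityy).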
Text id
1 unit crosses street 1
2 driver speeding driver 2
3 year year pandemic 3
4 hey hei hei wood 4
5 hello best 5
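We can take the union of tm::stopwords("english") and the new entries, paste them together with collapse = "|", and replace the matches with "" in gsub. The same call chain also removes punctuation and digits and squeezes the leftover whitespace (\\s+ matches one or more whitespace characters).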
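# innermost gsub: drop punctuation and digits
# middle gsub: drop the stop words (tm list plus the custom entries)
# outer gsub + trimws: collapse repeated whitespace and trim the ends
trimws(gsub("\\s+", " ",
  gsub(paste0("\\b(", paste(union(c("saw", "kityy"),
    tm::stopwords("english")), collapse = "|"), ")\\b"), "",
    gsub("[[:punct:]0-9]+", "", report$Text))
))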
-output
[1] "unit crosses street"
[2] "driver speeding driver"
[3] "year year pandemic"
[4] "hey hei hei wood"
[5] "hello best"
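Since the question asks for the cleaned text back in the data frame, the same steps can be wrapped in a small function and assigned to the column. This is only a sketch: `clean_text` is a hypothetical helper name, and the short `stop_words` vector is a stand-in chosen so the example runs without the tm package installed. In practice it would presumably be `union(c("saw", "kityy"), tm::stopwords("english"))`.

```r
# Hypothetical helper (not from the original answer): removes punctuation,
# digits, and the supplied stop words, then squeezes leftover whitespace.
clean_text <- function(x, stop_words) {
  # One alternation pattern, anchored on word boundaries
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub("[[:punct:]0-9]+", "", x)  # drop punctuation and digits
  x <- gsub(pattern, "", x)            # drop stop words
  trimws(gsub("\\s+", " ", x))         # collapse and trim whitespace
}

# Assumption: a tiny stand-in list covering only the stop words that occur
# in this sample; swap in the tm list for real data.
stop_words <- c("saw", "kityy", "the", "was", "and", "before",
                "in", "my", "you", "are")

report <- data.frame(Text = c("unit 1 crosses the street",
                              "driver 2 was speeding and saw driver# 1",
                              "year 2019 was the year before the pandemic",
                              "hey saw hei hei in the wood",
                              "hello: my kityy! you are the best"),
                     id = 1:5)

# Overwrite the column so the id column stays aligned with the cleaned text
report$Text <- clean_text(report$Text, stop_words)
report
```

Removing punctuation and digits before the stop words matters here: it turns `kityy!` and `driver#` into plain words first, so the word-boundary pattern sees clean tokens.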