Remove digits glued to words for quanteda objects of class tokens
A related question can be found here, but it does not directly address the issue I discuss below.
My goal is to remove any digits that appear glued to tokens. For example, I want to be able to get rid of the numbers in cases like 13f, 408-k, 10-k, and so on. I am using quanteda as the main tool. I have a classic corpus object that I tokenized with the function tokens(). The argument remove_numbers = TRUE does not seem to work in these cases: it simply ignores such tokens and leaves them in place. If I use tokens_remove() with a specific regex, the tokens are removed altogether, which I want to avoid because I am interested in the remaining textual content.
Below is a minimal example of how I solved the problem with the function str_remove_all() from stringr. It works, but it can be very slow for large objects.
My question is: is there a way to achieve the same result without leaving quanteda (e.g., operating on objects of class tokens)?
library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
library(stringr)
mytext = c( "This is a sentence with correctly spaced digits like K 16.",
"This is a sentence with uncorrectly spaced digits like 123asd and well101.")
# Tokenizing
mytokens = tokens(mytext,
remove_punct = TRUE,
remove_numbers = TRUE )
mytokens
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "This" "is" "a" "sentence" "with" "correctly"
#> [7] "spaced" "digits" "like" "K"
#>
#> text2 :
#> [1] "This" "is" "a" "sentence" "with"
#> [6] "uncorrectly" "spaced" "digits" "like" "123asd"
#> [11] "and" "well101"
# the tokens "123asd" and "well101" are still there.
# I can be more specific using a regex but this removes the tokens altogether
#
mytokens_wrong = tokens_remove( mytokens, pattern = "[[:digit:]]", valuetype = "regex")
mytokens_wrong
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "This" "is" "a" "sentence" "with" "correctly"
#> [7] "spaced" "digits" "like" "K"
#>
#> text2 :
#> [1] "This" "is" "a" "sentence" "with"
#> [6] "uncorrectly" "spaced" "digits" "like" "and"
# This is the workaround which seems to be working but can be very slow.
# I am using stringr::str_remove_all() function
#
mytokens_ok = lapply( mytokens, function(x) str_remove_all( x, "[[:digit:]]" ) )
mytokens_ok
#> $text1
#> [1] "This" "is" "a" "sentence" "with" "correctly"
#> [7] "spaced" "digits" "like" "K"
#>
#> $text2
#> [1] "This" "is" "a" "sentence" "with"
#> [6] "uncorrectly" "spaced" "digits" "like" "asd"
#> [11] "and" "well"
Created on 2021-02-15 by the reprex package (v0.3.0)
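(Side note: the lapply() workaround above also drops the tokens class, so mytokens_ok is a plain list. A minimal sketch of converting back, assuming quanteda's documented as.tokens() method for lists of character vectors:
# hypothetical round trip: coerce the plain list back to a tokens object
mytokens_ok <- as.tokens(mytokens_ok)
This keeps the document names, but any docvars attached to the original tokens object would not be restored.)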
In this case you can (ab)use tokens_split(). You split the tokens on the digits, and by default tokens_split() removes the separator. That way you can do everything within quanteda.
library(quanteda)
mytext = c( "This is a sentence with correctly spaced digits like K 16.",
"This is a sentence with uncorrectly spaced digits like 123asd and well101.")
# Tokenizing
mytokens = tokens(mytext,
remove_punct = TRUE,
remove_numbers = TRUE)
tokens_split(mytokens, separator = "[[:digit:]]", valuetype = "regex")
Tokens consisting of 2 documents.
text1 :
[1] "This" "is" "a" "sentence" "with" "correctly" "spaced" "digits" "like"
[10] "K"
text2 :
[1] "This" "is" "a" "sentence" "with" "uncorrectly" "spaced" "digits"
[9] "like" "asd" "and" "well"
The other answer is a clever use of tokens_split(), but it will not always work if you want to remove digits from the middle of a word, since it splits an original word containing internal digits into two parts, as the sketch below illustrates.
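A minimal sketch of that failure mode, using the made-up token "well101done" (not part of the original example):
# splitting on digits breaks the word in two instead of
# producing the single token "welldone"
tokens_split(tokens("well101done"),
             separator = "[[:digit:]]", valuetype = "regex")
## expected:
## text1 :
## [1] "well" "done"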
Here is an efficient way to remove the numeric characters from the types (the unique tokens/words):
library("quanteda")
## Package version: 2.1.2
mytext <- c(
"This is a sentence with correctly spaced digits like K 16.",
"This is a sentence with uncorrectly spaced digits like 123asd and well101."
)
toks <- tokens(mytext, remove_punct = TRUE, remove_numbers = TRUE)
# get all types with digits
typesnum <- grep("[[:digit:]]", types(toks), value = TRUE)
typesnum
## [1] "123asd" "well101"
# replace the types with types without digits
tokens_replace(toks, typesnum, gsub("[[:digit:]]", "", typesnum))
## Tokens consisting of 2 documents.
## text1 :
## [1] "This" "is" "a" "sentence" "with" "correctly"
## [7] "spaced" "digits" "like" "K"
##
## text2 :
## [1] "This" "is" "a" "sentence" "with"
## [6] "uncorrectly" "spaced" "digits" "like" "asd"
## [11] "and" "well"
Note that in general I would recommend stringi for all regular-expression operations, but base package functions are used here for simplicity.
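For reference, a sketch of the same replacement step done with stringi instead of base gsub(), using stringi's documented stri_replace_all_regex():
library("stringi")
# equivalent of gsub("[[:digit:]]", "", typesnum)
tokens_replace(toks, typesnum, stri_replace_all_regex(typesnum, "[[:digit:]]", ""))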
Created on 2021-02-15 by the reprex package (v1.0.0)