Is there an algorithm for removing a dash ("-") between two words and then contracting them?
I have a text with many words where a hyphen falls at a line break, like this:
vec <- "Today is a good day because the sun is shin- ing."
What I want is:
"Today is a good day because the sun is shining."
But I don't want this for specific words only; I want it for all words that are "broken up" like this. It seems like something that should be possible in Word, but I couldn't figure out how to do it, so it may be more complicated.
For the record, I'm using the readtext/quanteda packages, but I haven't found anything in them that does this, at least not by default.
Is there a simple way to do this?
Here is one approach. We can use str_replace_all from the stringr package.
vec <- "Today is a good day because the sun is shin- ing."
library(stringr)
vec2 <- str_replace_all(vec, "-\\s+", "")
vec2
# [1] "Today is a good day because the sun is shining."
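If you would rather avoid the stringr dependency, base R's gsub can do the same thing; a minimal sketch using the same pattern:

```r
vec <- "Today is a good day because the sun is shin- ing."

# Remove a hyphen followed by any run of whitespace
# (covers a plain space or an actual line break)
vec2 <- gsub("-\\s+", "", vec)
vec2
# [1] "Today is a good day because the sun is shining."
```

Note that either pattern will also join a genuinely hyphenated pair such as "well- known" if the hyphen happens to precede whitespace, so check your input if that matters.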
Fixing this in the character input before creating the quanteda objects (corpus, tokens, etc.) is of course a fine solution. An alternative within quanteda is to tokenise the text with the trailing hyphens intact, and then:
- combine the hyphenated tokens with the tokens that follow them
- remove the internal hyphens from the newly formed tokens
Example:
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- c(
"The sun is shin- ing.",
"Hyphen- ation is fun",
"text an- alysis"
)
toks <- tokens(txt)
toks
## Tokens consisting of 3 documents.
## text1 :
## [1] "The" "sun" "is" "shin-" "ing" "."
##
## text2 :
## [1] "Hyphen-" "ation" "is" "fun"
##
## text3 :
## [1] "text" "an-" "alysis"
The compounding step:
toksc <- tokens_compound(toks, phrase("*- *"), concatenator = "")
toksc
## Tokens consisting of 3 documents.
## text1 :
## [1] "The" "sun" "is" "shin-ing" "."
##
## text2 :
## [1] "Hyphen-ation" "is" "fun"
##
## text3 :
## [1] "text" "an-alysis"
Finally, the replacement step to remove the hyphens:
toks_hyphenated <- grep("\\w+-\\w+", types(toksc), value = TRUE)
tokens_replace(toksc, toks_hyphenated, gsub("-", "", toks_hyphenated))
## Tokens consisting of 3 documents.
## text1 :
## [1] "The" "sun" "is" "shining" "."
##
## text2 :
## [1] "Hyphenation" "is" "fun"
##
## text3 :
## [1] "text" "analysis"
Edit: addition in response to the question

If you really want to rejoin these to make a corpus from the processed tokens, you can apply this step:
toks_rejoined <- tokens_replace(toksc, toks_hyphenated, gsub("-", "", toks_hyphenated))
corpus(sapply(toks_rejoined, paste, collapse = " "))
Corpus consisting of 3 documents.
text1 :
"The sun is shining ."
text2 :
"Hyphenation is fun"
text3 :
"text analysis"