Is there an algorithm for removing a dash ("-") between two words and then contracting them?
I have a text with many words where a hyphen falls at a line break, like this:
vec <- "Today is a good day because the sun is shin- ing."
What I want is:
"Today is a good day because the sun is shining."
But I don't want this for specific words only; I want it for all words that are "broken up" like this. It seems like something that should be possible in Word, but I couldn't figure out how to do it, so it may be more complicated.
For the record, I'm using the readtext/quanteda packages, but I haven't found anything in them that does this, at least not by default.
Is there a simple way to do this?
Here is one approach. We can use str_replace_all from the stringr package.
vec <- "Today is a good day because the sun is shin- ing."
library(stringr)
vec2 <- str_replace_all(vec, "-\\s+", "")
vec2
# [1] "Today is a good day because the sun is shining."
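If you would rather avoid the stringr dependency, base R's gsub can do the same thing; a minimal sketch using the same pattern:

```r
vec <- "Today is a good day because the sun is shin- ing."

# Remove a hyphen followed by any run of whitespace
# (covers a plain space or an actual line break)
vec2 <- gsub("-\\s+", "", vec)
vec2
# [1] "Today is a good day because the sun is shining."
```

Note that either pattern will also join a genuinely hyphenated pair such as "well- known" if the hyphen happens to precede whitespace, so check your input if that matters.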
Fixing this in the character input before creating the quanteda objects (corpus, tokens, etc.) is of course a fine solution. An alternative within quanteda is to tokenise the text with the trailing hyphens intact, and then:
- combine the hyphenated tokens with the tokens that follow them
- remove the internal hyphens from the newly formed tokens
Example:
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- c(
"The sun is shin- ing.",
"Hyphen- ation is fun",
"text an- alysis"
)
toks <- tokens(txt)
toks
## Tokens consisting of 3 documents.
## text1 :
## [1] "The" "sun" "is" "shin-" "ing" "."
##
## text2 :
## [1] "Hyphen-" "ation" "is" "fun"
##
## text3 :
## [1] "text" "an-" "alysis"
The compounding step:
toksc <- tokens_compound(toks, phrase("*- *"), concatenator = "")
toksc
## Tokens consisting of 3 documents.
## text1 :
## [1] "The" "sun" "is" "shin-ing" "."
##
## text2 :
## [1] "Hyphen-ation" "is" "fun"
##
## text3 :
## [1] "text" "an-alysis"
Finally, the replacement step to remove the hyphens:
toks_hyphenated <- grep("\\w+-\\w+", types(toksc), value = TRUE)
tokens_replace(toksc, toks_hyphenated, gsub("-", "", toks_hyphenated))
## Tokens consisting of 3 documents.
## text1 :
## [1] "The" "sun" "is" "shining" "."
##
## text2 :
## [1] "Hyphenation" "is" "fun"
##
## text3 :
## [1] "text" "analysis"
Edit: addition in response to the question

If you really want to rejoin these to make a corpus from the processed tokens, you can apply this step:
toks_rejoined <- tokens_replace(toksc, toks_hyphenated, gsub("-", "", toks_hyphenated))
corpus(sapply(toks_rejoined, paste, collapse = " "))
Corpus consisting of 3 documents.
text1 :
"The sun is shining ."
text2 :
"Hyphenation is fun"
text3 :
"text analysis"