Remove space after lemmatization

Question

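I lemmatized a character vector. The problem is that lemmatization inserts spaces around the hyphen in hyphenated words (for example, short-term becomes short - term). My character vector is full of such words, so I would like to find a way to undo this distortion.

Here is an example: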

text <- c("Whosebug is a great website where you can find great and very skilled people who are so kind to solve your coding problems. In the short-term is a very good thing because you can speed up your research, in the long-term is better if you learn how to code on your own. Let me add more non-sense to make my point. The growth-friendly composition of public finance is a good thing.")

library(textstem)
ch_vector <- lemmatize_strings(text)

As I said, the result looks like this:

"Whosebug be a great website where you can find great and very skill people who be so kind to solve your code problem. In the **short - term** be a very good thing because you can speed up your research, in the **long - term** be good if you learn how to code on your own. Let me add much **non - sense** to make my point. The **growth - friendly** composition of public finance be a good thing."

Instead, I would like this:

"Whosebug be a great website where you can find great and very skill people who be so kind to solve your code problem. In the **short-term** be a very good thing because you can speed up your research, in the **long-term** be good if you learn how to code on your own. Let me add much **non-sense** to make my point. The **growth-friendly** composition of public finance be a good thing."

So far I have been handling each word of interest like this:

ch <- sub(pattern = "growth - friendly", replacement = "growth-friendly", x = ch_vector, fixed = TRUE)

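But honestly, this is time-consuming and inefficient, and it does not always work (it depends on capitalization, etc.).

Can you recommend a better approach?

Thanks a lot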

Answer 1

x <- "Whosebug be a great website where you can find great and very skill people who be so kind to solve your code problem. In the **short - term** be a very good thing because you can speed up your research, in the **long - term** be good if you learn how to code on your own. Let me add much **non - sense** to make my point. The **growth - friendly** composition of public finance be a good thing."

Using gsub() to replace every dash surrounded by spaces with a bare dash seems to do what you want quite easily.

gsub(" - ", "-", x, fixed = TRUE)

# [1] "Whosebug be a great website where you can find great and very skill people
# who be so kind to solve your code problem. In the **short-term** be a very good thing
# because you can speed up your research, in the **long-term** be good if you learn how to
# code on your own. Let me add much **non-sense** to make my point. The 
# **growth-friendly** composition of public finance be a good thing."

However, I am not sure how this interacts with what the textstem package is designed to do, so it may or may not meet your needs.
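
If collapsing every " - " turns out to be too aggressive, one possible refinement is to rejoin the dash only when it sits directly between two letters. The sketch below is an assumption about what counts as a hyphenated word, not something textstem provides: `rejoin_hyphens` is a hypothetical helper name, and the sample strings are made up for illustration. It uses base R's gsub() with perl = TRUE so that lookarounds are available.

```r
# Hypothetical helper: collapse " - " into "-" only when a letter
# stands immediately before and after it. Lookarounds do not consume
# the neighbouring letters, so chains like "a - b - c" are handled too.
rejoin_hyphens <- function(x) {
  gsub("(?<=[[:alpha:]]) - (?=[[:alpha:]])", "-", x, perl = TRUE)
}

# Made-up sample input, shaped like the lemmatized output above.
lemmatized <- c(
  "In the short - term be a very good thing",
  "The growth - friendly composition of public finance",
  "a range (1 - 2) and a trailing dash - "
)

# gsub() is vectorized, so the whole character vector is fixed at once.
rejoin_hyphens(lemmatized)
```

With this pattern the first two strings get their hyphens back, while the dash between digits and the trailing dash in the third string are left alone. Note that it would still merge a genuine spaced dash between two words, so it is worth checking against your own text.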

