How can I replace words with their lemmas using spacyr?

Question

I have a data frame like this:

library(spacyr)
df <- data.frame(id = c(102), text = c("the boy's cars are different colors"), stringsAsFactors = FALSE)

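I can parse it, with POS tags and lemmas, like this (lemma = TRUE is required, because the next step uses the lemma column):

df2 <- spacy_parse(df$text, pos = TRUE, lemma = TRUE)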

and then collapse the result to one row per document:

df3 <- aggregate(lemma ~ doc_id, df2, paste, collapse = " ")

How can I keep the original id instead of doc_id?

The parse returns a doc_id, but I want to merge the id column of the input data frame with the parsed data.

Example of the expected output:

df <- data.frame(id = c(102), text = c("the boy's car be different color"),
                 stringsAsFactors = FALSE)

Answer 1

You can do it like this. I am using dplyr rather than aggregate(), and I have added a second document to your example.

df <- data.frame(
  id = c(102, 103),
  text = c(
    "the boy's cars are different colors",
    "The hare ran faster!"
  ),
  stringsAsFactors = FALSE)

library("spacyr")
library("dplyr", warn.conflicts = FALSE)

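# Name each text by its id so that spacy_parse() uses the ids as doc_id
spacy_parse(structure(df$text, names = df$id),
  lemma = TRUE, pos = FALSE) %>%
  mutate(id = doc_id) %>%
  group_by(id) %>%
  # Collapse the lemmas back into one string per document
  summarize(text = paste(lemma, collapse = " "))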
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.0, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   id    text                             
##   <chr> <chr>                            
## 1 102   the boy 's car be different color
## 2 103   the hare run fast !
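
If you would rather stay with aggregate() and base R, the same result can be reached by joining a small lookup table onto the parsed output. The following is only a sketch: `parsed` is a hand-written stand-in for what spacy_parse() returns (so the example runs without a spaCy installation), and it assumes the default doc_id values "text1", "text2", ... that spacyr assigns, in input order, to an unnamed character vector. The names `parsed`, `lookup`, `collapsed` and `result` are made up for illustration.

```r
df <- data.frame(
  id = c(102, 103),
  text = c("the boy's cars are different colors", "The hare ran faster!"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for spacy_parse(df$text, lemma = TRUE):
# one row per token, with the assumed default doc_id values.
parsed <- data.frame(
  doc_id = rep(c("text1", "text2"), times = c(7, 5)),
  lemma = c("the", "boy", "'s", "car", "be", "different", "color",
            "the", "hare", "run", "fast", "!"),
  stringsAsFactors = FALSE
)

# Lookup table linking each assumed doc_id to the original id, by position.
lookup <- data.frame(
  doc_id = paste0("text", seq_len(nrow(df))),
  id = df$id,
  stringsAsFactors = FALSE
)

# Collapse the lemmas to one string per document, then attach the original id.
collapsed <- aggregate(lemma ~ doc_id, parsed, paste, collapse = " ")
result <- merge(lookup, collapsed, by = "doc_id")[, c("id", "lemma")]
names(result) <- c("id", "text")
result
```

With real data, replace `parsed` by the actual spacy_parse() output. Compared with naming the input vector, the lookup table keeps id in its original type (numeric here) instead of converting it to character.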

Tags: r, quanteda