How can I replace words with their lemmas using spacyr?
I have a data frame like this:
library(spacyr)
df <- data.frame(id = c(102), text = c("the boy's cars are different colors"), stringsAsFactors = FALSE)
I can parse it like this (lemma = TRUE is needed so that the lemma column used in the next step exists):
df2 <- spacy_parse(df$text, pos = TRUE, lemma = TRUE)
and then collapse the result back to one row per document:
df3 <- aggregate(lemma ~ doc_id, df2, paste, collapse = " ")
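How can I keep my own id instead of doc_id?
The parse gives me a doc_id, but I want to carry the id column of the input data frame over to the parsed data.
Expected output: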
df <- data.frame(id = c(102), text = c("the boy's car be different color"),
stringsAsFactors = FALSE)
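You can do it like this. I'm using dplyr rather than aggregate(), and I've added a second document to your example. The trick is to pass the ids as the names of the text vector, because spacy_parse() uses those names as doc_id.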
df <- data.frame(
  id = c(102, 103),
  text = c(
    "the boy's cars are different colors",
    "The hare ran faster!"
  ),
  stringsAsFactors = FALSE)
library("spacyr")
library("dplyr", warn.conflicts = FALSE)
spacy_parse(structure(df$text, names = df$id),
            lemma = TRUE, pos = FALSE) %>%
  mutate(id = doc_id) %>%
  group_by(id) %>%
  summarize(text = paste(lemma, collapse = " "))
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.0, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   id    text
##   <chr> <chr>
## 1 102   the boy 's car be different color
## 2 103   the hare run fast !
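Note that id comes back as character in the output above, because it went through the vector names. If you would rather stay in base R and keep id in its original type, one option is to join through a lookup table. This is only a sketch: `parsed` is a hand-written stand-in for what spacy_parse() returns (so it runs without spaCy), `key` and `result` are names I made up, and it assumes spacyr labels unnamed input as "text1", "text2", ... in input order.

```r
# Hypothetical stand-in for spacy_parse(df$text, lemma = TRUE) output;
# the lemmas mirror the ones printed above.
parsed <- data.frame(
  doc_id = c(rep("text1", 7), rep("text2", 5)),
  lemma  = c("the", "boy", "'s", "car", "be", "different", "color",
             "the", "hare", "run", "fast", "!"),
  stringsAsFactors = FALSE
)

df <- data.frame(
  id = c(102, 103),
  text = c("the boy's cars are different colors", "The hare ran faster!"),
  stringsAsFactors = FALSE
)

# Lookup table from the assumed default doc_id values to the original ids.
key <- data.frame(
  doc_id = paste0("text", seq_len(nrow(df))),
  id = df$id,
  stringsAsFactors = FALSE
)

# Collapse tokens to one row per document, then attach the original id.
lemmas <- aggregate(lemma ~ doc_id, parsed, paste, collapse = " ")
result <- merge(key, lemmas, by = "doc_id")[, c("id", "lemma")]
names(result)[2] <- "text"
result
```

With this approach id stays numeric, at the cost of relying on the default doc_id naming; passing the ids as names, as in the dplyr version, avoids that assumption.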