How to use OpenNLP to get POS tags in R?
Here is the R code:
library(NLP)
library(openNLP)
tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  # Treat the whole input as a single sentence annotation.
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  # Keep only the word annotations and pull out their POS features.
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}
str <- "this is a the first sentence."
tagged_str <- tagPOS(str)
The output is:
tagged_str
$POStagged
[1] "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."
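Now I want to extract only the NN words from the sentence above, i.e. "sentence", and store them in a variable. Can anyone help me with this?
There may be a more elegant way to get the result, but this should work: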
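# Split the tagged string at "/NN"; the first piece ends with the tagged word.
q <- strsplit(unlist(tagged_str[1]), "/NN")
# Split that first piece on spaces and keep its last token.
q <- tail(strsplit(unlist(q[1]), " ")[[1]], 1)
q
#> [1] "sentence"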
Hope this helps.
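Note that the snippet above returns only the word in front of the first "/NN" match. As a minimal sketch in base R, assuming the "word/TAG" string format that tagPOS() produces, every word carrying one exact tag could be collected like this. The helper name words_with_tag is made up for illustration and is not part of NLP or openNLP.

```r
# Hypothetical helper (not part of NLP or openNLP): pull the words carrying
# one exact tag out of a "word/TAG word/TAG ..." string.
words_with_tag <- function(tagged, tag) {
  tokens <- strsplit(tagged, " ", fixed = TRUE)[[1]]
  # Split each token at its last "/" so words that contain "/" survive.
  words <- sub("/[^/]*$", "", tokens)
  tags  <- sub("^.*/", "", tokens)
  words[tags == tag]
}

tagged <- "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."
nn_words <- words_with_tag(tagged, "NN")
nn_words
#> [1] "sentence"
```

Because the comparison is exact, "NN" here does not pick up NNS, NNP, or NNPS tokens.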
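Here is a more general solution, where you describe the Treebank tags you want to extract with a regular expression. In your case, for example, "NN" returns all noun types (e.g. NN, NNS, NNP, NNPS), while "NN$" returns only NN.
It operates on a character type, so if your texts are in a list, you need to lapply() it, as in the example below.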
library(NLP)
library(openNLP)
txt <- c("This is a short tagging example, by John Doe.",
         "Too bad OpenNLP is so slow on large texts.")
extractPOS <- function(x, thisPOSregex) {
  x <- as.String(x)
  # Sentence and word tokenization first, then POS tagging on top of it.
  wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
  POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
  POSwords <- subset(POSAnnotation, type == "word")
  tags <- sapply(POSwords$features, '[[', "POS")
  # Keep only the words whose tag matches the regular expression.
  thisPOSindex <- grep(thisPOSregex, tags)
  tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex], tags[thisPOSindex])
  untokenizedAndTagged <- paste(tokenizedAndTagged, collapse = " ")
  untokenizedAndTagged
}
lapply(txt, extractPOS, "NN")
## [[1]]
## [1] "tagging/NN example/NN John/NNP Doe/NNP"
##
## [[2]]
## [1] "OpenNLP/NNP texts/NNS"
lapply(txt, extractPOS, "NN$")
## [[1]]
## [1] "tagging/NN example/NN"
##
## [[2]]
## [1] ""
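To see why the two patterns give different results, here is a small base-R sketch on a hand-written tag vector (the vector is illustrative, not tagger output). grep() matches anywhere in the string unless the pattern is anchored.

```r
# Illustrative Penn Treebank tags, typed by hand rather than produced by a tagger.
tags <- c("DT", "VBZ", "NN", "NNS", "NNP", "NNPS", "JJ")

# Unanchored: every tag that contains "NN".
all_nouns <- grep("NN", tags, value = TRUE)
all_nouns
#> [1] "NN"   "NNS"  "NNP"  "NNPS"

# "$" anchors the end of the string: only tags that end in "NN".
ends_in_nn <- grep("NN$", tags, value = TRUE)
ends_in_nn
#> [1] "NN"

# "^NNP" anchors the start: proper nouns only.
proper_nouns <- grep("^NNP", tags, value = TRUE)
proper_nouns
#> [1] "NNP"  "NNPS"
```

If an exact match is wanted regardless of the tag set, anchoring both ends ("^NN$") is the safer pattern.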
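Here is another answer that uses the spaCy parser and tagger, from Python, and the spacyr package to call it.
This library is orders of magnitude faster than the Stanford NLP models, and almost as good. It is still incomplete for some languages, but for English it is a very good and promising option.
You first need to have Python installed, along with spaCy and a language module. Instructions are available from the spaCy page and here.
Then: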
txt <- c("This is a short tagging example, by John Doe.",
"Too bad OpenNLP is so slow on large texts.")
require(spacyr)
## Loading required package: spacyr
spacy_initialize()
## Finding a python executable with spacy installed...
## spaCy (language model: en) is installed in /usr/local/bin/python
## successfully initialized (spaCy Version: 1.8.2, language model: en)
spacy_parse(txt, pos = TRUE, tag = TRUE)
##    doc_id sentence_id token_id   token   lemma   pos tag   entity
##  1  text1           1        1    This    this   DET  DT
##  2  text1           1        2      is      be  VERB VBZ
##  3  text1           1        3       a       a   DET  DT
##  4  text1           1        4   short   short   ADJ  JJ
##  5  text1           1        5 tagging tagging  NOUN  NN
##  6  text1           1        6 example example  NOUN  NN
##  7  text1           1        7       ,       , PUNCT   ,
##  8  text1           1        8      by      by   ADP  IN
##  9  text1           1        9    John    john PROPN NNP PERSON_B
## 10  text1           1       10     Doe     doe PROPN NNP PERSON_I
## 11  text1           1       11       .       . PUNCT   .
## 12  text2           1        1     Too     too   ADV  RB
## 13  text2           1        2     bad     bad   ADJ  JJ
## 14  text2           1        3 OpenNLP opennlp PROPN NNP
## 15  text2           1        4      is      be  VERB VBZ
## 16  text2           1        5      so      so   ADV  RB
## 17  text2           1        6    slow    slow   ADJ  JJ
## 18  text2           1        7      on      on   ADP  IN
## 19  text2           1        8   large   large   ADJ  JJ
## 20  text2           1        9   texts    text  NOUN NNS
## 21  text2           1       10       .       . PUNCT   .
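Since the result is printed as a data frame, the nouns can then be picked out with ordinary subsetting. A sketch, using a hand-built data frame with a few of the rows and columns shown above in place of a live spaCy call (the object name parsed is an assumption; with spacyr it would be the value returned by spacy_parse()):

```r
# Stand-in for the spacy_parse() result: a few rows copied from the output
# above, so the example runs without Python or spaCy installed.
parsed <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text2"),
  token  = c("tagging", "example", "John", "OpenNLP", "texts"),
  tag    = c("NN", "NN", "NNP", "NNP", "NNS"),
  stringsAsFactors = FALSE
)

# Tokens tagged exactly NN.
nn_tokens <- parsed$token[parsed$tag == "NN"]
nn_tokens
#> [1] "tagging" "example"

# All noun tags (NN, NNS, NNP, NNPS), grouped by document.
is_noun <- grepl("^NN", parsed$tag)
nouns_by_doc <- split(parsed$token[is_noun], parsed$doc_id[is_noun])
```

Here nouns_by_doc is a named list with one character vector per document, which mirrors the lapply() output of the OpenNLP answer.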