Count POS Tags by column
I am trying to count all the POS tags per row and sum them up.
So far I have two outputs:
1) The/DT question/NN was/VBD ,/, what/WP are/VBP you/PRP going/VBG to/TO cut/VB ?/.
2) c("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", ".")
For this particular example, the desired output is:
DT NN VBD WP VBP PRP VBG TO VB
1 doc 1 1 1 1 1 1 1 1 1
But since I want to build this for an entire column of a data frame, I also want to see 0 values in the columns for POS tags that do not occur in a given sentence.
Example:
1 doc = "The/DT question/NN was/VBD ,/, what/WP are/VBP you/PRP going/VBG to/TO cut/VB ?/"
2 doc = "Response/NN ?/."
Output:
DT NN VBD WP VBP PRP VBG TO VB
1 doc 1 1 1 1 1 1 1 1 1
2 doc 0 1 0 0 0 0 0 0 0
What I have done so far:
library(stringr)
# Splitting into sentences based on carriage returns
s <- unlist(lapply(df$sentence, function(x) { str_split(x, "\n") }))

library(NLP)
library(openNLP)
tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}
result <- lapply(s, tagPOS)
result <- as.data.frame(do.call(rbind, result))
This is how I get the outputs described at the beginning.
I have tried counting the occurrences like this:
occurrences <- as.data.frame(table(unlist(result$POStags)))
But that counts the occurrences across the whole data frame. I need to add new columns to the existing data frame and count the occurrences per row of the first column.
Can anyone help me? :(
Relatively painless using tm:
Dummy data
require(tm)
df <- data.frame(ID = c("doc1", "doc2"),
                 tags = c(paste("NN"),
                          paste("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", ".")))
Make a corpus and a DocumentTermMatrix:
corpus <- Corpus(VectorSource(df$tags))
# The default minimum word length is 3, so make sure you change this
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))
# See what you've done
inspect(dtm)
<<DocumentTermMatrix (documents: 2, terms: 9)>>
Non-/sparse entries: 10/8
Sparsity : 44%
Maximal term length: 3
Weighting : term frequency (tf)
Sample :
Terms
Docs dt nn prp to vb vbd vbg vbp wp
1 0 1 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1
eta: if you don't like working with a dtm, you can coerce it into a data frame:
as.data.frame(as.matrix(dtm))
nn dt prp to vb vbd vbg vbp wp
1 1 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1
eta2: Corpus creates a corpus from just the df$tags column, and VectorSource assumes each row of the data is a document, so the order of the rows in df and the order of the documents in the DocumentTermMatrix are the same: I can cbind df$ID onto the output data frame. I do this with dplyr because I think it produces the most readable code (read %>% as "and then"):
require(dplyr)
result <- as.data.frame(as.matrix(dtm)) %>%
  bind_cols(df["ID"])
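For completeness, the same zero-filled per-row counts can be produced without tm, by tabulating each document's tag vector against a fixed set of factor levels. This is a minimal base-R sketch; it assumes the `result$POStags` list column from the question's code, and `tag_levels` is just the illustrative tag set from the example above:

```r
# Tags we want as columns (the set from the example above)
tag_levels <- c("DT", "NN", "VBD", "WP", "VBP", "PRP", "VBG", "TO", "VB")

# result$POStags holds one character vector of tags per document.
# factor(..., levels = tag_levels) forces a 0 count for tags that never occur.
counts <- t(sapply(result$POStags, function(tags) {
  table(factor(unlist(tags), levels = tag_levels))
}))
counts <- data.frame(ID = df$ID, counts)
```

The key trick is `factor(..., levels = ...)`: `table()` on a plain character vector only reports tags that appear, while a factor with explicit levels makes every tag a column, used or not.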