Add ID column to R CoreNLP package tokenizer output using lapply
I have working code that gets tokenizer output from CoreNLP using lapply and do.call. If possible, I need help with two things:
- Adding the document ID inside the applied function itself (the current code does not add this column)
- Performing the do.call step inside the applied function itself (if possible)
There is a post, parallel parLapply setup, that uses the "lapply" function, but it only works on a vector of text and does not account for an ID column.
Code:
#Fake data - Quotes from Great Expectations by Charles Dickens
textcolumn <- c("The broken heart. You think you will die, but you just keep living, day after day after terrible day.",
                "We need never be ashamed of our tears.")
DocId <- c(1:length(textcolumn))
options(java.parameters = "-Xmx2g")
library(coreNLP)
#initCoreNLP() # change this if downloaded to non-standard location
initCoreNLP(annotators = "tokenize,ssplit,pos")
# Function to tokenize
tokenize <- function(textcolumn) {
  tmp <- annotateString(textcolumn)
  tokens <- getToken(tmp)
  colnames(tokens) <- tolower(colnames(tokens))
  tokens[, c("sentence", "id", "token", "pos")]
}
result <- lapply(textcolumn, tokenize)
final <- do.call(rbind, result)
Output:
> final
sentence id token pos
1 1 1 The DT
2 1 2 broken JJ
3 1 3 heart NN
4 1 4 . .
5 2 1 You PRP
6 2 2 think VBP
7 2 3 you PRP
8 2 4 will MD
9 2 5 die VB
10 2 6 , ,
11 2 7 but CC
12 2 8 you PRP
13 2 9 just RB
14 2 10 keep VBP
15 2 11 living NN
16 2 12 , ,
17 2 13 day NN
18 2 14 after IN
19 2 15 day NN
20 2 16 after IN
21 2 17 terrible JJ
22 2 18 day NN
23 2 19 . .
24 1 1 We PRP
25 1 2 need VBP
26 1 3 never RB
27 1 4 be VB
28 1 5 ashamed JJ
29 1 6 of IN
30 1 7 our PRP$
31 1 8 tears NNS
32 1 9 . .
I figured out how to add the document ID to CoreNLP's output inside the tokenizer function. Because lapply iterates over only one vector, I had to switch to mapply to pass both the text and the ID. I also had to wrap the function's output in a list so that mapply returns a list of data frames that do.call(rbind, ...) can combine.
Code:
#Fake data - Quotes from Great Expectations by Charles Dickens
textcolumn <- c("The broken heart. You think you will die, but you just keep living, day after day after terrible day.",
                "We need never be ashamed of our tears.")
DocId <- c(1:length(textcolumn))
options(java.parameters = "-Xmx2g")
library(coreNLP)
#initCoreNLP() # change this if downloaded to non-standard location
initCoreNLP(annotators = "tokenize,ssplit,pos")
# Function to tokenize
tokenize <- function(textcolumn, DocId) {
  tmp <- annotateString(textcolumn)
  tokens <- getToken(tmp)
  colnames(tokens) <- tolower(colnames(tokens))
  tokens <- tokens[, c("sentence", "id", "token", "pos")] # keep only the needed columns
  colnames(tokens) <- c("sentence", "tokenid", "token", "pos")
  DocId <- rep(DocId, nrow(tokens))
  docidtokens <- cbind(DocId, tokens)
  list(docidtokens) # wrap in a list so mapply returns a list of data frames
}
result <- mapply(tokenize, textcolumn, DocId)
final <- do.call(rbind, result)
Output:
> print(final, row.names = FALSE)
DocId sentence tokenid token pos
1 1 1 The DT
1 1 2 broken JJ
1 1 3 heart NN
1 1 4 . .
1 2 1 You PRP
1 2 2 think VBP
1 2 3 you PRP
1 2 4 will MD
1 2 5 die VB
1 2 6 , ,
1 2 7 but CC
1 2 8 you PRP
1 2 9 just RB
1 2 10 keep VBP
1 2 11 living NN
1 2 12 , ,
1 2 13 day NN
1 2 14 after IN
1 2 15 day NN
1 2 16 after IN
1 2 17 terrible JJ
1 2 18 day NN
1 2 19 . .
2 1 1 We PRP
2 1 2 need VBP
2 1 3 never RB
2 1 4 be VB
2 1 5 ashamed JJ
2 1 6 of IN
2 1 7 our PRP$
2 1 8 tears NNS
2 1 9 . .
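As a side note, the list() wrapping can be avoided. A minimal sketch, assuming the same coreNLP session as above (tokenize2 is just an illustrative name): Map, which behaves like mapply with SIMPLIFY = FALSE, always returns a list, so the function can return its data frame directly.
# Sketch: return the data frame directly and let Map build the list
tokenize2 <- function(textcolumn, DocId) {
  tmp <- annotateString(textcolumn)
  tokens <- getToken(tmp)
  colnames(tokens) <- tolower(colnames(tokens))
  tokens <- tokens[, c("sentence", "id", "token", "pos")]
  colnames(tokens) <- c("sentence", "tokenid", "token", "pos")
  cbind(DocId = rep(DocId, nrow(tokens)), tokens)
}
result <- Map(tokenize2, textcolumn, DocId) # already a list of data frames
final <- do.call(rbind, result)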
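And for the second request, the do.call step can also be folded into a single wrapper so the caller gets one combined data frame back. A minimal sketch, assuming the tokenize2 helper above (tokenizeAll is a hypothetical name, not part of coreNLP):
# Sketch: hypothetical wrapper that tokenizes each document and rbinds
# the per-document frames internally
tokenizeAll <- function(textcolumn, DocId) {
  pieces <- Map(tokenize2, textcolumn, DocId)
  do.call(rbind, pieces) # combine inside the function
}
final <- tokenizeAll(textcolumn, DocId)
This keeps the apply step and the rbind hidden inside one call, leaving a single line at the top level.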