Add ID column to R CoreNLP package tokenizer output using lapply
I have working code that gets tokenizer output from CoreNLP using lapply and do.call. If possible, I need help with two things:
- Adding the document ID inside the applied function itself (the current code does not add this column)
- Performing the do.call step inside the applied function itself (if possible)
There is a post, parallel parLapply setup, that uses the "lapply" function, but it only works on a vector of text and does not account for an ID column.
Code:
#Fake data - Quotes from Great Expectations by Charles Dickens
textcolumn <- c("The broken heart. You think you will die, but you just keep living, day after day after terrible day.",
                "We need never be ashamed of our tears.")
DocId <- c(1:length(textcolumn))
options(java.parameters = "-Xmx2g")
library(coreNLP)
#initCoreNLP() # change this if downloaded to non-standard location
initCoreNLP(annotators = "tokenize,ssplit,pos")
# Function to tokenize
tokenize <- function(textcolumn) {
  tmp <- annotateString(textcolumn)
  tokens <- getToken(tmp)
  colnames(tokens) <- tolower(colnames(tokens))
  tokens[, c("sentence", "id", "token", "pos")]
}
result <- lapply(textcolumn, tokenize)
final <- do.call(rbind, result)
Output:
> final
sentence id token pos
1 1 1 The DT
2 1 2 broken JJ
3 1 3 heart NN
4 1 4 . .
5 2 1 You PRP
6 2 2 think VBP
7 2 3 you PRP
8 2 4 will MD
9 2 5 die VB
10 2 6 , ,
11 2 7 but CC
12 2 8 you PRP
13 2 9 just RB
14 2 10 keep VBP
15 2 11 living NN
16 2 12 , ,
17 2 13 day NN
18 2 14 after IN
19 2 15 day NN
20 2 16 after IN
21 2 17 terrible JJ
22 2 18 day NN
23 2 19 . .
24 1 1 We PRP
25 1 2 need VBP
26 1 3 never RB
27 1 4 be VB
28 1 5 ashamed JJ
29 1 6 of IN
30 1 7 our PRP$
31 1 8 tears NNS
32 1 9 . .
I figured out how to add the document ID to CoreNLP's output inside the tokenizer function. Because lapply iterates over only one vector, I had to switch to mapply to pass both the text and the ID. I also had to wrap the function's output in a list so that mapply returns a list of data frames that do.call(rbind, ...) can combine.
Code:
#Fake data - Quotes from Great Expectations by Charles Dickens
textcolumn <- c("The broken heart. You think you will die, but you just keep living, day after day after terrible day.",
                "We need never be ashamed of our tears.")
DocId <- c(1:length(textcolumn))
options(java.parameters = "-Xmx2g")
library(coreNLP)
#initCoreNLP() # change this if downloaded to non-standard location
initCoreNLP(annotators = "tokenize,ssplit,pos")
# Function to tokenize
tokenize <- function(textcolumn, DocId) {
  tmp <- annotateString(textcolumn)
  tokens <- getToken(tmp)
  colnames(tokens) <- tolower(colnames(tokens))
  tokens <- tokens[, c("sentence", "id", "token", "pos")] # keep only the needed columns
  colnames(tokens) <- c("sentence", "tokenid", "token", "pos")
  DocId <- rep(DocId, nrow(tokens))
  docidtokens <- cbind(DocId, tokens)
  list(docidtokens) # wrap in a list so mapply returns a list of data frames
}
result <- mapply(tokenize, textcolumn, DocId)
final <- do.call(rbind, result)
Output:
> print(final, row.names = FALSE)
DocId sentence tokenid token pos
1 1 1 The DT
1 1 2 broken JJ
1 1 3 heart NN
1 1 4 . .
1 2 1 You PRP
1 2 2 think VBP
1 2 3 you PRP
1 2 4 will MD
1 2 5 die VB
1 2 6 , ,
1 2 7 but CC
1 2 8 you PRP
1 2 9 just RB
1 2 10 keep VBP
1 2 11 living NN
1 2 12 , ,
1 2 13 day NN
1 2 14 after IN
1 2 15 day NN
1 2 16 after IN
1 2 17 terrible JJ
1 2 18 day NN
1 2 19 . .
2 1 1 We PRP
2 1 2 need VBP
2 1 3 never RB
2 1 4 be VB
2 1 5 ashamed JJ
2 1 6 of IN
2 1 7 our PRP$
2 1 8 tears NNS
2 1 9 . .
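As a side note, the list() wrapping can be avoided. A minimal sketch, assuming the same coreNLP session as above (tokenize2 is just an illustrative name): Map, which behaves like mapply with SIMPLIFY = FALSE, always returns a list, so the function can return its data frame directly.
# Sketch: return the data frame directly and let Map build the list
tokenize2 <- function(textcolumn, DocId) {
  tmp <- annotateString(textcolumn)
  tokens <- getToken(tmp)
  colnames(tokens) <- tolower(colnames(tokens))
  tokens <- tokens[, c("sentence", "id", "token", "pos")]
  colnames(tokens) <- c("sentence", "tokenid", "token", "pos")
  cbind(DocId = rep(DocId, nrow(tokens)), tokens)
}
result <- Map(tokenize2, textcolumn, DocId) # already a list of data frames
final <- do.call(rbind, result)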
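And for the second request, the do.call step can also be folded into a single wrapper so the caller gets one combined data frame back. A minimal sketch, assuming the tokenize2 helper above (tokenizeAll is a hypothetical name, not part of coreNLP):
# Sketch: hypothetical wrapper that tokenizes each document and rbinds
# the per-document frames internally
tokenizeAll <- function(textcolumn, DocId) {
  pieces <- Map(tokenize2, textcolumn, DocId)
  do.call(rbind, pieces) # combine inside the function
}
final <- tokenizeAll(textcolumn, DocId)
This keeps the apply step and the rbind hidden inside one call, leaving a single line at the top level.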