Search for series of ordered tokens in sentences represented as a dataframe of individual tokens

I am trying to learn more about corpora and word analysis in R. Recently I started using cleanNLP with the spaCy backend. The problem is that, after parsing the text, I want to see whether a sentence has tokens marking certain relations.

Suppose we have:

library(cleanNLP)
library(tidyverse)

cnlp_init_spacy()   # initialize the spaCy backend before annotating
text <- cnlp_annotate("I gave money to him")

The result would be:

 doc_id   sid   tid token token_with_ws lemma  upos  xpos  tid_source relation
   <int> <int> <int> <chr> <chr>         <chr>  <chr> <chr>      <int> <chr>   
1      1     1     1 I     "I "          -PRON- PRON  PRP            2 nsubj   
2      1     1     2 gave  "gave "       give   VERB  VBD            0 root    
3      1     1     3 money "money "      money  NOUN  NN             2 dobj    
4      1     1     4 to    "to "         to     ADP   IN             2 dative  
5      1     1     5 him   "him"         -PRON- PRON  PRP            4 pobj 

I subset the dataframe with

dative <- "dative"
anno %>%
  filter(grepl(dative, relation)) %>% 
  select(sid, sentence)

and looked up the surrounding context with

anno %>%
  mutate(kwic = grepl(dative, relation)) %>%
  mutate(before = gsub("NA\\s?", "", paste(lag(token, 3), lag(token, 2), lag(token))),
         after  = gsub("NA\\s?", "", paste(lead(token), lead(token, 2), lead(token, 3)))
  ) %>%
  filter(kwic) %>%
  select(before, token, after)

I want to extract from the corpus the sentences that contain all three relation tags (dobj, dative, pobj). In other words, I want to check the surrounding context and extract a sentence only if that context also contains the tags "dobj" and "pobj".

So basically, I want to extract sentences with the pattern dobj, dative, pobj (ditransitive sentences, e.g. "I gave money to him"), but not patterns with only one or two of those relations, say dobj alone ("I gave the money") or preposition + pobj ("I gave to him").

How can I do this? Any help is much appreciated.

So far, with great help from @GeoffreyPoole, I have managed to get the list. With some edits to the code below, the output is:

library(zoo)
library(dplyr)

target <- "root dobj dative pobj"

text %>%
  select(sid, relation, lemma) %>%
  
  # get rid of any sentences with fewer than four tokens...
  group_by(sid) %>%
  summarize(n = n()) %>%
  filter(n >= 4) %>%
  left_join(text) %>%
  
  # make sure tokens are in order...
  arrange(sid, tid) %>%
  
  # now, for each sentence...
  group_by(sid) %>%
  group_modify(
    function(x, y) {
      # paste together each 4-gram of relations (and tokens) and convert to a dataframe
      rollapply(x[, c("relation", "token")], 4, paste, collapse = " ") %>%
        as.data.frame
    }
  ) %>% 
  
  # get all unique combinations of sid and pasted 4-grams
  distinct %>%
  
  # select records with the desired pasted string
  filter(relation == target) %>%
  
  # and pull all of the tokens for associated sentences from text
  left_join(text)

sid relation              token                             doc_id   tid token_with_ws lemma upos  xpos  tid_source
   <int> <chr>                 <chr>                              <int> <int> <chr>         <chr> <chr> <chr>      <int>
 1   949 root dobj dative pobj gives ideas to people                 NA    NA NA            NA    NA    NA            NA
 2  1242 root dobj dative pobj provided advantages for customers     NA    NA NA            NA    NA    NA            NA
 3  1631 root dobj dative pobj give harm to themselves               NA    NA NA            NA    NA    NA            NA
 4  2275 root dobj dative pobj say this to us                        NA    NA NA            NA    NA    NA            NA
 5  3016 root dobj dative pobj write fine to you                     NA    NA NA            NA    NA    NA            NA
 6  3826 root dobj dative pobj cause problem for society             NA    NA NA            NA    NA    NA            NA
 7  4184 root dobj dative pobj gives harm to women                   NA    NA NA            NA    NA    NA            NA

Only one question remains: do I need to edit target to see more relations? For example, when target <- "root dobj dative pobj", the result is

1242 root dobj dative pobj provided advantages for customers

What happens if the actual sentence is

"provided advantages for the customers"

Do I need to rewrite target as "root dobj dative (det) pobj" to catch such patterns?
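(A quick sketch of what I mean — assuming the pasted relation strings are matched with grepl() instead of ==, and guessing that a (det )? group could make the determiner optional:)

```r
# Hypothetical: allow an optional det between dative and pobj by
# matching the pasted relation strings with a regex instead of ==
target_regex <- "root dobj dative (det )?pobj"

grepl(target_regex, "root dobj dative pobj")      # TRUE: no determiner
grepl(target_regex, "root dobj dative det pobj")  # TRUE: with determiner
```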

Thanks.

@Fatih's revised question made me realize that the answer to this question is more robust (and more efficient) than what I originally posted.

The key is to build "sentences" out of parts of speech rather than from the tokens (words) themselves, and then use a regex (e.g., with grepl()) to find the "sentences" that have the desired pattern.

The test data are as follows:

> text
# A tibble: 16 x 4
     sid   tid token     upos 
   <int> <int> <chr>     <chr>
 1     1     1 When      ADV  
 2     1     2 you       PRON 
 3     1     3 ’re       VERB 
 4     1     4 traveling VERB 
 5     2     1 You       PRON 
 6     2     2 also      ADV  
 7     2     3 see       VERB 
 8     2     4 a         DET  
 9     3     1 These     DET  
10     3     2 strings   NOUN 
11     3     3 of        ADP  
12     3     4 beads     NOUN 
13     4     1 They      PRON 
14     4     2 have      AUX  
15     4     3 been      AUX  
16     4     4 used      VERB 

Suppose we want to find sentences with the pattern "ADV VERB" or "ADV PRON VERB". The regex would look like this:

regex = "ADV (PRON )?VERB"
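The (PRON )? group makes the PRON optional, so one regex covers both patterns. A quick sanity check with grepl() (strings made up for illustration):

```r
regex <- "ADV (PRON )?VERB"

grepl(regex, "ADV VERB VERB")       # TRUE:  matches "ADV VERB"
grepl(regex, "ADV PRON VERB VERB")  # TRUE:  matches "ADV PRON VERB"
grepl(regex, "PRON AUX AUX VERB")   # FALSE: no ADV before the VERB
```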

So let's build some "sentences" out of parts of speech:

library(dplyr)

posSentences = 
  text %>%
  arrange(sid, tid) %>%
  group_by(sid) %>%
  summarize(uposSentence = paste(upos, collapse = " "))

The "sentences" look like this:

> posSentences
# A tibble: 4 x 2
    sid uposSentence      
  <int> <chr>             
1     1 ADV PRON VERB VERB
2     2 PRON ADV VERB DET 
3     3 DET NOUN ADP NOUN 
4     4 PRON AUX AUX VERB

You can see the first two sentences have the pattern we want; the last two do not. Now just use grepl to find the ones that match the regex:

theAnswer = filter(posSentences, grepl(regex, uposSentence))

And we're done:

> theAnswer
# A tibble: 2 x 2
    sid uposSentence      
  <int> <chr>             
1     1 ADV PRON VERB VERB
2     2 PRON ADV VERB DET 

You can get back to the tokens in those sentences with:

filter(text, sid %in% theAnswer$sid)

which in this case yields:

# A tibble: 8 x 4
    sid   tid token     upos 
  <int> <int> <chr>     <chr>
1     1     1 When      ADV  
2     1     2 you       PRON 
3     1     3 ’re       VERB 
4     1     4 traveling VERB 
5     2     1 You       PRON 
6     2     2 also      ADV  
7     2     3 see       VERB 
8     2     4 a         DET  

The approach above is much faster and more flexible than the one I provided when @Fatih's question was narrower in scope (looking for a specific pattern of three parts of speech). So my previous answer is moot, but I'm leaving it below in case it is useful to anyone.


原始答案(针对 3 个值的特定模式)


Here is a solution using dplyr::group_modify and zoo::rollapply. Basically, by wrapping rollapply inside group_modify, you can rollapply across each sentence and paste each triplet of relations together into a single string. Then simply filter for your desired target string. Depending on your objective, you may or may not want to remove all punctuation from text before running this code.
library(zoo)
library(dplyr)

target = "dobj dative pobj"

text %>%
  select(sid, relation) %>%

  # get rid of any sentences with fewer than three tokens...
  group_by(sid) %>%
  summarize(n = n()) %>%
  filter(n >= 3) %>%
  left_join(text) %>%

  # make sure tokens are in order...
  arrange(sid, tid) %>%

  # now, for each sentence...
  group_by(sid) %>%
  group_modify(
    function(x,y) {
      #paste together each triplet of relations and convert to a dataframe.
      rollapply(x[,"relation"], 3, paste, collapse = " ") %>%
        as.data.frame
    }
  ) %>% 

  # get all unique combinations of sid and pasted triplets
  distinct %>%

  # select records with the desired pasted triplet
  filter(relation == target) %>%

  # and pull all of the tokens for associated sentences from text
  left_join(text)
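To see what the rollapply step is doing, here is a minimal illustration on a bare vector of relations (made-up values, just for demonstration):

```r
library(zoo)

relations <- c("nsubj", "root", "dobj", "dative", "pobj")

# paste each consecutive triplet of relations into one string;
# each window of width 3 becomes one pasted "mini-sentence"
rollapply(relations, 3, paste, collapse = " ")
# "nsubj root dobj" "root dobj dative" "dobj dative pobj"
```

The third pasted triplet is the one that would match target = "dobj dative pobj".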