Search for a series of ordered tokens in sentences represented as a dataframe of individual tokens
I am trying to learn more about corpora and word analysis in R. I recently started using cleanNLP with the spaCy backend. The problem is that, after parsing a text, I want to see whether a sentence has tokens marking particular relations.
Suppose:
library(cleanNLP)
library(tidyverse)
text <- cnlp_annotate(c("I gave him money"))
The result would be:
doc_id sid tid token token_with_ws lemma upos xpos tid_source relation
<int> <int> <int> <chr> <chr> <chr> <chr> <chr> <int> <chr>
1 1 1 1 I "I " -PRON- PRON PRP 2 nsubj
2 1 1 2 gave "gave " give VERB VBD 0 root
3 1 1 3 money "money " money NOUN NN 2 dobj
4 1 1 4 to "to " to ADP IN 2 dative
5 1 1 5 him "him" -PRON- PRON PRP 4 pobj
I filtered the annotation dataframe with
dative <- c("dative")
anno %>%
  filter(grepl(dative, relation)) %>%
  select(sid, sentence)
and looked at the preceding and following context with
anno %>%
  mutate(kwic = grepl(dative, relation)) %>%
  mutate(before = gsub("NA\\s?", "", paste(lag(token, 3), lag(token, 2), lag(token))),
         after  = gsub("NA\\s?", "", paste(lead(token), lead(token, 2), lead(token, 3)))
  ) %>%
  filter(kwic) %>%
  select(before, token, after)
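The `lag()`/`lead()` calls above amount to a sliding context window. For intuition, here is a minimal base-R sketch of the same idea (the token vector and the position `i` are hypothetical stand-ins, not output of the pipeline, and the sketch assumes the match is not the first or last token):

```r
# Hypothetical tokens; i marks the position of the matched ("dative") token
tokens <- c("I", "gave", "money", "to", "him")
i <- 4

# up to three tokens of context on each side, clipped at the sentence start
before <- paste(tokens[max(1, i - 3):(i - 1)], collapse = " ")
after  <- paste(tokens[(i + 1):min(length(tokens), i + 3)], collapse = " ")

before  # "I gave money"
after   # "him"
```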
I want to extract from the corpus the sentences that contain all three relation tags (dobj, dative, pobj). In other words, I want to check the preceding and following context and extract a sentence only if the tags "dobj" and "pobj" appear around the "dative" token.
So basically, I want to extract sentences with the pattern dobj, dative, pobj (double-object sentences: "I gave him money"), but not sentences with only one or two of those relations, say dobj alone ("I gave the money") or preposition + pobj ("I gave to him").
How can I do this? Any help is greatly appreciated.
So far, with great help from @GeoffreyPoole, I have managed to get the list. With a few edits to his code, shown below, the output is:
library(zoo)   # needed for rollapply()

target <- "root dobj dative pobj"
text %>%
select(sid, relation, lemma) %>%
# get rid of any sentences with fewer than four tokens...
group_by(sid) %>%
summarize(n = n()) %>%
filter(n >= 4) %>%
left_join(text) %>%
# make sure tokens are in order...
arrange(sid, tid, lemma) %>%
# now, for each sentence...
group_by(sid) %>%
group_modify(
function(x, y) {
#paste together each window of four relations and their tokens; convert to a dataframe.
rollapply(x[, c("relation", "token")], 4, paste, collapse = " ") %>%
as.data.frame
}
) %>%
# get all unique combinations of sid and pasted windows
distinct %>%
# select records with the desired pasted window
filter(relation == target) %>%
# and pull all of the tokens for associated sentences from text
left_join(text)
sid relation token doc_id tid token_with_ws lemma upos xpos tid_source
<int> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <int>
1 949 root dobj dative pobj gives ideas to people NA NA NA NA NA NA NA
2 1242 root dobj dative pobj provided advantages for customers NA NA NA NA NA NA NA
3 1631 root dobj dative pobj give harm to themselves NA NA NA NA NA NA NA
4 2275 root dobj dative pobj say this to us NA NA NA NA NA NA NA
5 3016 root dobj dative pobj write fine to you NA NA NA NA NA NA NA
6 3826 root dobj dative pobj cause problem for society NA NA NA NA NA NA NA
7 4184 root dobj dative pobj gives harm to women NA NA NA NA NA NA NA
Only one issue is left: do I need to edit target to catch more relations? For example, when target <- "root dobj dative pobj", one result is
1242 root dobj dative pobj provided advantages for customers
What if the actual sentence were "provided advantages for the customers"? Would I need to rewrite target as "root dobj dative (det) pobj" to observe those patterns as well?
Thanks.
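One way to avoid enumerating every variant by hand is to compare the pasted relation strings with grepl() and a regex rather than with ==, making the determiner optional. A base-R sketch (the relation strings here are hypothetical stand-ins for the rollapply output):

```r
# Compare pasted relation strings against a regex instead of an exact string
target <- "root dobj dative (det )?pobj"

grepl(target, "root dobj dative pobj")      # TRUE: "...advantages for customers"
grepl(target, "root dobj dative det pobj")  # TRUE: "...advantages for the customers"
```

Note that the fixed rollapply window width (4) would also have to grow to catch the five-relation variant; collapsing each whole sentence into a single relation string sidesteps that, which is the idea the regex-based answer builds on.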
The revised question from @Fatih made me realize that there is a more robust (and more efficient) answer to this question than the one I originally posted.
The key is to build "sentences" out of the parts of speech rather than out of the tokens (words) themselves, and then use a regex (e.g., with grepl()) to find the "sentences" with the desired pattern.
Here are the test data:
> text
# A tibble: 16 x 4
sid tid token upos
<int> <int> <chr> <chr>
1 1 1 When ADV
2 1 2 you PRON
3 1 3 ’re VERB
4 1 4 traveling VERB
5 2 1 You PRON
6 2 2 also ADV
7 2 3 see VERB
8 2 4 a DET
9 3 1 These DET
10 3 2 strings NOUN
11 3 3 of ADP
12 3 4 beads NOUN
13 4 1 They PRON
14 4 2 have AUX
15 4 3 been AUX
16 4 4 used VERB
Suppose we want to find sentences with the pattern "ADV VERB" or "ADV PRON VERB". The regex would look like this:
regex = "ADV (PRON )?VERB"
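A quick sanity check of what this regex does and does not match:

```r
regex <- "ADV (PRON )?VERB"

grepl(regex, "ADV VERB")       # TRUE:  the PRON group is optional
grepl(regex, "ADV PRON VERB")  # TRUE
grepl(regex, "ADV NOUN VERB")  # FALSE: NOUN is not allowed between them
```

Since the upos tags form a small closed set, substring collisions are unlikely; if needed, the tags can be anchored with \\b word boundaries.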
So let's build some "sentences" out of the parts of speech:
library(dplyr)
posSentences =
text %>%
arrange(sid, tid) %>%
group_by(sid) %>%
summarize(uposSentence = paste(upos, collapse = " "))
The "sentences" look like this:
> posSentences
# A tibble: 4 x 2
sid uposSentence
<int> <chr>
1 1 ADV PRON VERB VERB
2 2 PRON ADV VERB DET
3 3 DET NOUN ADP NOUN
4 4 PRON AUX AUX VERB
You can see that the first two sentences have the pattern we want and the last two do not. Now it is just a matter of using grepl to find the ones that match the regex:
theAnswer = filter(posSentences, grepl(regex, posSentences$uposSentence))
And we are done:
> theAnswer
# A tibble: 2 x 2
sid uposSentence
<int> <chr>
1 1 ADV PRON VERB VERB
2 2 PRON ADV VERB DET
You can get back to the tokens in those sentences with:
filter(text, sid %in% theAnswer$sid)
which, in this case, yields:
# A tibble: 8 x 4
sid tid token upos
<int> <int> <chr> <chr>
1 1 1 When ADV
2 1 2 you PRON
3 1 3 ’re VERB
4 1 4 traveling VERB
5 2 1 You PRON
6 2 2 also ADV
7 2 3 see VERB
8 2 4 a DET
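For readers without dplyr, the same pipeline can be sketched in base R (the mini-dataframe below is a hypothetical stand-in for the cleanNLP token table, with the same column names):

```r
# Hypothetical mini token table with columns sid, tid, token, upos
text <- data.frame(
  sid   = c(1, 1, 2, 2),
  tid   = c(1, 2, 1, 2),
  token = c("When", "traveling", "These", "strings"),
  upos  = c("ADV", "VERB", "DET", "NOUN")
)
text <- text[order(text$sid, text$tid), ]  # make sure tokens are in order

# collapse each sentence's parts of speech into one string...
posSentences <- aggregate(upos ~ sid, data = text, FUN = paste, collapse = " ")

# ...keep the sentences whose POS string matches the regex...
regex <- "ADV (PRON )?VERB"
theAnswer <- posSentences[grepl(regex, posSentences$upos), ]

# ...and get back to the tokens of the matching sentences
text[text$sid %in% theAnswer$sid, ]
```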
The approach above is much faster and more flexible than the approach I provided when @Fatih's question was narrower in scope (looking for a specific pattern of three parts of speech). So my previous answer is moot, but I have left it below in case it is useful to anyone.
Original answer (for a specific pattern of 3 values)
Here is a solution using dplyr::group_modify and zoo::rollapply. Basically, by wrapping rollapply inside group_modify, you can rollapply across each sentence and paste each triplet of relations together into a single string. Then simply filter for the desired target string. Depending on your objective, you may or may not want to remove all punctuation from text before running this code.
library(zoo)
library(dplyr)
target = "dobj dative pobj"
text %>%
select(sid, relation) %>%
# get rid of any sentences with less than three words...
group_by(sid) %>%
summarize(n = n()) %>%
filter(n >= 3) %>%
left_join(text) %>%
# make sure tokens are in order...
arrange(sid, tid) %>%
# now, for each sentence...
group_by(sid) %>%
group_modify(
function(x,y) {
#paste together each triplet of relations and convert to a dataframe.
rollapply(x[,"relation"], 3, paste, collapse = " ") %>%
as.data.frame
}
) %>%
# get all unique combinations of sid and pasted triplets
distinct %>%
# select records with the desired pasted triplet
filter(relation == target) %>%
# and pull all of the tokens for associated sentences from text
left_join(text)
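For intuition, the rolling "triplet" step can also be sketched in base R with embed() (the relation vector is a hypothetical single-sentence example, not the full pipeline):

```r
# Hypothetical relation sequence for one sentence
relations <- c("nsubj", "root", "dobj", "dative", "pobj")

# embed() returns every window of 3 consecutive elements, most recent column
# first, so reverse the columns to restore left-to-right order
windows  <- embed(relations, 3)[, 3:1]
triplets <- apply(windows, 1, paste, collapse = " ")

triplets
# "nsubj root dobj"  "root dobj dative"  "dobj dative pobj"
any(triplets == "dobj dative pobj")  # TRUE
```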