How to extract sentences containing specific person names using R
I am using R to extract sentences containing specific person names from a text. Here is a sample paragraph:
Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.
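This short paragraph contains several person names, for example Johann Reuchlin, Melanchthon, and Johann Eck. With the openNLP package, three of the names (Martin Luther, Paul, and Melanchthon) are correctly extracted and recognized. I have two questions:
- How can I extract the sentences containing these names?
- Since the output of the named entity recognizer is not very promising, if I wrap each name in "[[ ]]" (for example [[Johann Reuchlin]], [[Melanchthon]]), how can I extract the sentences containing these name expressions [[A]], [[B]], ...?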
Using `strsplit` and `grep`: first I made an object `para` containing your paragraph.
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
unlist(strsplit(para, split = "\\."))[grep(paste(toMatch, collapse = "|"), unlist(strsplit(para, split = "\\.")))]
> unlist(strsplit(para, split = "\\."))[grep(paste(toMatch, collapse = "|"), unlist(strsplit(para, split = "\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Or a little cleaner:
sentences <- unlist(strsplit(para, split = "\\."))
sentences[grep(paste(toMatch, collapse = "|"), sentences)]
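Splitting on a bare period drops the period and leaves a leading space on each sentence, as the output above shows. A minimal sketch of one way around that, assuming sentences end in `.`, `!` or `?` followed by whitespace (abbreviations such as "Dr." would still be split incorrectly). It uses only the last four sentences of the sample paragraph to stay short, and the `\\b` word boundaries are my own addition:

```r
# Last four sentences of the sample paragraph (shortened to keep the example small).
para <- paste(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)

# Split on whitespace that follows sentence-ending punctuation.
# The lookbehind keeps the punctuation mark attached to its sentence.
sentences <- unlist(strsplit(para, "(?<=[.!?])\\s+", perl = TRUE))

toMatch <- c("Paul", "Melanchthon")

# Word boundaries stop "Paul" from matching inside a longer word such as "Pauline".
pattern <- paste0("\\b(", paste(toMatch, collapse = "|"), ")\\b")
matched <- sentences[grepl(pattern, sentences, perl = TRUE)]
print(matched)
```

Here `matched` holds the three sentences that mention Paul or Melanchthon, each with its final period and no leading space.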
If you want the sentences for each person as separate returns, then:
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences <- unlist(strsplit(para, split = "\\."))
foo <- function(Match) { sentences[grep(Match, sentences)] }
lapply(toMatch, foo)
[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Edit 3: To include each person's name in the result, do something simple such as:
foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
Edit 4:
If you want to find sentences that contain several people/places/things (words) at once, just add a pattern combining them, such as:
toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")
and set `perl` to TRUE:
foo <- function(Match) { c(Match, sentences[grep(Match, sentences, perl = TRUE)]) }
> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] "Paul"
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] "Melanchthon"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
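If the lookahead pattern feels hard to read, the same co-occurrence question can be asked with a logical matrix instead. This is only a sketch of an alternative, not a replacement for the code above. The `terms` and `hits` names are mine, and the sentences are copied from the split output (without the leading spaces):

```r
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)
terms <- c("Paul", "Melanchthon", "Scripture")

# One column per term, one row per sentence; fixed = TRUE matches the terms literally.
hits <- sapply(terms, function(term) grepl(term, sentences, fixed = TRUE))

# Sentences that mention both Melanchthon and Scripture.
both <- sentences[hits[, "Melanchthon"] & hits[, "Scripture"]]
print(both)
```

Any combination of terms can then be expressed with `&` and `|` on the columns of `hits`, without building a new regular expression each time.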
Edit 5: To answer your other question:
Given:
sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
will give you the words inside the double brackets:
> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen" "Wittenberg" "Martin Luther" "Johann Reuchlin"
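To go from the bracketed names back to the sentences that contain them, something along these lines might work. It is a sketch under the assumption that the text is already split into sentences and that names are tagged exactly as `[[Name]]`. The `tagged` vector is a hypothetical tagged version of three sentences from the paragraph:

```r
# Hypothetical tagged sentences; only some names carry the [[ ]] markers.
tagged <- c(
  "[[Melanchthon]] became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "[[Johann Eck]] having attacked his views, [[Melanchthon]] replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)

# Names inside [[ ]] for each sentence (a list with one character vector per sentence).
found <- regmatches(tagged, gregexpr("(?<=\\[\\[).*?(?=\\]\\])", tagged, perl = TRUE))

# Keep only the sentences that contain at least one tagged name.
tagged_sentences <- tagged[lengths(found) > 0]

# Map each distinct tagged name to the sentences that mention it.
tag_names <- unique(unlist(found))
by_name <- lapply(tag_names, function(nm) {
  tagged[grepl(paste0("[[", nm, "]]"), tagged, fixed = TRUE)]
})
names(by_name) <- tag_names
print(by_name)
```

Using `fixed = TRUE` in the last step avoids having to escape the brackets a second time.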
Here is a considerably simpler method, using two packages, quanteda and stringi (`txt` holds the paragraph):
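# older quanteda releases exposed this as tokenize(); current releases use tokens()
sents <- unlist(as.list(quanteda::tokens(txt, what = "sentence")), use.names = FALSE)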
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
sentList <- split(sents, list(namesFound))
sentList[["Melanchthon"]]
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
sentList
## $`Martin Luther`
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
##
## $Melanchthon
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
##
## $Paul
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."
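One caveat, as far as I can tell: the `split()` call lines up one extracted name per sentence, so it only works while no sentence contains two of the listed names. If that can happen, a per-name lookup in base R is a possible fallback. This is a sketch only. "Johann Eck" is added to the list here purely to produce a sentence with two hits, and the sentences are the last four of the paragraph:

```r
sents <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)
namesToExtract <- c("Paul", "Melanchthon", "Johann Eck")

# For each name, keep every sentence that mentions it;
# a sentence with two names appears under both entries.
sentList <- lapply(namesToExtract, function(nm) sents[grepl(nm, sents, fixed = TRUE)])
names(sentList) <- namesToExtract
print(sentList)
```

The last sentence shows up under both `Melanchthon` and `Johann Eck`, and the sentence with no listed name is simply left out.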