Anonymize names in paragraph variable by matching and replacement
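I am analyzing a database of student report cards from a school. My dataset contains about 3,000 records structured like the example below. Each observation is one teacher's evaluation of one student, and each includes a three-sentence narrative comment.
To share the results of my analysis, I would like to remove mentions of student names from the comments and replace them with other names. Ideally, for the sake of reproducibility, I would also like to share an anonymized version of the database.
The inconsistent use of student names (first name vs. nickname vs. full name) and their unstructured placement within the comments make this very tricky for a hobbyist like me. My attempt was to treat the comments as documents in a corpus and write a function that uses tm::removeWords, but it did not work for me. Thanks in advance!
Sample data (dput of table here)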
Teacher Subject Student.Name Comment
1 Black Math Richard (Dick) Dick is a terrible student-- why hasn't he been kicked out yet?
2 Black Math Elizabeth (Betty) Betty procrastinates, but does good work.
3 Black Math Mary Grace (MG) As her teacher, I think MG is my favorite.
4 Brown English Richard (Dick) Richard is terrible at turning in homework.
5 Brown English Elizabeth (Betty) Elizabeth's work is interfering with her studies.
6 Brown English Mary Grace (MG) Mary Grace should be a teacher someday.
7 Blue P.E. Richard (Dick) Richard (Dick) kicked more field goals than any other student.
8 Blue P.E. Elizabeth (Betty) Elizabeth (Betty) needs to work to communicate on the field.
9 Blue P.E. Mary Grace (MG) Mary Grace (MG) needs to stop insulting the teacher
Desired data
Teacher Subject Student Name Comment
Black Math A A is a terrible student-- why hasn't he been kicked out yet?
Black Math B B procrastinates, but does good work.
Black Math C As her teacher, I think C is my favorite.
Brown English A A is terrible at turning in homework
Brown English B B's work is interfering with her studies.
Brown English C C should be a teacher someday.
Blue P.E. A A kicked more field goals than any other student.
Blue P.E. B B needs to work to communicate on the field.
Blue P.E. C C needs to stop insulting the teacher
N.B.
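Four months ago I asked a version of this question and got no replies. I thought it would help to show my attempted solution, but perhaps the tm package is not widely used. So here is another shot.
I don't think there is a simple one-size-fits-all solution for this. I would probably try regular expressions.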
## load dput data
# eval(parse(text = paste0(readLines("http://pastebin.com/raw/MbghGybd", warn = FALSE), collapse = "\n")))
# anonymize: capture first name, optional middle name, and nickname
r <- regexec("(\\w+)\\s(?:(\\w+)\\s)?\\((\\w+)\\)", levels(reports$Student.Name))
m <- regmatches(levels(reports$Student.Name), r)
names(m) <- levels(reports$Student.Name)
# build one alternation pattern per student from the captured parts
m <- lapply(m, function(x) {
  paste(sprintf("%s\\s*\\(%s\\)", x[2], x[4]), sprintf("%s %s \\(%s\\)", x[2], x[3], x[4]), x[2], x[4], paste(x[2], x[3], sep = " "), sep = "|")
})
rep <- split(reports, reports$Student.Name)
for (x in seq_along(names(rep))) {
  rep[[x]]$Comment <- gsub(m[[names(rep)[x]]], x, rep[[x]]$Comment, perl = TRUE)
}
transform(do.call(rbind, rep), Student.Name = as.integer(Student.Name))
# Teacher Subject Student.Name Comment
# Elizabeth (Betty).2 Black Math 1 1 procrastinates, but does good work.
# Elizabeth (Betty).5 Brown English 1 1's work is interfering with her studies.
# Elizabeth (Betty).8 Blue P.E. 1 1 needs to work to communicate on the field.
# Mary Grace (MG).3 Black Math 2 As her teacher, I think 2 is my favorite.
# Mary Grace (MG).6 Brown English 2 2 Grace should be a teacher someday.
# Mary Grace (MG).9 Blue P.E. 2 2 needs to stop insulting the teacher
# Richard (Dick).1 Black Math 3 3 is a terrible student-- why hasn't he been kicked out yet?
# Richard (Dick).4 Brown English 3 3 is terrible at turning in homework
# Richard (Dick).7 Blue P.E. 3 3 kicked more field goals than any other student.
But this will certainly need a lot of tweaking to get your real dataset into shape.
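One tweak worth sketching: the row "2 Grace should be a teacher someday." in the output above still contains "Grace", because in an alternation the shorter alternative "Mary" is tried before "Mary Grace". Below is a minimal base-R sketch that sorts the name variants longest-first and quotes them literally. The helper name_pattern and the toy data frame are hypothetical, and the sketch assumes every Student.Name has the form "First [Middle] (Nickname)".

```r
# Toy data mirroring three rows of the question's example (assumption: same layout).
reports <- data.frame(
  Student.Name = c("Richard (Dick)", "Elizabeth (Betty)", "Mary Grace (MG)"),
  Comment = c("Richard (Dick) kicked more field goals than any other student.",
              "Elizabeth's work is interfering with her studies.",
              "Mary Grace should be a teacher someday."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one PCRE pattern per "First [Middle] (Nickname)" label.
name_pattern <- function(label) {
  nick  <- sub("^.*\\((.*)\\)\\s*$", "\\1", label)  # text inside the parentheses
  full  <- sub("\\s*\\(.*$", "", label)             # label without the nickname
  first <- sub("\\s.*$", "", full)                  # first word of the full name
  variants <- unique(c(paste0(full, " (", nick, ")"), full, first, nick))
  # Longest variant first, so "Mary Grace" is tried before "Mary".
  variants <- variants[order(nchar(variants), decreasing = TRUE)]
  # \Q...\E quotes each variant literally; the lookarounds act as word
  # boundaries that also work next to parentheses.
  paste0("(?<!\\w)(?:", paste0("\\Q", variants, "\\E", collapse = "|"), ")(?!\\w)")
}

students <- unique(reports$Student.Name)
codes <- setNames(LETTERS[seq_along(students)], students)  # assumes at most 26 students

anonymized <- reports
for (s in students) {
  rows <- anonymized$Student.Name == s
  anonymized$Comment[rows] <- gsub(name_pattern(s), codes[[s]],
                                   anonymized$Comment[rows], perl = TRUE)
}
anonymized$Student.Name <- unname(codes[anonymized$Student.Name])
print(anonymized)
```

Each pattern is applied only to that student's own rows, so a classmate mentioned by name in someone else's comment would not be caught by this sketch.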
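I would use mgsub from the qdap package here. You could do something like this (though take care to ensure each student is attributed the same id; this may be too specific to your example, in which every student has a nickname):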
names <- unique(as.character(reports$Student.Name))
ids <- sample(100000, length(names))
# every string to search for: full label, nickname, and name without nickname
tocheck <- c(
  names,
  unlist(regmatches(names, gregexpr("(?<=\\().*?(?=\\))", names, perl = TRUE))),
  gsub("\\s*\\([^\\)]+\\)", "", as.character(names))
)
reports$Student.Name <- rep(ids, 3)
reports$Comment <- qdap::mgsub(tocheck, rep(ids, 3), reports$Comment)
Student.Name Comment
1 61034 61034 is a terrible student-- why hasn't he been kicked out yet?
2 45005 45005 procrastinates, but does good work.
3 13699 As her teacher, I think 13699 is my favorite.
4 61034 61034 is terrible at turning in homework
5 45005 45005's work is interfering with her studies.
6 13699 13699 should be a teacher someday.
7 61034 61034 kicked more field goals than any other student.
8 45005 45005 needs to work to communicate on the field.
9 13699 13699 needs to stop insulting the teacher
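On the caveat about keeping ids consistent: rep(ids, 3) only lines up with the rows because the example data happens to list the three students in the same order in every block. A rough base-R sketch that looks ids up by name instead is shown below. It is not the qdap implementation; the lookup table, the toy rows, and the fixed seed are assumptions made for illustration.

```r
# Toy rows in a deliberately irregular order (assumption: same columns as the question).
reports <- data.frame(
  Student.Name = c("Richard (Dick)", "Mary Grace (MG)", "Richard (Dick)"),
  Comment = c("Dick is a terrible student.",
              "Mary Grace (MG) needs to stop insulting the teacher",
              "Richard is terrible at turning in homework."),
  stringsAsFactors = FALSE
)

students <- unique(reports$Student.Name)
set.seed(1)  # assumption: a fixed seed so the random ids can be reproduced
ids <- sample(100000, length(students))

# One row per (variant, id): full label, nickname, and name without nickname.
lookup <- data.frame(
  variant = c(students,
              sub("^.*\\((.*)\\)$", "\\1", students),
              sub("\\s*\\(.*$", "", students)),
  id = rep(ids, 3),
  stringsAsFactors = FALSE
)
# Replace longer variants first so "Mary Grace (MG)" wins over "MG".
lookup <- lookup[order(nchar(lookup$variant), decreasing = TRUE), ]

for (i in seq_len(nrow(lookup))) {
  reports$Comment <- gsub(lookup$variant[i], as.character(lookup$id[i]),
                          reports$Comment, fixed = TRUE)
}
# Assign ids by matching on the name, not by relying on row order.
reports$Student.Name <- ids[match(reports$Student.Name, students)]
print(reports)
```

Because the replacement is literal (fixed = TRUE), a short nickname such as "MG" would also be replaced inside longer words, so real data would probably need word-boundary handling on top of this.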