Partial grepl results when matching keywords with text strings for multiple columns
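I have a list of diagnoses that I want to group by keyword. So if one of the keywords in ref[[1]] is found in mh$prb, then mh$group gets 1. The problem I am running into with grepl is that some of my keywords match and others do not, even though they are present. My keywords are stored in ref (see the dput(ref) output further down).
To assign the diagnosis groups, I did the following (using this example):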
mh$group <- ifelse(grepl(ref[[1]], mh$prb), 1,
ifelse(grepl(ref[[2]], mh$prb), 2,
ifelse(grepl(ref[[3]], mh$prb), 3,
ifelse(grepl(ref[[4]], mh$prb), 4,
ifelse(grepl(ref[[5]], mh$prb), 5,
ifelse(grepl(ref[[6]], mh$prb), 6,
ifelse(grepl(ref[[7]], mh$prb), 7, 0
)))))))
And, as you can see, I get a partial match: some keywords are flagged and others are not. For example, 'depression' is being assigned while 'bipolar' is not.
> head(mh)
prb group
<chr> <dbl>
1 unspecified major depression single episode 2.00
2 bipolar disorder unspecified 0
3 unspecified major depression recurrent episode 2.00
4 bipolar disorder unspecified 0
5 alcohol abuse unspecified 7.00
6 cocaine dependence uncomplicated 0
So I isolated a test example. You can see that the t df has bipolar, and so does my ref.
> t <- filter(mh, prb == "bipolar disorder unspecified")
> ref[[2]]
[1] "major| depression| depressive| bipolar| manic| mood| substance induced mood| substance induced mood| alcohol induced mood| alcohol induced mood| cocaine induced mood| cocaine induced mood| amphetamine induced mood| amphetamine induced mood| opioid induced mood| opioid induced mood| cannabis induced mood| cannabis induced mood| marijuana induced mood| marijuana induced mood| methamphetamine induced mood| methamphetamine induced mood| sedative| hypnotic anxiolytic induced mood"
> grepl("bipolar", t$prb)
[1] TRUE
> grepl("bipolar", ref[[2]])
[1] TRUE
> grepl(t$prb, ref[[2]])
[1] FALSE
> grepl(ref[[2]], t$prb)
[1] FALSE
So "bipolar" is TRUE for ref[[2]] and for t$prb individually, but not when the two are compared with each other. Where am I messing up?
Edit:
> dput(ref)
c("psychotic| schizophrenia| schizo| psychosis| delusional| delusion| paranoid| undifferentiated| disorganized| substance induced psychotic| substance induced psychosis| alcohol induced psychotic| alcohol induced psychosis| cocaine induced psychosis| cocaine induced psychotic| amphetamine induced psychosis| amphetamine induced psychotic| opioid induced psychosis| opioid induced psychotic| cannabis induced psychosis| cannabis induced psychotic| marijuana induced psychosis| marijuana induced psychotic| methamphetamine induced psychosis| methamphetamine induced psychotic| hallucinogen induced psychosis| hallucinogen induced psychotic| PCP induced psychosis| PCP induced psychotic| benzodiazepine induced psychosis| benzodiazepine induced psychotic| phencyclidine induced psychosis| phencyclidine induced psychotic",
"major| depression| depressive| bipolar| manic| mood| substance induced mood| substance induced mood| alcohol induced mood| alcohol induced mood| cocaine induced mood| cocaine induced mood| amphetamine induced mood| amphetamine induced mood| opioid induced mood| opioid induced mood| cannabis induced mood| cannabis induced mood| marijuana induced mood| marijuana induced mood| methamphetamine induced mood| methamphetamine induced mood| sedative| hypnotic anxiolytic induced mood",
"post| traumatic| PTSD| panic| intermittent| explosive", "borderline| schizoid| schizotypal| paranoid",
"neuro| neurocognitive| cognitive| dementia| alzheimers| vascular",
"autism| aspergers| spectrum| retardation| intellectual| disability",
"alcohol| cannabis| marijuana| opioid| heroin| amphetamine| methamphetamine| cocaine| inhalant| hallucinogen| PCP| sedative| hypnotic| anxiolytic| benzodiazepine| Xanax| valium| phencyclidine| induced| substance induced| alcohol induced| cannabis induced| marijuana induced| opioid induced| heroin induced| amphetamine induced| methamphetamine induced| cocaine induced| inhalant induced| hallucinogen induced| PCP induced| sedative induced| hypnotic induced| anxiolytic induced| benzodiazepine induced| Xanax induced| valium induced| phencyclidine induced"
)
> dput(head(mh))
structure(list(prb = c("unspecified major depression single episode",
"bipolar disorder unspecified", "unspecified major depression recurrent episode",
"bipolar disorder unspecified", "alcohol abuse unspecified",
"cocaine dependence uncomplicated")), .Names = "prb", row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
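The cause of the problem is the way your ref variable is defined. When you specify the "or" as "| bipolar", grepl looks for a space followed by the word "bipolar", so you miss every match where the keyword is the first word of the string. That is also why 'depression' matched: in your data it always follows "major" and a space.
To fix this, try using "|bipolar" (which will also find the keyword inside a compound word) or "|bipolar " with a trailing space (which will find individual words, except when the keyword is the last word in the sentence).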
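As a minimal sketch of the effect described above, the following compares a pattern with and without the leading space. The vector x is a made-up two-element example, not part of the original data, and the \\b word-boundary variant is an extra option that is not in the original answer.

```r
# Hypothetical two-element example; `x` is not from the original data.
x <- c("bipolar disorder unspecified", "unspecified bipolar disorder")

# "| bipolar" requires a literal space before "bipolar"
with_space <- grepl("major| bipolar", x)

# "|bipolar" matches "bipolar" anywhere, including at the start of the string
without_space <- grepl("major|bipolar", x)

# \\b anchors the keyword at word boundaries (an assumption-free alternative
# to padding with spaces; supported by R's default regex engine)
whole_word <- grepl("major|\\bbipolar\\b", x)

with_space     # FALSE  TRUE
without_space  # TRUE   TRUE
whole_word     # TRUE   TRUE
```

Only the pattern with the leading space misses the first string, which is the situation in the question.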
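To batch-fix your ref variable without manually removing every extra space, you can use gsub. The | is a special character in a regular expression, so it has to be escaped, and inside an R string the backslash itself must be doubled:
ref <- gsub("\\| ", "|", ref)
# For example
ref[5]
[1] "neuro|neurocognitive|cognitive|dementia|alzheimers|vascular"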
Now:
ifelse(grepl(ref[[1]], mh$prb), 1,.... )))))))
will produce:
[1] 2 2 2 2 7 7
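If the nested ifelse() becomes hard to maintain, one possible alternative is a loop over the patterns. This is only a sketch: the shortened ref and the prb vector below are hypothetical stand-ins for the real data, and the loop is not part of the original answer.

```r
# Hypothetical, shortened stand-ins for `ref` and `mh$prb`
ref <- c("psychotic| schizophrenia",
         "major| depression| bipolar",
         "alcohol| cocaine")
prb <- c("unspecified major depression single episode",
         "bipolar disorder unspecified",
         "cocaine dependence uncomplicated",
         "adjustment disorder")

# Same cleanup as in the answer: drop the space after each "|"
ref_clean <- gsub("\\| ", "|", ref)

# 0 means "no keyword matched"
group <- rep(0, length(prb))

# Walk the patterns from last to first so that earlier groups overwrite
# later ones, mirroring the priority order of the nested ifelse()
for (i in rev(seq_along(ref_clean))) {
  group[grepl(ref_clean[i], prb)] <- i
}

group  # 2 2 3 0
```

The result can then be assigned to mh$group in the same way as the ifelse() version.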