Conditional delete rows: delete the quasi-identical rows but not the identical ones
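I have a data frame that I need to clean up based on values that are "quasi-identical" across rows. I only want to delete the observations that are nearly identical but not exactly identical. I tried to do this with agrep, but that function also removes the identical observations.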
Id<-c("RoLu1976","Rolu1976","RoLu1976","AlBl1989","ThSa1996")
Art<-c("Econometric Policy Evaluation: A Critique","Econometric Policy Evaluations A Critique","Econometric Policy Evaluation: A Critique", "Rules after discretion", "Expectations and the Nonneutrality of Lucas")
Id.1<-c("FiKy1989","FiKy1989","BeBe1983","JoSt1989","JoSt1990")
Art.1<-c("Notes on the Lucas Critique","Notes on the Lucas Critique","The Inconsistency of Optimal Plans","The Inconsistency","Notes on the Lucas")
N<-data.frame(Id,Art,Id.1,Art.1)
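The quasi-identical values in the data frame above are in the Art column of the first observations (rows 1 and 2), which differ only by "s" and ":".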
In the case above, the final data frame should be (note that the identical values have not been removed):
Id        Art                                          Id.1      Art.1
RoLu1976  Econometric Policy Evaluation: A Critique    FiKy1989  Notes on the Lucas Critique
RoLu1976  Econometric Policy Evaluation: A Critique    BeBe1983  The Inconsistency of Optimal Plans
AlBl1989  Rules after discretion                       JoSt1989  The Inconsistency
ThSa1996  Expectations and the Nonneutrality of Lucas  JoSt1990  Notes on the Lucas
What I did is:
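yy = NULL
for (i in seq_along(N$Art)) {
  # all titles that approximately match the title in row i
  temp = agrep(N[i, "Art"], N$Art, value = TRUE)
  # if an exact match is among them, use the first approximate match
  y = ifelse(any(N[i, "Art"] == temp), temp[1], N[i, "Art"])
  yy = c(yy, y)
}
N$Art = yy
# drops every repeated title, exact or approximate
N.2 = N[!duplicated(N$Art), ]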
But this removes both kinds of rows: the identical ones and the quasi-identical ones.
How should I do this?
You can store which entries of the original Art column were exact duplicates and combine that with the result of the deduplication, e.g.
originallyDuplicated <- duplicated(N$Art)
# then run your snippet to generate `yy`
So you want to get rid of the rows that are duplicated now but were not duplicated originally:
N[!(duplicated(yy) & !originallyDuplicated),]
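Putting the pieces together, the following is a minimal end-to-end sketch of the approach above, not a definitive implementation. It rebuilds the example data, records the exact duplicates before anything is overwritten, and uses vapply in place of the question's loop to build yy. The names keep and N.2 are only illustrative, and the result depends on agrep's default max.distance of 0.1, which happens to be loose enough to pair the two titles in this example.

```r
# Rebuild the example data from the question.
N <- data.frame(
  Id    = c("RoLu1976", "Rolu1976", "RoLu1976", "AlBl1989", "ThSa1996"),
  Art   = c("Econometric Policy Evaluation: A Critique",
            "Econometric Policy Evaluations A Critique",
            "Econometric Policy Evaluation: A Critique",
            "Rules after discretion",
            "Expectations and the Nonneutrality of Lucas"),
  Id.1  = c("FiKy1989", "FiKy1989", "BeBe1983", "JoSt1989", "JoSt1990"),
  Art.1 = c("Notes on the Lucas Critique", "Notes on the Lucas Critique",
            "The Inconsistency of Optimal Plans", "The Inconsistency",
            "Notes on the Lucas"),
  stringsAsFactors = FALSE
)

# Exact duplicates in the untouched Art column (must be computed first).
originallyDuplicated <- duplicated(N$Art)

# Map each title to the first title in the column that approximately matches it.
# A title always matches itself, so the result is never NA.
yy <- vapply(
  N$Art,
  function(a) agrep(a, N$Art, value = TRUE, fixed = TRUE)[1],
  character(1),
  USE.NAMES = FALSE
)

# Drop rows that only became duplicates after the fuzzy mapping;
# rows that were exact duplicates from the start are kept.
keep <- !(duplicated(yy) & !originallyDuplicated)
N.2 <- N[keep, ]
print(N.2)
```

On the example data this should keep rows 1, 3, 4 and 5, which is the data frame the question asks for. Note that N$Art is left unmodified here, so the kept rows retain their original titles.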
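That said, rather than basing the exclusion criterion entirely on the Art column, it seems to me that it would make more sense to exclude a row only if every column of that row is duplicated (or nearly duplicated) elsewhere in the table (e.g., also compare the Art.1, Id.1 and Id columns?).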
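One way this whole-row idea might be sketched is shown below. It is an assumption-laden illustration rather than a tested recipe: rowsNearlyEqual and quasiDuplicate are hypothetical names, every column is compared with agrepl at its default max.distance, and a row is dropped only if it nearly matches an earlier row in all columns without being exactly identical to it.

```r
# Example data from the question.
N <- data.frame(
  Id    = c("RoLu1976", "Rolu1976", "RoLu1976", "AlBl1989", "ThSa1996"),
  Art   = c("Econometric Policy Evaluation: A Critique",
            "Econometric Policy Evaluations A Critique",
            "Econometric Policy Evaluation: A Critique",
            "Rules after discretion",
            "Expectations and the Nonneutrality of Lucas"),
  Id.1  = c("FiKy1989", "FiKy1989", "BeBe1983", "JoSt1989", "JoSt1990"),
  Art.1 = c("Notes on the Lucas Critique", "Notes on the Lucas Critique",
            "The Inconsistency of Optimal Plans", "The Inconsistency",
            "Notes on the Lucas"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if every column of row i approximately
# matches the same column of row j.
rowsNearlyEqual <- function(df, i, j) {
  all(vapply(names(df), function(col) {
    agrepl(as.character(df[i, col]), as.character(df[j, col]), fixed = TRUE)
  }, logical(1)))
}

# A row is a quasi-duplicate if some earlier row nearly matches it
# in all columns but is not exactly identical to it.
quasiDuplicate <- vapply(seq_len(nrow(N)), function(i) {
  earlier <- seq_len(i - 1)
  any(vapply(earlier, function(j) {
    !identical(unlist(N[i, ]), unlist(N[j, ])) && rowsNearlyEqual(N, i, j)
  }, logical(1)))
}, logical(1))

N.3 <- N[!quasiDuplicate, ]
print(N.3)
```

On the example data only row 2 should be flagged, since it is the only row that nearly matches an earlier row in all four columns. Keep in mind that agrepl matches the pattern approximately against substrings of the target, so a short value such as "The Inconsistency" counts as a match for "The Inconsistency of Optimal Plans"; a stricter comparison, for example one based on adist, may be preferable on real data.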