Deleting rows with duplicate information while keeping the first non-duplicated entry (and appending data from the duplicate entries to that row)
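I have a dataset with 500,000 entries (rows). Each entry is for a particular student and contains information about the school that student attended in that particular semester.
Because students stay at the same school for several semesters, I have many entries for the same student and the same school (only the semester changes, i.e. EnrollmentBegin and EnrollmentEnd).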
FirstName LastName CollegeName State PublicPrivate EnrollmentBegin EnrollmentEnd
John Doe School A NY Public 20050829 20051223
John Doe School A NY Public 20051229 20060113
John Doe School A NY Public 20051223 20060513
John Doe School B IL Private 20090105 20090301
John Doe School B IL Private 20090706 20090830
John Doe School B IL Private 20090831 20091025
Jane Doe School A IL Private 20100105 20100301
Jane Doe School A IL Private 20100706 20100830
Jane Doe School A IL Private 20100831 20101025
John Doe School A NY Public 20110829 20111223
John Doe School A NY Public 20120129 20120513
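This means that for some students I have many entries in which the student name and college name are identical.
I really only want the first instance of each new entry (i.e. whenever the school name changes for a given student), but I also need to know when the student's enrollment at that school ended.
That information is found in the last entry for each student at a given school.
So I need to take that value from the last entry and add it as a new column to the row containing the student's first entry.
Note: I realize that some students, like John Doe above, go to School A, go to another school, and then come back to School A. Ideally I would capture that, so I would like my final dataset to look like this: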
FirstName LastName CollegeName State PublicPrivate EnrollmentBegin EnrollmentEnd EnrollmentEnd
John Doe School A NY Public 20050829 20051223 20060513
John Doe School A NY Public 20110829 20111223 20120513
John Doe School B IL Private 20090105 20090301 20091025
Jane Doe School A IL Private 20100105 20100301 20101025
How can I do this in the most efficient way? It seems that min and max alone no longer solve this...
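Try
library(data.table)
# For each combination of the first five columns, keep the first begin/end
# dates and add the last EnrollmentEnd of the group (.N is the group size)
setDT(df1)[, list(EnrollmentBegin = EnrollmentBegin[1L],
                  EnrollmentEnd = EnrollmentEnd[1L],
                  EnrollmentEnd2 = EnrollmentEnd[.N]),
           by = c(names(df1)[1:5])]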
# FirstName LastName CollegeName State PublicPrivate EnrollmentBegin
#1: John Doe School A NY Public 20050829
#2: John Doe School B IL Private 20090105
#3: Jane Doe School A IL Private 20100105
# EnrollmentEnd EnrollmentEnd2
#1: 20051223 20060513
#2: 20090301 20091025
#3: 20100301 20101025
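With the question's full sample (including the 2011-2012 rows), grouping on the five columns alone would put both of John Doe's School A stints into one group, whereas the desired output lists them separately. A minimal sketch of one way to keep them apart, assuming the rows are already in chronological order for each student: data.table's rleid() numbers consecutive runs of identical values, so a return to School A starts a new run. The column name stint and the object res are made up for illustration.

```r
library(data.table)

# Sample data from the question, in the original row order
df1 <- data.table(
  FirstName       = c(rep("John", 6), rep("Jane", 3), rep("John", 2)),
  LastName        = "Doe",
  CollegeName     = c(rep("School A", 3), rep("School B", 3), rep("School A", 5)),
  State           = c(rep("NY", 3), rep("IL", 6), rep("NY", 2)),
  PublicPrivate   = c(rep("Public", 3), rep("Private", 6), rep("Public", 2)),
  EnrollmentBegin = c(20050829L, 20051229L, 20051223L, 20090105L, 20090706L,
                      20090831L, 20100105L, 20100706L, 20100831L, 20110829L,
                      20120129L),
  EnrollmentEnd   = c(20051223L, 20060113L, 20060513L, 20090301L, 20090830L,
                      20091025L, 20100301L, 20100830L, 20101025L, 20111223L,
                      20120513L)
)

keys <- names(df1)[1:5]

# stint (hypothetical name) increases whenever any identifying column changes
# from one row to the next, so non-adjacent repeats get different ids
df1[, stint := rleid(FirstName, LastName, CollegeName, State, PublicPrivate)]

# One row per stint: first begin/end dates plus the last EnrollmentEnd
res <- df1[, .(EnrollmentBegin = EnrollmentBegin[1L],
               EnrollmentEnd   = EnrollmentEnd[1L],
               EnrollmentEnd2  = EnrollmentEnd[.N]),
           by = c("stint", keys)]
print(res)
```

If the rows of different students are interleaved, the data would need to be ordered by student first (for example with setorder() on the name columns) before computing the run ids.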
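Or using dplyr
library(dplyr)
# group_by_() is deprecated; across(all_of(...)) selects the grouping columns by name
df1 %>%
  group_by(across(all_of(names(df1)[1:5]))) %>%
  summarise(EnrollmentBegin = EnrollmentBegin[1L],
            EnrollmentEnd1 = EnrollmentEnd[1L],
            EnrollmentEnd2 = EnrollmentEnd[n()])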
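An alternative using base R and lapply
# Split on the five identifying columns; drop = TRUE discards empty combinations
lst <- unname(split(dat, dat[, 1:5], drop = TRUE))
# For each group, keep the first row and attach the group's last EnrollmentEnd
out <- do.call(rbind, lapply(lst, function(x) {
  x$EnrollmentEnd.new <- x$EnrollmentEnd[nrow(x)]
  x[1, ]
}))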
#> out
# FirstName LastName CollegeName State PublicPrivate EnrollmentBegin
#7 Jane Doe School_A IL Private 20100105
#4 John Doe School_B IL Private 20090105
#3 John Doe School_A NY Public 20050829
# EnrollmentEnd EnrollmentEnd.new
#7 20100301 20101025
#4 20090301 20091025
#3 20051223 20060513
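If a student's return to an earlier school should count as a separate stint, one option in base R is to look for the rows where the identifying columns differ from the previous row, rather than splitting on the column values. This is only a sketch: it assumes the rows are in chronological order per student, and key, starts, first and last are illustrative names.

```r
# Sample data from the question, in the original row order
dat <- data.frame(
  FirstName       = c(rep("John", 6), rep("Jane", 3), rep("John", 2)),
  LastName        = "Doe",
  CollegeName     = c(rep("School A", 3), rep("School B", 3), rep("School A", 5)),
  State           = c(rep("NY", 3), rep("IL", 6), rep("NY", 2)),
  PublicPrivate   = c(rep("Public", 3), rep("Private", 6), rep("Public", 2)),
  EnrollmentBegin = c(20050829L, 20051229L, 20051223L, 20090105L, 20090706L,
                      20090831L, 20100105L, 20100706L, 20100831L, 20110829L,
                      20120129L),
  EnrollmentEnd   = c(20051223L, 20060113L, 20060513L, 20090301L, 20090830L,
                      20091025L, 20100301L, 20100830L, 20101025L, 20111223L,
                      20120513L),
  stringsAsFactors = FALSE
)

# Collapse the five identifying columns into one string per row
key <- do.call(paste, c(dat[1:5], sep = "\r"))

# TRUE where a row differs from the previous one, i.e. where a new stint begins
starts <- c(TRUE, key[-1] != key[-length(key)])

first <- which(starts)                   # first row of each stint
last  <- c(first[-1] - 1L, nrow(dat))    # last row of each stint

# Keep the first row of each stint and attach that stint's last EnrollmentEnd
out <- dat[first, ]
out$EnrollmentEnd.new <- dat$EnrollmentEnd[last]
rownames(out) <- NULL
print(out)
```

Because this avoids splitting the data frame into many small pieces, it works on whole vectors, which may matter with 500,000 rows.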