Count cumulative unique factors separated by semicolons, grouped by Name
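This is what my data frame looks like. The two rightmost columns are the ones I want. I am counting the cumulative number of unique fund types as of each row. The 4th column is the cumulative unique count across all values of "ActivityType", and the 5th column is the cumulative unique count for rows where ActivityType == "Sale" only.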
dt <- read.table(text='
Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4
', header=T, row.names = NULL)
I tried dt[, UniqueFunds := cumsum(!duplicated(FundType) & !FundType == ""), by = Name]
but, for example, it counts a, a;b and c;d as 3 unique values instead of the desired 4, because the factors are separated by semicolons. Kindly let me know a solution.
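To make the miscount concrete, here is a minimal base R sketch. The vector name fund_type and its three values are only an illustration taken from the first rows above. duplicated() compares whole strings, so "a;b" counts as one new value rather than one new fund.

```r
# Illustrative values only: the first three FundType entries for one Name
fund_type <- c("a", "a;b", "c;d")

# duplicated() sees three distinct strings, so the running count ends at 3
whole_string_count <- cumsum(!duplicated(fund_type))

# Splitting on ";" first and counting the distinct tokens seen so far ends at 4
tokens <- strsplit(fund_type, ";", fixed = TRUE)
token_count <- sapply(seq_along(tokens), function(i) {
  length(unique(unlist(tokens[seq_len(i)])))
})

print(whole_string_count)  # 1 2 3
print(token_count)         # 1 2 4
```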
Update: my real data set looks more like this:
dt <- read.table(text='
Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
John Email "" 0 0
John Conference "" 0 0
John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
John Email "" 5 3
John Webinar "" 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4
', header=T, row.names = NULL)
The cumulative unique counts need to account for missing values.
I think this is one way to achieve what you are after. First, add an auxiliary index variable to preserve the input order, and key the table on Name:
Dt <- copy(dt[, 1:3, with = FALSE])[, gIdx := 1:.N, by = "Name"]
setkeyv(Dt, "Name")
For clarity, I used this function
n_usplit <- function(x, spl = ";") length(unique(unlist(strsplit(x, split = spl))))
rather than typing out its body inline; the operation below is dense enough without a pile of nested function calls making it harder to follow.
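As a quick check of what the helper returns, here is a small sketch that calls it on a few hand-made inputs (the inputs and result names are illustrative, not from the data set). Note that strsplit() needs a character vector, so a factor column would have to go through as.character() first.

```r
n_usplit <- function(x, spl = ";") length(unique(unlist(strsplit(x, split = spl))))

# Distinct tokens across several semicolon-separated strings
count_basic <- n_usplit(c("a", "a;b", "c;d"))

# An empty string contributes no tokens, because strsplit("", ";") yields character(0)
count_with_empty <- n_usplit(c("", "a"))

# An empty input (for example, no "Sale" rows yet) gives zero
count_none <- n_usplit(character(0))

print(c(count_basic, count_with_empty, count_none))  # 4 1 0
```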
Finally,
Dt[Dt, allow.cartesian = TRUE][
gIdx <= i.gIdx,
.("UniqueFunds(AllTypes)" = n_usplit(FundType),
"UniqueFunds(SaleOnly)" = n_usplit(FundType[ActivityType == "Sale"])),
keyby = "Name,i.gIdx,i.ActivityType,i.FundType"][,-2, with = FALSE]
# Name i.ActivityType i.FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
# 1: John Email a 1 0
# 2: John Sale a;b 2 2
# 3: John Webinar c;d 4 2
# 4: John Sale b 4 2
# 5: John Webinar e 5 2
# 6: John Conference b;d 5 2
# 7: John Sale b;e 5 3
# 8: Tom Email a 1 0
# 9: Tom Sale a;b 2 2
# 10: Tom Webinar c;d 4 2
# 11: Tom Sale b 4 2
# 12: Tom Webinar e 5 2
# 13: Tom Conference b;d 5 2
# 14: Tom Sale b;e;f 6 4
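I feel I could explain this more easily in SQL, but here goes:
- Self-join Dt (on Name)
- Use the extra index column (gIdx) to consider only the preceding rows in the sequence (inclusive); this produces a sort of cumulative effect (for lack of a better term)
- Compute the UniqueFunds(...) columns; note the extra subsetting done in the second case: n_usplit(FundType[ActivityType == "Sale"])
- Drop the extraneous index column (i.gIdx).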
Because this uses a Cartesian join, I am not sure how well it will scale, so hopefully your real data set is not millions of rows.
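If the join does turn out to be too heavy, one possible alternative is to make a single pass per group while keeping a running set of the funds seen so far. The following is a base R sketch, not part of the approach above; cum_unique() and cum_unique_by() are hypothetical helper names, and the vectors simply restate the sample data.

```r
# Hypothetical helper: running count of distinct tokens within one group
cum_unique <- function(x, spl = ";") {
  seen <- character(0)
  vapply(strsplit(x, spl, fixed = TRUE), function(tok) {
    seen <<- union(seen, tok)  # grow the set of funds seen so far
    length(seen)
  }, integer(1))
}

# Hypothetical helper: apply cum_unique() per group, keeping the original row order
cum_unique_by <- function(x, group) {
  unsplit(lapply(split(x, group), cum_unique), group)
}

# Sample data restated as plain vectors
name <- rep(c("John", "Tom"), each = 7)
activity <- rep(c("Email", "Sale", "Webinar", "Sale", "Webinar", "Conference", "Sale"), times = 2)
fund <- c("a", "a;b", "c;d", "b", "e", "b;d", "b;e",
          "a", "a;b", "c;d", "b", "e", "b;d", "b;e;f")

all_types <- cum_unique_by(fund, name)
# Blank out non-Sale rows so they add nothing to the running set
sale_only <- cum_unique_by(ifelse(activity == "Sale", fund, ""), name)

print(all_types)  # 1 2 4 4 5 5 5 1 2 4 4 5 5 6
print(sale_only)  # 0 2 2 2 2 2 3 0 2 2 2 2 2 4
```

Blanking out the non-Sale rows keeps the row count unchanged, so both results line up with the original rows and could be attached as columns.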
Data:
library(data.table)
##
dt <- fread('
Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4
', header = TRUE)
I achieved what you want as follows:
library(data.table)
library(stringr)
dt <- data.table(read.table(text='
Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4
', header=T, row.names = NULL))
dt[,UniqueFunds.AllTypes. := NULL][,UniqueFunds.SaleOnly. := NULL]
#Get the different Fund Types
vals <- unique(unlist(str_extract_all(dt$FundType,"[a-z]")))
#Construct a new set of columns indicating which fund types are present
dt[, (vals) := data.table(1 * t(sapply(FundType, str_detect, vals)))]
#Calculate UniqueFunds.AllTypes
dt[, UniqueFunds.AllTypes. :=
rowSums(sapply(.SD, cummax)), .SDcols = vals, by = Name]
#Calculate only when ActivityType == "Sale" and use cummax to achieve desired output
dt[,UniqueFunds.SaleOnly. := 0
][ActivityType == "Sale", UniqueFunds.SaleOnly. :=
rowSums(sapply(.SD, cummax)), .SDcols = vals, by = Name
][,UniqueFunds.SaleOnly. := cummax(UniqueFunds.SaleOnly.), by = Name
]
#Cleanup vals
dt[, (vals) := NULL]
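The idea behind the code above (one 0/1 indicator column per fund, cummax() down each column, then a row sum) can be sketched for a single Name in base R. This is only an illustration with made-up variable names. It matches whole tokens with %in% instead of a regular expression, which is an assumption on my part to keep fund names such as "a" and "ab" apart if they ever occur.

```r
# FundType values for one Name, as in the sample data
fund <- c("a", "a;b", "c;d", "b", "e", "b;d", "b;e")

tokens <- strsplit(fund, ";", fixed = TRUE)
vals <- unique(unlist(tokens))  # the distinct fund types: a b c d e

# One row per record, one 0/1 column per fund type
ind <- t(vapply(tokens, function(tok) as.integer(vals %in% tok),
                integer(length(vals))))
colnames(ind) <- vals

# cummax() turns "present in this row" into "seen in this row or earlier"
seen <- apply(ind, 2, cummax)

# Number of fund types seen so far, per row
cum_count <- rowSums(seen)
print(cum_count)  # 1 2 4 4 5 5 5
```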
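nrussell suggested a concise solution built on a custom function. Let me leave what I came up with. I tried to use cumsum() and duplicated(), as you did. I did two major operations: one for alltype and the other for saleonly. First, I created an index for each name. Then I split FundType and put the data in long format using cSplit() from the splitstackshape package. Then I chose the last row for each index number within each name. Finally, I selected only one column, alltype.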
library(splitstackshape)
library(zoo)
library(data.table)
setDT(dt)[, ind := 1:.N, by = "Name"]
cSplit(dt, "FundType", sep = ";", direction = "long")[,
alltype := cumsum(!duplicated(FundType)), by = "Name"][,
.SD[.N], by = c("Name", "ind")][, list(alltype)] -> alltype
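For readers without splitstackshape, the same long-format idea can be sketched in base R for a single Name. This is an illustration with made-up variable names, not a drop-in replacement for cSplit(). Note that a row whose FundType is empty produces no long rows at all, which is why empty values need extra handling with this approach.

```r
# FundType values for one Name, as in the sample data
fund <- c("a", "a;b", "c;d", "b", "e", "b;d", "b;e")
tokens <- strsplit(fund, ";", fixed = TRUE)

# Long format: one row per (original row index, fund type)
long <- data.frame(
  ind = rep(seq_along(tokens), lengths(tokens)),
  FundType = unlist(tokens)
)

# Running count of fund types not seen before
long$alltype <- cumsum(!duplicated(long$FundType))

# The value on the last long row of each index is the largest one for that index
alltype <- tapply(long$alltype, long$ind, max)
print(as.vector(alltype))  # 1 2 4 4 5 5 5
```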
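The second operation was for sale only. Basically, I repeated the same approach on the data subsetted for sales, which is ana. I also created a data set without sales, which is ana2. Then I created a list containing the two data sets (i.e., l) and bound them. I reordered the data set by Name and ind, took the last row for each name and index number, took care of NAs (carrying values forward and replacing the leading NA for each name with 0), and finally selected one column. The final operation was to combine the original dt, alltype, and saleonly.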
# data for sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType == "Sale"][,
saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana
# Data without sale
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType != "Sale"] -> ana2
# Combine ana and ana2
l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
.SD[.N], by = c("Name", "ind")][,
saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly
cbind(dt, alltype, saleonly)
Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
1: John Email a 1 0 1 1 0
2: John Sale a;b 2 2 2 2 2
3: John Webinar c;d 4 2 3 4 2
4: John Sale b 4 2 4 4 2
5: John Webinar e 5 2 5 5 2
6: John Conference b;d 5 2 6 5 2
7: John Sale b;e 5 3 7 5 3
8: Tom Email a 1 0 1 1 0
9: Tom Sale a;b 2 2 2 2 2
10: Tom Webinar c;d 4 2 3 4 2
11: Tom Sale b 4 2 4 4 2
12: Tom Webinar e 5 2 5 5 2
13: Tom Conference b;d 5 2 6 5 2
14: Tom Sale b;e;f 6 4 7 6 4
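A side note on the NA handling: because a cumulative count never decreases, the carry-forward step plus the zero fill can also be written with base R alone. The sketch below swaps in cummax() for zoo's na.locf(); that substitution relies on the monotonic assumption and would not be valid for arbitrary data. The input vector mimics what the sale-only column looks like for one Name before filling.

```r
# Illustrative input: sale-only counts for one Name, NA on the non-Sale rows
saleonly_raw <- c(NA, 2, NA, 2, NA, NA, 3)

# Treat NA as 0, then carry the running maximum forward
saleonly_filled <- cummax(replace(saleonly_raw, is.na(saleonly_raw), 0))
print(saleonly_filled)  # 0 2 2 2 2 2 3
```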
Edit
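For the new data set, I tried the following. Basically, I used the same approach for the saleonly part of this new data set; the revision is only in the alltype part. First, I added the index, replaced "" with NA, and subsetted the data to the rows with non-NA values. This is temp. The rest is identical to the previous answer. Next I wanted the rows with NAs in FundType, so I used setdiff(). Using rbindlist(), I combined the two data sets and created temp again. The rest is the same as the previous answer. The sale part has no changes. I hope this works for your real data.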
### all type
setDT(dt)[, ind := 1:.N, by = "Name"][,
FundType := replace(FundType, which(FundType == ""), NA)][FundType != ""] -> temp
cSplit(temp, "FundType", sep = ";", direction = "long")[,
alltype := cumsum(!duplicated(FundType)), by = "Name"] -> alltype
whatever <- list(setdiff(dt, temp), alltype)
rbindlist(whatever, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,.SD[.N], by = c("Name", "ind")][,
alltype := na.locf(alltype, na.rm = FALSE), by = "Name"][,
alltype := replace(alltype, is.na(alltype), 0)][, list(alltype)] -> alltype
### sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType == "Sale"][,
saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType != "Sale"] -> ana2
l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
.SD[.N], by = c("Name", "ind")][,
saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly
cbind(dt, alltype, saleonly)
Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
1: John Email NA 0 0 1 0 0
2: John Conference NA 0 0 2 0 0
3: John Email a 1 0 3 1 0
4: John Sale a;b 2 2 4 2 2
5: John Webinar c;d 4 2 5 4 2
6: John Sale b 4 2 6 4 2
7: John Webinar e 5 2 7 5 2
8: John Conference b;d 5 2 8 5 2
9: John Sale b;e 5 3 9 5 3
10: John Email NA 5 3 10 5 3
11: John Webinar NA 5 3 11 5 3
12: Tom Email a 1 0 1 1 0
13: Tom Sale a;b 2 2 2 2 2
14: Tom Webinar c;d 4 2 3 4 2
15: Tom Sale b 4 2 4 4 2
16: Tom Webinar e 5 2 5 5 2
17: Tom Conference b;d 5 2 6 5 2
18: Tom Sale b;e;f 6 4 7 6 4
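As a cross-check of the expected columns for John in the updated data, here is a compact base R sketch using Reduce() with accumulate = TRUE. It is not the method used above, and the helper name running_unique() is hypothetical. Empty strings need no special treatment here, since they split into zero tokens.

```r
# John's rows from the updated data, restated as plain vectors
activity <- c("Email", "Conference", "Email", "Sale", "Webinar", "Sale",
              "Webinar", "Conference", "Sale", "Email", "Webinar")
fund <- c("", "", "a", "a;b", "c;d", "b", "e", "b;d", "b;e", "", "")

# Hypothetical helper: size of the accumulated set of tokens after each row
running_unique <- function(x) {
  sets <- Reduce(union, strsplit(x, ";", fixed = TRUE),
                 accumulate = TRUE, simplify = FALSE)
  lengths(sets)
}

all_types <- running_unique(fund)
sale_only <- running_unique(ifelse(activity == "Sale", fund, ""))

print(all_types)  # 0 0 1 2 4 4 5 5 5 5 5
print(sale_only)  # 0 0 0 2 2 2 2 2 3 3 3
```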