Count concurrent subscriptions

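I have a database of people who subscribe to a service, possibly several times and possibly with subscriptions running concurrently, along with transaction data for each event during a subscription. I am trying to create a variable that counts how many subscriptions a user has active at the time of a given transaction.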

For example, my data look like this:

person | subscription | obs_date | sub_start_date | sub_end_date | num_concurrent_subs
--------------------------------------------------------------------------------------
1      | 1            | 09/01/10 | 09/01/10       | 09/01/11     | 1
1      | 1            | 10/01/10 | 09/01/10       | 09/01/11     | 2
1      | 1            | 11/01/10 | 09/01/10       | 09/01/11     | 2
1      | 2            | 10/01/10 | 10/01/10       | 09/01/11     | 2
1      | 2            | 11/01/10 | 10/01/10       | 09/01/11     | 2
1      | 3            | 11/01/14 | 09/01/14       | .            | 1
1      | 3            | 11/01/16 | 09/01/14       | .            | 1
1      | 4            | 11/01/15 | 10/01/15       | 11/01/15     | 3
1      | 5            | 11/01/15 | 10/01/15       | 11/01/15     | 3

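And so on for each person. I want to generate num_concurrent_subs as shown above.

That is, for each person, look at each observation and count how many of that person's subscriptions have a sub_start_date to sub_end_date range that contains obs_date.

I have read a bit about Stata's count function and believe I am close to a solution, but I am not sure how to check the condition across different subscriptions.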

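You can do this by separating the subscription information from the transaction data and converting the subscription data to a long layout, with one observation for the start date and another for the end date. You then recombine with the transaction data and sort by a single date variable. An onoff variable tracks the start (+1) and end (-1) of each subscription, so its running sum within person is the number of active subscriptions. Something like: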

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(person subscription) str8(obs_date sub_start_date sub_end_date) byte num_concurrent_subs
1 1 "09/01/10" "09/01/10" "09/01/11" 1
1 1 "10/01/10" "09/01/10" "09/01/11" 2
1 1 "11/01/10" "09/01/10" "09/01/11" 2
1 2 "10/01/10" "10/01/10" "09/01/11" 2
1 2 "11/01/10" "10/01/10" "09/01/11" 2
1 3 "11/01/14" "09/01/14" "."        1
1 3 "11/01/16" "09/01/14" "."        1
1 4 "11/01/15" "10/01/15" "11/01/15" 3
1 5 "11/01/15" "10/01/15" "11/01/15" 3
end

* should always have an observation identifier
gen obsid = _n

* convert string to Stata numeric dates
gen odate = daily(obs_date,"MD20Y")
gen substart = daily(sub_start_date,"MD20Y")
gen subend = daily(sub_end_date,"MD20Y")
format %td odate substart subend
save "main_data.dta", replace

* reduce to subscription info with one obs for the start and one obs
* for the end of each subscription. use an onoff variable to track
* start and end events
keep person subscription substart subend
bysort person subscription substart subend: keep if _n == 1
expand 2
bysort person subscription: gen adate = cond(_n == 1, substart, subend)
by person subscription: gen onoff = cond(_n == 1, 1, -1)
replace onoff = 0 if mi(adate)
format %td adate

append using "main_data.dta"

* transactions use the obs date as their event date; they are neutral
* events (onoff = 0), so they do not change the running count
replace adate = odate if !mi(obsid)
replace onoff = 0 if !mi(obsid)

* order by person adate, put on event first, then obs events, then off events
gsort person adate -onoff
by person: gen concur = sum(onoff)

* return to original obs
keep if !mi(obsid)
sort obsid
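
The ordering in the gsort step is what makes both boundary dates inclusive: on any given date, starts are counted before transactions, and ends after them. Here is a minimal sketch of that idea on made-up events (the day numbers and event labels are invented for illustration and are not part of the original data):

```stata
* Toy illustration only: hypothetical events for one person
clear
input byte person int day byte onoff str12 event
1 1  1 "sub A starts"
1 3  1 "sub B starts"
1 3  0 "transaction"
1 5 -1 "sub A ends"
1 5  0 "transaction"
1 9 -1 "sub B ends"
end

* within a day: starts (+1) first, transactions (0) next, ends (-1) last
gsort person day -onoff

* running sum of on/off events = subscriptions active at that point
by person: gen concur = sum(onoff)

list day event onoff concur, sepby(day) noobs

* both transactions fall on a boundary day and both should see 2 active
assert concur == 2 if event == "transaction"
```

If ends were sorted before transactions, the transaction on day 5 would see only one active subscription, which would not match the expected counts in the question (for example, the 11/01/15 observations).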

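Here is another way to do it, using rangejoin (from SSC). rangejoin also requires rangestat (also from SSC). To install both, type the following in Stata's Command window:

ssc install rangestat
ssc install rangejoin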

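With rangejoin, you can pair each subscription with every transaction observation whose date falls between the subscription's start and end dates. Then it is just a matter of counting, for each transaction observation, how many subscriptions it was paired with.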

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(person subscription) str8(obs_date sub_start_date sub_end_date) byte num_concurrent_subs
1 1 "09/01/10" "09/01/10" "09/01/11" 1
1 1 "10/01/10" "09/01/10" "09/01/11" 2
1 1 "11/01/10" "09/01/10" "09/01/11" 2
1 2 "10/01/10" "10/01/10" "09/01/11" 2
1 2 "11/01/10" "10/01/10" "09/01/11" 2
1 3 "11/01/14" "09/01/14" "."        1
1 3 "11/01/16" "09/01/14" "."        1
1 4 "11/01/15" "10/01/15" "11/01/15" 3
1 5 "11/01/15" "10/01/15" "11/01/15" 3
end

* should always have an observation identifier
gen obsid = _n

* convert string to Stata numeric dates
gen odate = daily(obs_date,"MD20Y")
gen substart = daily(sub_start_date,"MD20Y")
gen subend = daily(sub_end_date,"MD20Y")
format %td odate substart subend
save "main_data.dta", replace

* reduce to subscription start and end date per person
bysort person subscription substart subend: keep if _n == 1
keep person substart subend

* missing values will exclude obs so use a date in the future
replace subend = mdy(1,1,2099) if mi(subend)

* pair each subscription with an obs date
rangejoin odate substart subend using "main_data.dta", by(person)

* guard: drop subscriptions left unpaired with any transaction, if any
drop if mi(obsid)

* the number of current subscriptions is the number of pairings
bysort obsid: gen current = _N

* return to original obs
by obsid: keep if _n == 1
sort obsid
drop substart subend
rename (substart_U subend_U) (substart subend)
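
If installing from SSC is not an option, the same pairing idea can be sketched with the official joinby command. This is only a cross-check under stated assumptions, not the method used above: the names s_id, s_start, s_end and check are made up for this sketch, and because joinby forms every transaction-subscription pair within person before filtering, it may be slow or memory-hungry on large data.

```stata
* Sketch only: brute-force cross-check with official -joinby-
clear
input byte(person subscription) str8(obs_date sub_start_date sub_end_date) byte num_concurrent_subs
1 1 "09/01/10" "09/01/10" "09/01/11" 1
1 1 "10/01/10" "09/01/10" "09/01/11" 2
1 1 "11/01/10" "09/01/10" "09/01/11" 2
1 2 "10/01/10" "10/01/10" "09/01/11" 2
1 2 "11/01/10" "10/01/10" "09/01/11" 2
1 3 "11/01/14" "09/01/14" "."        1
1 3 "11/01/16" "09/01/14" "."        1
1 4 "11/01/15" "10/01/15" "11/01/15" 3
1 5 "11/01/15" "10/01/15" "11/01/15" 3
end

gen obsid = _n
gen odate = daily(obs_date, "MD20Y")
gen substart = daily(sub_start_date, "MD20Y")
gen subend = daily(sub_end_date, "MD20Y")
format %td odate substart subend

* one row per subscription, saved under new (hypothetical) names
preserve
keep person subscription substart subend
duplicates drop
rename (subscription substart subend) (s_id s_start s_end)
tempfile subs
save `subs'
restore

* all transaction x subscription pairs within person
joinby person using `subs'

* keep pairs where the transaction date lies inside the subscription
* window; a missing end date is treated as still open
keep if odate >= s_start & (odate <= s_end | mi(s_end))

* count pairings per transaction and return to one row per transaction
bysort obsid: gen check = _N
by obsid: keep if _n == 1
sort obsid

* compare with the expected counts given in the question
assert check == num_concurrent_subs
```

Note that a transaction with no active subscription at all would be dropped by the keep step rather than counted as zero; that case does not occur in the example data.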