How to calculate a time-varying historical mean with Stata
How can I calculate the mean of X using an expanding window with at least four observations?
Here is a numerical example:
clear
input X
50.735469
48.278413
42.807671
49.247854
52.20223
49.726689
50.823169
49.099351
48.949562
47.410434
46.654168
44.924652
43.807024
45.679814
48.366395
49.883396
48.230502
49.869179
53.942757
56.167884
56.226512
56.25608
58.765728
62.077038
62.780799
61.858235
61.167646
60.671859
60.480263
60.226433
61.65349
60.769882
61.497553
60.146182
60.292934
60.173739
58.60077
58.445601
60.404868
end
A time-varying mean in an expanding window is simply the mean of all values from the start of the records to the current date. You don't give a time variable, so I assume the data are in order and supply one.
The community-contributed command rangestat (to be installed from SSC using ssc install rangestat) can give the mean of all values to date in this way:
clear
input X
50.735469
48.278413
42.807671
49.247854
52.20223
49.726689
50.823169
49.099351
48.949562
47.410434
end
gen t = _n
rangestat (count) X (mean) X, int(t . 0)
list
+-------------------------------------+
| X t X_count X_mean |
|-------------------------------------|
1. | 50.73547 1 1 50.73547 |
2. | 48.27841 2 2 49.506941 |
3. | 42.80767 3 3 47.273851 |
4. | 49.24785 4 4 47.767351 |
5. | 52.20223 5 5 48.654327 |
|-------------------------------------|
6. | 49.72669 6 6 48.833054 |
7. | 50.82317 7 7 49.117356 |
8. | 49.09935 8 8 49.115105 |
9. | 48.94956 9 9 49.096711 |
10. | 47.41043 10 10 48.928084 |
+-------------------------------------+
Evidently you can ignore results for small counts as you please.
The syntax is naturally explained in the help for rangestat: suffice it to say here that the syntax for the option, namely interval(t . 0), is three-fold:
- the time variable t
- the backward offset: system missing . here means arbitrarily large, so backwards as far as possible
- the forward offset: just 0
In mathematical terms, the window for each observation runs from minus infinity, or as far back as the data go, to offset 0, the present time.
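To make the role of the two offsets concrete, the sketch below contrasts the expanding window with a fixed window covering the current and three previous time points, interval(t -3 0). The fixed window is not part of the answer above and is shown only for comparison; the variable names mean_all and mean_last4 are made up for illustration, and rangestat is assumed to be installed from SSC.

```stata
* Sketch only: compare an expanding window with a fixed 4-period window.
* Assumes rangestat is installed (ssc install rangestat).
clear
input X
50.735469
48.278413
42.807671
49.247854
52.20223
49.726689
end
gen t = _n

* Expanding window: from as far back as possible up to the current time
rangestat (mean) X, interval(t . 0)
rename X_mean mean_all        // hypothetical name

* Fixed window: times t-3 through t only
rangestat (mean) X, interval(t -3 0)
rename X_mean mean_last4      // hypothetical name

list t X mean_all mean_last4
```

The two means should agree for the first four observations and differ from the fifth onwards, where the fixed window starts dropping the earliest values.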
The count result is the number of observations in the window with non-missing values on X. Here, as the time variable runs 1, 2, 3 and so on, the count is trivially the same as the time variable, but in real problems the time variable is much more likely to be a date of some kind. Unlike some other commands, rangestat doesn't have an option to insist on a minimum number of points with non-missing values in a window, but you can count how many there are and decide to ignore results based on too few data. That is left to the user here.
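For the question's requirement of at least four observations, one way to apply that rule after the fact is to keep the mean only where the count is large enough. This is a minimal sketch on the first few values of the example data; the variable name X_mean4 is made up for illustration, and rangestat is assumed to be installed from SSC.

```stata
* Sketch only: expanding mean, kept only when at least 4 values contribute.
* Assumes rangestat is installed (ssc install rangestat).
clear
input X
50.735469
48.278413
42.807671
49.247854
52.20223
49.726689
end
gen t = _n

* Count and mean of all non-missing values of X up to the current time
rangestat (count) X (mean) X, interval(t . 0)

* Copy the mean only where the window holds 4 or more values;
* X_mean4 (hypothetical name) is missing otherwise
gen double X_mean4 = X_mean if X_count >= 4

list
```

Using a new variable rather than overwriting X_mean keeps the unrestricted results available for checking.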
Incidentally, you could make a good start on this kind of problem by working out a cumulative sum and then dividing by the number of values so far. That needs care with (e.g.) gaps in data, irregularly spaced data or missing values, and a virtue of rangestat is that all such difficulties are considered.
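The cumulative-sum route can be sketched with Stata's sum() function, which returns a running sum and treats missing values as zero. The sketch assumes the data are already in time order with one observation per time point; the variable names run_sum, run_n and run_mean are made up for illustration.

```stata
* Sketch only: expanding mean from running sums.
* Assumes one observation per time point, regularly spaced.
clear
input X
50.735469
48.278413
42.807671
49.247854
52.20223
49.726689
end
gen t = _n
sort t

gen double run_sum = sum(X)             // running sum; missing counts as 0
gen long   run_n   = sum(!missing(X))   // running count of non-missing values

* Mean to date, kept only once at least 4 values are available
gen double run_mean = run_sum / run_n if run_n >= 4

list
```

With panel data the same idea would presumably be applied within groups, for example with a bysort id (t): prefix, where id is a hypothetical panel identifier.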