(statistics) 2-way table normalization

I have a table like this.

    X  X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
1  SU 103.27 105.2  99.7 106.7  96.7 108.4  88.7 73.67
2  BS 100.17 104.5  97.6 103.6  91.7 106.2  85.5 73.66
3  DG 101.00 102.5  98.9 101.1  91.2 106.2  80.9 75.67
4  IC  97.80 103.4  97.2 102.4  88.4 103.3  85.7 70.00
5  DJ 106.20 103.1  99.1  97.7  90.7 106.2  77.5 74.00
6  GJ  97.47 101.7  98.6 101.2  89.9 105.6  81.7 73.33
7  US  99.80 105.6  98.2   0.0  81.7 103.6  84.3 68.00
8  GG  98.13 105.7  98.6 103.7  92.2 105.2  85.9 73.66
9  GO  96.13 101.2  96.8 101.7  86.4 105.7  78.1 72.66
10 CB 104.20 105.2 101.5 100.3  88.3 106.2  78.8 72.00
11 CN 107.20  95.0  96.1  98.7  88.2 103.7  78.5 71.33
12 GB  98.87 102.0  95.3 100.2  87.2 104.2  78.5 70.33
13 GN  99.57 103.3  95.6 102.6  89.2 103.7  83.2 72.00
14 JB  99.60  96.2  98.2  96.2  86.2 101.7  84.5 71.34
15 JN  93.83  98.6  98.8  95.2  87.2 102.7  83.9 70.33
16 JJ  93.63 101.7  93.2  98.1   0.0   0.0  83.9 71.00
17 SJ   0.00   0.0   0.0   0.0   0.0 106.5  81.9 73.34

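These are yearly test scores for some provinces in South Korea. Until 2013 the test scores were bounded in [0, 110], but in 2014 the range changed to [0, 100].

My objective is to normalize the test scores to a common range, or ideally to some standardized scale.

Maybe I could first rescale the 2008 to 2013 scores to a 0-100 percentage scale, and then subtract the column mean and divide by each column's standard deviation. But that only standardizes within each column.

Is there any way to normalize (or standardize) the test scores across the whole table?

By the way, a test score of 0 means no test was taken, so it must be ignored during normalization. For your convenience, here is the data in csv format: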

,2008,2009,2010,2011,2012,2013,2014,2015
SU,103.27,105.2,99.7,106.7,96.7,108.4,88.7,73.67
BS,100.17,104.5,97.6,103.6,91.7,106.2,85.5,73.66
DG,101,102.5,98.9,101.1,91.2,106.2,80.9,75.67
IC,97.8,103.4,97.2,102.4,88.4,103.3,85.7,70
DJ,106.2,103.1,99.1,97.7,90.7,106.2,77.5,74
GJ,97.47,101.7,98.6,101.2,89.9,105.6,81.7,73.33
US,99.8,105.6,98.2,0,81.7,103.6,84.3,68
GG,98.13,105.7,98.6,103.7,92.2,105.2,85.9,73.66
GO,96.13,101.2,96.8,101.7,86.4,105.7,78.1,72.66
CB,104.2,105.2,101.5,100.3,88.3,106.2,78.8,72
CN,107.2,95,96.1,98.7,88.2,103.7,78.5,71.33
GB,98.87,102,95.3,100.2,87.2,104.2,78.5,70.33
GN,99.57,103.3,95.6,102.6,89.2,103.7,83.2,72
JB,99.6,96.2,98.2,96.2,86.2,101.7,84.5,71.34
JN,93.83,98.6,98.8,95.2,87.2,102.7,83.9,70.33
JJ,93.63,101.7,93.2,98.1,0,0,83.9,71
SJ,0,0,0,0,0,106.5,81.9,73.34 

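I think the best approach is probably to convert the columns that are on the [0, 110] scale (2008 to 2013) to the [0, 100] scale. That way everything is on the same scale. To do that:

Data: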

df <- read.table(header=T, text='    X  X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
1  SU 103.27 105.2  99.7 106.7  96.7 108.4  88.7 73.67
2  BS 100.17 104.5  97.6 103.6  91.7 106.2  85.5 73.66
3  DG 101.00 102.5  98.9 101.1  91.2 106.2  80.9 75.67
4  IC  97.80 103.4  97.2 102.4  88.4 103.3  85.7 70.00
5  DJ 106.20 103.1  99.1  97.7  90.7 106.2  77.5 74.00
6  GJ  97.47 101.7  98.6 101.2  89.9 105.6  81.7 73.33
7  US  99.80 105.6  98.2   0.0  81.7 103.6  84.3 68.00
8  GG  98.13 105.7  98.6 103.7  92.2 105.2  85.9 73.66
9  GO  96.13 101.2  96.8 101.7  86.4 105.7  78.1 72.66
10 CB 104.20 105.2 101.5 100.3  88.3 106.2  78.8 72.00
11 CN 107.20  95.0  96.1  98.7  88.2 103.7  78.5 71.33
12 GB  98.87 102.0  95.3 100.2  87.2 104.2  78.5 70.33
13 GN  99.57 103.3  95.6 102.6  89.2 103.7  83.2 72.00
14 JB  99.60  96.2  98.2  96.2  86.2 101.7  84.5 71.34
15 JN  93.83  98.6  98.8  95.2  87.2 102.7  83.9 70.33
16 JJ  93.63 101.7  93.2  98.1   0.0   0.0  83.9 71.00
17 SJ   0.00   0.0   0.0   0.0   0.0 106.5  81.9 73.34')

You can do this:

df[2:7] <- lapply(df[2:7], function(x) {
   x / 110 * 100 
})

Basically you divide by 110, which is the maximum of the [0, 110] range, to convert to a [0, 1] range, and then multiply by 100 to convert to a [0, 100] range. Columns 2 to 7 are X2008 through X2013, the years that were scored out of 110.

Output:

> df
    X    X2008    X2009    X2010    X2011    X2012    X2013 X2014 X2015
1  SU 93.88182 95.63636 90.63636 97.00000 87.90909 98.54545  88.7 73.67
2  BS 91.06364 95.00000 88.72727 94.18182 83.36364 96.54545  85.5 73.66
3  DG 91.81818 93.18182 89.90909 91.90909 82.90909 96.54545  80.9 75.67
4  IC 88.90909 94.00000 88.36364 93.09091 80.36364 93.90909  85.7 70.00
5  DJ 96.54545 93.72727 90.09091 88.81818 82.45455 96.54545  77.5 74.00
6  GJ 88.60909 92.45455 89.63636 92.00000 81.72727 96.00000  81.7 73.33
7  US 90.72727 96.00000 89.27273  0.00000 74.27273 94.18182  84.3 68.00
8  GG 89.20909 96.09091 89.63636 94.27273 83.81818 95.63636  85.9 73.66
9  GO 87.39091 92.00000 88.00000 92.45455 78.54545 96.09091  78.1 72.66
10 CB 94.72727 95.63636 92.27273 91.18182 80.27273 96.54545  78.8 72.00
11 CN 97.45455 86.36364 87.36364 89.72727 80.18182 94.27273  78.5 71.33
12 GB 89.88182 92.72727 86.63636 91.09091 79.27273 94.72727  78.5 70.33
13 GN 90.51818 93.90909 86.90909 93.27273 81.09091 94.27273  83.2 72.00
14 JB 90.54545 87.45455 89.27273 87.45455 78.36364 92.45455  84.5 71.34
15 JN 85.30000 89.63636 89.81818 86.54545 79.27273 93.36364  83.9 70.33
16 JJ 85.11818 92.45455 84.72727 89.18182  0.00000  0.00000  83.9 71.00
17 SJ  0.00000  0.00000  0.00000  0.00000  0.00000 96.81818  81.9 73.34

Now you can compare the years. Also, note that the zeros remain zero.
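
If you also want the zeros ignored and a single standardization over the whole table, as the question asks, something along these lines might work. This is only a sketch under a few assumptions: it starts again from the raw csv in the question, treats every 0 as a missing value (`NA`), takes 2008 to 2013 as the years scored out of 110, and the names `raw`, `scores`, `old_scale`, `z_all` and `z_year` are made up for illustration.

```r
# Read the raw csv from the question; read.csv names the columns X, X2008, ..., X2015.
raw <- read.csv(text = ",2008,2009,2010,2011,2012,2013,2014,2015
SU,103.27,105.2,99.7,106.7,96.7,108.4,88.7,73.67
BS,100.17,104.5,97.6,103.6,91.7,106.2,85.5,73.66
DG,101,102.5,98.9,101.1,91.2,106.2,80.9,75.67
IC,97.8,103.4,97.2,102.4,88.4,103.3,85.7,70
DJ,106.2,103.1,99.1,97.7,90.7,106.2,77.5,74
GJ,97.47,101.7,98.6,101.2,89.9,105.6,81.7,73.33
US,99.8,105.6,98.2,0,81.7,103.6,84.3,68
GG,98.13,105.7,98.6,103.7,92.2,105.2,85.9,73.66
GO,96.13,101.2,96.8,101.7,86.4,105.7,78.1,72.66
CB,104.2,105.2,101.5,100.3,88.3,106.2,78.8,72
CN,107.2,95,96.1,98.7,88.2,103.7,78.5,71.33
GB,98.87,102,95.3,100.2,87.2,104.2,78.5,70.33
GN,99.57,103.3,95.6,102.6,89.2,103.7,83.2,72
JB,99.6,96.2,98.2,96.2,86.2,101.7,84.5,71.34
JN,93.83,98.6,98.8,95.2,87.2,102.7,83.9,70.33
JJ,93.63,101.7,93.2,98.1,0,0,83.9,71
SJ,0,0,0,0,0,106.5,81.9,73.34")

# Work on a numeric matrix with the province codes as row names.
scores <- as.matrix(raw[-1])
rownames(scores) <- as.character(raw[[1]])

# Assumption: a score of 0 means "no test", so mark it as missing.
scores[scores == 0] <- NA

# Assumption: 2008-2013 were scored out of 110; bring them onto the 0-100 scale.
old_scale <- paste0("X", 2008:2013)
scores[, old_scale] <- scores[, old_scale] / 110 * 100

# Option 1: one z-score over the whole table (a single mean and sd for all cells).
z_all <- (scores - mean(scores, na.rm = TRUE)) / sd(scores, na.rm = TRUE)

# Option 2: z-score within each year; scale() skips NA when computing mean and sd.
z_year <- scale(scores)
```

Which option fits depends on what you want to compare. The whole-table z-score keeps differences between years, so a year with generally lower scores (2015 here) stays lower for every province. The per-year version removes those differences and only shows where a province stands relative to the others in the same year.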