使用多组度量列将数据框重塑为长格式

Reshape a dataframe to long format with multiple sets of measure columns

我有一个 R 数据框,是我使用 XML 包中的 readHTMLTable() 从互联网上抓取的。 table 看起来像下面的摘录,人口和年份有多个 variables/columns。 (请注意,年份不会跨列重复,并且代表人口的唯一标识符。)

        year1   pop1      year2   pop2     year3   pop3     
1                                                        
2       16XX    4675,0    1900    6453,0    1930   9981,2       
3       17XX    4739,3    1901    6553,5    1931   ...      
4       17XX    4834,0    1902    6684,0    1932   
5       180X    4930,0    1903    6818,0    1933        
6       180X    5029,0    1904    6955,0    1934        
7       181X    5129,0    1905    7094,0    1935
8       181X    5231,9    1906    7234,7    1936
9       182X    5297,0    1907    7329,0    1937
10      182X    5362,0    1908    7422,0    1938

我想将数据重新组织成两列,一列用于年份,一列用于人口,如下所示:

        year    pop     
1                                                        
2       16XX    4675,0
3       17XX    4739,3  
4       17XX    4834,0  
5       180X    4930,0
6       180X    5029,0  
7       181X    5129,0
8       181X    5231,9  
9       182X    5297,0
10      182X    5362,0  
11      1900    6453,0
12      1901    6553,5
13      1902    6684,0
...     ...     ...
21      1930    9981,2
22      ... 

来自 variables/columns year2year3 的值附加在 year1 下面,相应的人口值也是如此。

我考虑了以下几点:

(1) 遍历 population 和 year 列 (n>2) 并将这些值作为新观察值添加到 year1 和 population1 将起作用,但这似乎不必要地麻烦。

(2) 我试过如下 melt,但要么它无​​法处理跨多个列拆分的 id 变量,要么我没有正确实现它。

df.melt <- melt(df, id=c("year1", "year2",...)

(3) 最后,我考虑将每年的列作为自己的向量,并将这些向量中的每一个附加在一起,如下所示:

year.all <- c(df$year1, df$year2,...)

不过上面returns下面为year.all

[1]  1  2  3  3  4  4  5  5  6  6  7  8  8  9  9  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  1  1  2 ...

而不是这个

[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...

如果有一种直接的方法来完成这种重组,我很乐意学习。非常感谢您的帮助。

如果 'year'、'pop' 列交替出现,我们可以用 c(TRUE, FALSE) 进行子集化以获得第 1、3、5 等列。和 c(FALSE, TRUE) 得到 2, 4, 6,.. 由于回收。然后,我们 unlist 列并创建一个新的 'data.frame.

 df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]), 
                  pop=unlist(df1[c(FALSE, TRUE)]))
 row.names(df2) <- NULL
 head(df2)
 #   year    pop
 #1            
 #2 16XX 4675,0
 #3 17XX 4739,3
 #4 17XX 4834,0
 #5 180X 4930,0
 #6 180X 5029,0

或者另一种选择是

library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'), 
        sep='var.stubs')[order(.time_1), 3:4, with=FALSE]

数据

df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X", 
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0", 
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", 
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0", 
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "", 
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2", 
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))

使用 new feature in melt from data.table v1.9.5+:

require(data.table) # v1.9.5+
melt(setDT(df), measure = patterns("^year", "^pop"), value.name = c("year", "pop"))

您可以找到其余的小插图 here

另一种选择是使用split.default将数据帧拆分为数据帧列表,然后将它们绑定在一起:

lst <- lapply(split.default(df1, sub('.*(\d)', '\1', names(df1))),
              setNames, c('year','pop'))

do.call(rbind, lst)

给出了想要的结果:

    year     pop
1.1 16XX  4675,0
1.2 17XX  4739,3
1.3 17XX  4834,0
1.4 180X  4930,0
1.5 180X  5029,0
1.6 181X  5129,0
1.7 181X  5231,9
1.8 182X  5297,0
1.9 182X  5362,0
2.1 1900  6453,0
2.2 1901  6553,5
2.3 1902  6684,0
2.4 1903  6818,0
2.5 1904  6955,0
2.6 1905  7094,0
2.7 1906  7234,7
2.8 1907  7329,0
2.9 1908  7422,0
3.1 1930  9981,2
3.2 1931 10583,5
3.3 1932  8671,0
3.4 1933  9118,0
3.5 1934  9625,0
3.6 1935  8097,0
3.7 1936  7984,7
3.8 1937  8729,0
3.9 1938 10462,0

您还可以在最后一步使用 data.table 包中的 rbindlist

library(data.table)
rbindlist(lst)

已用数据:

df1 <- structure(list(year1 = c("16XX", "17XX", "17XX", "180X", "180X", "181X", "181X", "182X", "182X"),
                      pop1 = c("4675,0", "4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", "5362,0"),
                      year2 = c(1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 1906L, 1907L, 1908L),
                      pop2 = c("6453,0", "6553,5", "6684,0", "6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
                      year3 = c(1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 1937L, 1938L),
                      pop3 = c("9981,2", "10583,5", "8671,0", "9118,0", "9625,0", "8097,0", "7984,7", "8729,0", "10462,0")),
                 .Names = c("year1", "pop1", "year2", "pop2", "year3", "pop3"), class = "data.frame", row.names = c(NA, -9L))