Reshape a dataframe to long format with multiple sets of measure columns
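I have an R data frame that I scraped from the internet using readHTMLTable() from the XML package. The table looks like the excerpt below, with multiple variables/columns for population and year. (Note that the years are not duplicated across columns and serve as a unique identifier for population.)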
year1 pop1 year2 pop2 year3 pop3
1
2 16XX 4675,0 1900 6453,0 1930 9981,2
3 17XX 4739,3 1901 6553,5 1931 ...
4 17XX 4834,0 1902 6684,0 1932
5 180X 4930,0 1903 6818,0 1933
6 180X 5029,0 1904 6955,0 1934
7 181X 5129,0 1905 7094,0 1935
8 181X 5231,9 1906 7234,7 1936
9 182X 5297,0 1907 7329,0 1937
10 182X 5362,0 1908 7422,0 1938
I would like to reorganize the data into just two columns, one for year and one for population, like this:
year pop
1
2 16XX 4675,0
3 17XX 4739,3
4 17XX 4834,0
5 180X 4930,0
6 180X 5029,0
7 181X 5129,0
8 181X 5231,9
9 182X 5297,0
10 182X 5362,0
11 1900 6453,0
12 1901 6553,5
13 1902 6684,0
... ... ...
21 1930 9981,2
22 ...
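The values from the variables/columns year2 and year3 are appended below year1, and likewise for the corresponding population values.
I have considered the following:
(1) Looping over the population and year columns (n>2) and adding those values as new observations to year1 and population1 would work, but this seems unnecessarily cumbersome.
(2) I tried melt as below, but either it cannot handle an id variable that is split across multiple columns, or I am not implementing it correctly.
df.melt <- melt(df, id=c("year1", "year2",...))
(3) Lastly, I considered pulling out each year column as its own vector and appending those vectors together, like this:
year.all <- c(df$year1, df$year2,...)
However, the above returns the following for year.all
[1] 1 2 3 3 4 4 5 5 6 6 7 8 8 9 9 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 1 2 ...
rather than this
[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...
If there is a straightforward way of accomplishing this reorganization, I would love to learn it. Many thanks for your help.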
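The integer output in attempt (3) is what one would expect if the scraped columns were stored as factors. That is an assumption here, since the question does not show str(df). In that case the numbers are the internal level codes rather than the labels, and converting each column with as.character() before combining recovers the labels. A minimal sketch with made-up stand-in vectors (year1 and year2 below are hypothetical, not the scraped data):

```r
# Hypothetical stand-ins for two scraped columns stored as factors
year1 <- factor(c("16XX", "17XX", "17XX", "180X"))
year2 <- factor(c("1900", "1901", "1902", "1903"))

# The integer codes behind each factor: positions in levels(), not the labels
codes <- c(as.integer(year1), as.integer(year2))

# Converting to character first keeps the labels when combining
year_all <- c(as.character(year1), as.character(year2))

print(codes)
print(year_all)
```

In R 4.1.0 and later, c() applied directly to factors returns a combined factor instead of the codes, so the symptom in the question is mostly seen on older versions.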
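If the 'year' and 'pop' columns alternate, we can subset with c(TRUE, FALSE) to get columns 1, 3, 5, etc., and with c(FALSE, TRUE) to get columns 2, 4, 6, ..., thanks to recycling. Then we unlist the columns and create a new data.frame.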
df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]),
pop=unlist(df1[c(FALSE, TRUE)]))
row.names(df2) <- NULL
head(df2)
# year pop
#1
#2 16XX 4675,0
#3 17XX 4739,3
#4 17XX 4834,0
#5 180X 4930,0
#6 180X 5029,0
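To see why a length-two logical index picks alternating columns, here is a small sketch on a hypothetical four-column frame (df_demo and the helper names are made up for illustration). The logical vector is recycled to the number of columns, so c(TRUE, FALSE) becomes TRUE, FALSE, TRUE, FALSE.

```r
# Hypothetical miniature of the scraped table, all columns as character
df_demo <- data.frame(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"),
  stringsAsFactors = FALSE
)

# Recycling expands the length-two index across all four column names
year_cols <- names(df_demo)[c(TRUE, FALSE)]   # odd positions
pop_cols  <- names(df_demo)[c(FALSE, TRUE)]   # even positions

# Stack each set of columns into one vector, dropping the generated names
long_demo <- data.frame(
  year = unlist(df_demo[year_cols], use.names = FALSE),
  pop  = unlist(df_demo[pop_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(long_demo)
```

This relies on the columns strictly alternating; if the order might vary, selecting by name (for example with grepl("^year", names(df_demo))) would be a safer choice.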
Or another option is
library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'),
sep='var.stubs')[order(.time_1), 3:4, with=FALSE]
data
df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X",
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0",
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0",
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L,
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0",
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"),
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L,
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "",
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2",
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))
Using the new feature in melt from data.table v1.9.5+:
require(data.table) # v1.9.5+
melt(setDT(df), measure = patterns("^year", "^pop"), value.name = c("year", "pop"))
See the data.table vignettes for more detail.
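Another alternative is to use split.default to split the data frame into a list of data frames and then bind them together:
# Group columns by their trailing digit, then give each piece the same names
lst <- lapply(split.default(df1, sub('.*(\\d)', '\\1', names(df1))),
              setNames, c('year','pop'))
do.call(rbind, lst)
which gives the desired result: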
year pop
1.1 16XX 4675,0
1.2 17XX 4739,3
1.3 17XX 4834,0
1.4 180X 4930,0
1.5 180X 5029,0
1.6 181X 5129,0
1.7 181X 5231,9
1.8 182X 5297,0
1.9 182X 5362,0
2.1 1900 6453,0
2.2 1901 6553,5
2.3 1902 6684,0
2.4 1903 6818,0
2.5 1904 6955,0
2.6 1905 7094,0
2.7 1906 7234,7
2.8 1907 7329,0
2.9 1908 7422,0
3.1 1930 9981,2
3.2 1931 10583,5
3.3 1932 8671,0
3.4 1933 9118,0
3.5 1934 9625,0
3.6 1935 8097,0
3.7 1936 7984,7
3.8 1937 8729,0
3.9 1938 10462,0
You could also use rbindlist from the data.table package for the last step:
library(data.table)
rbindlist(lst)
Used data:
df1 <- structure(list(year1 = c("16XX", "17XX", "17XX", "180X", "180X", "181X", "181X", "182X", "182X"),
pop1 = c("4675,0", "4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", "5362,0"),
year2 = c(1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 1906L, 1907L, 1908L),
pop2 = c("6453,0", "6553,5", "6684,0", "6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"),
year3 = c(1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 1937L, 1938L),
pop3 = c("9981,2", "10583,5", "8671,0", "9118,0", "9625,0", "8097,0", "7984,7", "8729,0", "10462,0")),
.Names = c("year1", "pop1", "year2", "pop2", "year3", "pop3"), class = "data.frame", row.names = c(NA, -9L))