How to create a look-up table from a dataframe of changes?
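I would like to create a look-up table from a dataframe of changes. Each row of the original dataframe represents a change in the coding of a given district. The dataset covers the period from 2009 to 2019. Although a district may undergo several changes within that period, I only want each district's coding for 2009 and for 2019, that is, the first and the latest coding.
The dataframe covers hundreds of districts. Some districts have undergone only one change, while others have undergone several. A district can be merged with others or split into several districts.
The ideal look-up table would look like this: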
coding_2009 | coding_2019 |
---|---|
00QR | S12000047 |
00QR | S12000048 |
00RB | S12000047 |
00RB | S12000048 |
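Here `coding_2009` is the district's coding as of 2009, and `coding_2019` is its latest coding as of 2019.
The original dataframe (a subset), where each row shows one change, looks like this: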
past | new | date |
---|---|---|
00QR | S12000015 | 2009-01-01 |
S12000015 | S12000047 | 2018-02-02 |
S12000015 | S12000048 | 2018-02-02 |
00RB | S12000024 | 2009-01-01 |
S12000024 | S12000047 | 2018-02-02 |
S12000024 | S12000048 | 2018-02-02 |
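For each row, `past` is the code that was recoded to `new` as of `date`.
For example, district `00QR` became `S12000015`, which later split into `S12000047` and `S12000048`.
I have been working on this problem for weeks and have tried different ad-hoc versions, but none seems to work consistently. Note that the code needs to take into account that some districts have undergone only one change, while others have undergone two or more. Districts can also split or merge, as shown in the example.
The ideal answer would use `tidyverse`.
For the reprex, I have selected a subset of districts below.
Thank you for your help! It would be much appreciated.
Reprex data:
(You can also go further and use the original dataset, `Changes.csv`. See the link below.)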
# Library tibble (a part of tidyverse) is needed to copy paste reprex data
#install.packages("tibble") # if you need to install it
library(tibble)
data <- tibble::tribble(
~past, ~new, ~date,
"00RJ", "S12000013", "2009-01-01",
"00QR", "S12000015", "2009-01-01",
"00RB", "S12000024", "2009-01-01",
"13UD", "E07000015", "2009-01-01",
"15UH", "E07000025", "2009-01-01",
"00HC", "E06000024", "2009-01-01",
"00KG", "E06000034", "2009-01-01",
"19UD", "E07000049", "2009-01-01",
"19UE", "E07000050", "2009-01-01",
"19UG", "E07000051", "2009-01-01",
"19UH", "E07000052", "2009-01-01",
"19UJ", "E07000053", "2009-01-01",
"E07000017", "E06000049", "2009-04-01",
"E07000025", "E06000053", "2009-04-01",
"E07000014", "E06000049", "2009-04-01",
"E07000015", "E06000049", "2009-04-01",
"S12000013", "S12000013", "2015-06-16",
"S12000013", "S12000013", "2015-11-01",
"S12000015", "S12000047", "2018-02-02",
"S12000024", "S12000047", "2018-02-02",
"S12000015", "S12000048", "2018-02-02",
"S12000024", "S12000048", "2018-02-02",
"E07000049", "E06000059", "2019-04-01",
"E07000050", "E06000059", "2019-04-01",
"E07000053", "E06000059", "2019-04-01",
"E07000051", "E06000059", "2019-04-01",
"E07000052", "E06000059", "2019-04-01"
)
# Convert date to Date (after being copy pasted as tibble)
data$date <- as.Date(data$date)
For anyone interested, this data comes from the UK's Code History Database. You can download the zip from the link below; the relevant file is named `Changes.csv`: https://geoportal.statistics.gov.uk/datasets/code-history-database-december-2019-for-the-united-kingdom
Note that in `Changes.csv`, `past` is named `geogcd_p`, `new` is named `geogcd`, and `date` is named `oper_date`.
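If you start from the full `Changes.csv` rather than the reprex, the three columns have to be mapped to the names used here first. A minimal base R sketch, assuming the file has already been read into a data frame; the helper name `rename_changes` and the small stand-in data frame are hypothetical, and the date format of the real file is not checked:

```r
# Hypothetical helper: map the Changes.csv column names to the names used in this question.
rename_changes <- function(changes) {
  # Mapping taken from the note above: geogcd_p -> past, geogcd -> new, oper_date -> date
  mapping <- c(geogcd_p = "past", geogcd = "new", oper_date = "date")
  missing_cols <- setdiff(names(mapping), names(changes))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ", paste(missing_cols, collapse = ", "))
  }
  # Keep only the three relevant columns, in a fixed order, under their new names
  out <- changes[, names(mapping), drop = FALSE]
  names(out) <- unname(mapping)
  out
}

# Stand-in for something like read.csv("Changes.csv", stringsAsFactors = FALSE);
# the path and read options are assumptions.
raw <- data.frame(
  geogcd_p  = c("00QR", "S12000015"),
  geogcd    = c("S12000015", "S12000047"),
  oper_date = c("2009-01-01", "2018-02-02"),
  other_col = c("x", "y"),
  stringsAsFactors = FALSE
)
changes <- rename_changes(raw)
print(changes)
```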
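You are effectively looking at a flattened tree structure. It is easy to plot with the igraph package:
library(dplyr)
library(igraph)
## `dat` is the reprex tibble called `data` in the question
dat <- data
## build a directed graph with one edge per (past, new) pair
g <- dat %>% select(past, new) %>% graph_from_data_frame()
plot(g)
From here there are many ways to tackle the problem, but it boils down to a choice between a depth-first and a breadth-first approach.
It is reasonable to assume that we have several small graphs: a bunch of different codes that have each undergone a few changes, rather than a select few codes that have undergone many changes.
This suggests a breadth-first approach, which can be implemented by joining the data with itself, hopefully not too many times: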
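## work with data.table for that extra speed; dplyr supplies coalesce()
library(data.table)
library(dplyr)
## `dat` is a data.table copy of the question's `data`
dat <- as.data.table(data)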
## remove duplicate entries of same code
dat <- dat[, .(date=max(date)), by=.(past,new) ]
## these are the roots, all `past` values never present in `new`
roots <- dat[ !past %in% new ]
## likewise, the leaves are those that never appear as `past` , unless they are self referencing.
leaves <- unique( dat[ !new %in% past | new == past, !"past" ], by="new" )
dd <- copy(roots)
## successively add the next step from the source data until only leaves remain.
while( !all( dd$new %in% leaves$new ) ) {
dd <- unique(
merge( dd, dat, by.x="new", by.y="past", all.x=TRUE )[ , .(date.x, past, new=coalesce(new.y,new), date.y=coalesce(date.y,date.x) ) ]
)
}
## final cleanup
dd[ order(past), .(coding_2009=past,coding_2019=new) ]
Output:
> dd[ order(past), .(coding_2009=past,coding_2019=new) ]
coding_2009 coding_2019
1: 00HC E06000024
2: 00KG E06000034
3: 00QR S12000047
4: 00QR S12000048
5: 00RB S12000047
6: 00RB S12000048
7: 00RJ S12000013
8: 13UD E06000049
9: 15UH E06000053
10: 19UD E06000059
11: 19UE E06000059
12: 19UG E06000059
13: 19UH E06000059
14: 19UJ E06000059
15: E07000014 E06000049
16: E07000017 E06000049
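I have only looked at the mini dataset, so I don't know how the code will hold up in the wild, but you can give it a try.
Looking at the plot above, we see that each graph has at most 3 steps from root to leaf, which means the while loop above only needs to run once.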
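For contrast, the depth-first alternative mentioned earlier could look roughly like the base R sketch below. It is only an illustration on a few rows of the reprex, not the approach used in this answer. The function name `final_codes` is made up, and treating a self-reference (`past == new`) as "no further change" is an assumption.

```r
# A few rows from the reprex, including one self-reference
changes <- data.frame(
  past = c("00QR", "S12000015", "S12000015", "00RJ", "S12000013"),
  new  = c("S12000015", "S12000047", "S12000048", "S12000013", "S12000013"),
  stringsAsFactors = FALSE
)

# Follow one code depth-first until it has no outgoing change left.
# `seen` holds the codes already on the current path and guards against cycles.
final_codes <- function(code, changes, seen = character()) {
  if (code %in% seen) return(character())
  nxt <- unique(changes$new[changes$past == code & changes$new != code])
  if (length(nxt) == 0) return(code)  # leaf: nothing replaces this code
  unique(unlist(lapply(nxt, final_codes, changes = changes, seen = c(seen, code))))
}

# Roots are the codes that never appear as `new`
roots <- setdiff(changes$past, changes$new)

# One row per (root, final code) pair
lookup <- do.call(rbind, lapply(roots, function(r) {
  finals <- final_codes(r, changes)
  data.frame(coding_2009 = rep(r, length(finals)), coding_2019 = finals,
             stringsAsFactors = FALSE)
}))
print(lookup)
```

The recursion visits each chain separately, so it tends to suit a few long chains, whereas the repeated self-join above handles many short chains in a handful of vectorised passes.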
Sirius provided an amazing answer using `data.table`. Here, I translate that answer to `tidyverse`:
library(dplyr)
# Remove duplicate entries of same code
data_sub <- data %>%
group_by(past, new) %>%
filter(date == max(date)) %>%
ungroup()
# Create roots: All past values never present in new
roots <- data_sub %>%
filter(!past %in% new)
# Create leaves: Those that never appear as past, unless they self reference
leaves <- data_sub %>%
filter(!new %in% past | new == past) %>%
select(-past) %>%
distinct(new, .keep_all = TRUE)
# Copy before loop
dd <- roots
# Successively add next step from source data until we have arrived at leaves only
while(!all(dd$new %in% leaves$new)) {
# Join
dd_merge <- left_join(dd, data_sub, by = c("new" = "past"))
# Coalesce
dd_sub <- dd_merge %>%
transmute(date.x,
past,
new = coalesce(new.y, new),
date.y = coalesce(date.y, date.x))
# Take unique
dd <- unique(dd_sub)
}
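The loop above leaves `dd` with the helper date columns still attached and has no upper bound on the number of passes. One possible way to finish it is sketched below on a few rows of the reprex. The names `steps`, `new_next` and `max_iter` are illustrative, the cap of 20 passes is an arbitrary assumption, and dropping self-references up front is a simplification that differs slightly from the leaf definition used above.

```r
library(dplyr)

# A few rows from the reprex (dates left out for brevity)
changes <- data.frame(
  past = c("00QR", "00RB", "00RJ", "S12000013",
           "S12000015", "S12000024", "S12000015", "S12000024"),
  new  = c("S12000015", "S12000024", "S12000013", "S12000013",
           "S12000047", "S12000047", "S12000048", "S12000048"),
  stringsAsFactors = FALSE
)

# Assumption: a self-reference (past == new) is not a real change, so drop it
steps <- changes %>%
  filter(past != new) %>%
  distinct(past, new)

# Roots: codes that never appear as `new`
dd <- steps %>% filter(!past %in% new)

max_iter <- 20  # assumed upper bound on chain length, so the loop cannot run forever
for (i in seq_len(max_iter)) {
  # Stop once no current code has a further change
  if (!any(dd$new %in% steps$past)) break
  dd <- dd %>%
    # Renaming the right-hand column avoids relying on .x/.y suffixes
    left_join(rename(steps, new_next = new), by = c("new" = "past")) %>%
    transmute(past, new = coalesce(new_next, new)) %>%
    distinct()
}

# Final cleanup, mirroring the last step of the data.table answer
lookup <- dd %>%
  arrange(past, new) %>%
  select(coding_2009 = past, coding_2019 = new)
print(lookup)
```

On the full dataset, recent dplyr versions may warn about many-to-many matches when several codes merge into one that later splits; that is expected for this kind of data.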