R / dplyr: Joining two tables with wide vs. long format on joinable columns
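I'm working with some public address data that I'd like to join together, but I'm unsure of the best approach and how to implement it, because of the columns I need to join on.
My first table contains all addresses in the country; the combination of postal code + address number is unique. Each address is also associated with a specific neighborhood and area within a county. This table contains no other information.
My second table contains relevant information about every neighborhood, area, and county, such as the number of residences, the age of residents, energy consumption, etc. The idea is simply to merge this information with the full address list so that I can view these statistics for every address in the country.
What's giving me a headache is that the two tables are formatted differently.
The first table is formatted as follows, with every combination of address + postal code being unique (but different addresses can be in the same county, area, or neighborhood):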
adresses <- data.frame("postal_code" = c("1000A", "1010A", "1000B", "1100B", "1500C", "2700C"),
"adress_nr" = c(1, 2, 3, 15, 1, 35),
"neighborhood" = c("A1", "A2", "B1", "B1", "C5", "C7"),
"area" = c("AA1", "AA2", "BB2", "BB1", "CC1", "CC3"),
"county" = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC")
)
The second table is in long format, with a single column containing all unique values for BOTH neighborhoods and areas (per overall county):
neighborhood_area_data <- data.frame(
"county" = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "CCC", "CCC", "CCC"),
"neighborhood_and_area" = c("NEIGH_A1", "AREA_AA1", "AREA_AA2", "NEIGH_A2", "AREA_BB2", "AREA_BB1", "NEIGH_B1", "NEIGH_C5", "NEIGH_C7", "AREA_CC3", "AREA_CC7"),
"type" = c("Neighborhood", "Area", "Area", "Neighborhood", "Area", "Area", "Neighborhood", "Neighborhood", "Neighboordhood", "Area", "Area"),
"Number_of_Residents" = c(10, 50, 40, 30, 100, 70, 80, 60, 70, 70, 20),
"Average_Age" = c(55, 44, 33, 22, 66, 77, 55, 88, 99, 44, 11))
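So for each overall county, you have data for all of its existing areas and neighborhoods. The IDs are stored in a single column, hence the long format. The "NEIGH_" and "AREA_" parts of the string identify whether it is a neighborhood or an area (I remove them from the string to be able to join on them).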
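As a minimal sketch of the prefix stripping described above, base R's `sub()` can drop the leading identifier. The vector `ids` is a hypothetical stand-in for the `neighborhood_and_area` column, not part of the original data:

```r
# Hypothetical stand-in for the neighborhood_and_area column
ids <- c("NEIGH_A1", "AREA_AA1", "AREA_CC7")

# Remove the leading "NEIGH_" or "AREA_" prefix, keeping only the code
codes <- sub("^(NEIGH|AREA)_", "", ids)

# Keep the prefix separately, so the row type is not lost
prefix <- sub("_.*$", "", ids)

print(codes)
print(prefix)
```

Keeping the prefix in its own vector (or column) matters here because a neighborhood code and an area code are only distinguishable by that prefix once it is removed from the ID.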
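In my example, the data of interest are the Number_of_Residents and Average_Age columns, which I'd like to join onto the individual addresses table.
What I'm looking for is a solid approach to combine these tables (preferably via dplyr).
My initial approach was to take the second table and split neighborhood_and_area into separate columns (neighborhood and area), while also removing the identifier part of the string (e.g., "NEIGH_AA1" -> "AA1"). However, because nothing is summarized or pivoted, the second table keeps its original format and cannot be joined properly. I'm not sure what the best/most elegant way to reconcile these two formats is.
I hope my question and example are clear! Thanks!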
Assuming you want to keep both the neighborhood and area data:
library(tidyverse)

# Area rows: split the ID on '_' and discard the prefix
# (the NA in `into` drops the first piece, leaving only the code)
area_data <-
  neighborhood_area_data %>%
  separate(neighborhood_and_area, into = c(NA, 'code'), sep = '_') %>%
  filter(grepl('Area', type)) %>%
  rename(Area_Number_of_Residents = Number_of_Residents,
         Area_Average_Age = Average_Age) %>%
  select(-type)

# Neighborhood rows: every row whose type does not contain 'Area'
# (this also catches the misspelled "Neighboordhood" entry in the sample data)
neighborhood_data <-
  neighborhood_area_data %>%
  separate(neighborhood_and_area, into = c(NA, 'code'), sep = '_') %>%
  filter(!grepl('Area', type)) %>%
  rename(Neighborhood_Number_of_Residents = Number_of_Residents,
         Neighborhood_Average_Age = Average_Age) %>%
  select(-type)
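Before joining, it may be worth checking that each `county` + `code` pair occurs only once in a lookup table, since a duplicated key would duplicate address rows in a `left_join()`. A minimal sketch, assuming a small hypothetical table `lookup` shaped like `area_data`:

```r
library(dplyr)

# Hypothetical lookup table with the same key columns as area_data
lookup <- data.frame(
  county = c("AAA", "AAA", "BBB"),
  code   = c("AA1", "AA2", "BB1"),
  Area_Number_of_Residents = c(50, 40, 70)
)

# Count rows per key; any n > 1 means the join would multiply address rows
duplicated_keys <- lookup %>%
  count(county, code) %>%
  filter(n > 1)

# Zero rows here means every key is unique
print(nrow(duplicated_keys))
```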
You can then join each split dataset:
adresses %>%
  left_join(area_data,
            by = c('county', 'area' = 'code')) %>%
  left_join(neighborhood_data,
            by = c('county', 'neighborhood' = 'code'))
Output:
  postal_code adress_nr neighborhood area county Area_Number_of_Residents Area_Average_Age Neighborhood_Number_of_Residents Neighborhood_Average_Age
1       1000A         1           A1  AA1    AAA                       50               44                               10                       55
2       1010A         2           A2  AA2    AAA                       40               33                               30                       22
3       1000B         3           B1  BB2    BBB                      100               66                               80                       55
4       1100B        15           B1  BB1    BBB                       70               77                               80                       55
5       1500C         1           C5  CC1    CCC                       NA               NA                               NA                       NA
6       2700C        35           C7  CC3    CCC                       70               44                               70                       99
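Row 5 comes back as `NA` because its keys have no match in the sample lookup data: there is no `AREA_CC1` entry, and `NEIGH_C5` is listed under county `BBB` rather than `CCC`. A minimal sketch of how such unmatched keys could be listed with `anti_join()`, using small hypothetical tables (`addr`, `lookup`) rather than the full data:

```r
library(dplyr)

# Hypothetical miniature versions of the address and area lookup tables
addr <- data.frame(
  postal_code = c("1500C", "2700C"),
  area        = c("CC1", "CC3"),
  county      = c("CCC", "CCC")
)
lookup <- data.frame(
  county = c("CCC", "CCC"),
  code   = c("CC3", "CC7")
)

# anti_join() keeps the rows of addr that have no matching key in lookup
unmatched <- addr %>%
  anti_join(lookup, by = c("county", "area" = "code"))

print(unmatched)
```

Running the same check against the real tables would show which addresses will end up without statistics before the joins are done.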