R: Mission impossible? How to assign "New York" to a county
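I'm running into a problem assigning a county to some cities. When querying via the acs package
> geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
you can see that for "New York", for example, there is a whole list of counties. The same goes for Los Angeles, Portland, Oklahoma, Columbus, etc. How can such data be assigned to a single "county"?
The following code is currently used to match "county.name" with the corresponding county FIPS code. Unfortunately, it only works when the query returns exactly one county name.
Script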
library(tigris)
library(acs)
library(dplyr)  # provides bind_rows(), left_join() and %>%
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat
data(fips_codes) # FIPS codes with state, code, county information
# Look up each "place, state" pair and keep only the second row of the result
GeoLookup <- lapply(dat, function(x) {
  geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2,]
})
df <- bind_rows(GeoLookup)
#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")
# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).
df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df
Returns:
state state.name county.name place place.name state.abb statefips countyfips
1 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city <NA> <NA> <NA>
2 25 Massachusetts Suffolk County 7000 Boston city MA 25 025
3 6 California Los Angeles County 20802 East Los Angeles CDP CA 06 037
4 48 Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000 Dallas city <NA> <NA> <NA>
5 6 California San Mateo County 20956 East Palo Alto city CA 06 081
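To retain the data, the left_join would ideally match as "look for the county.name that contains the place.name (without the "xy city" suffix attached to the name), or otherwise pick the first item by default". It would be great to see how this could be done.
In general: I assume there is no better way than this approach?
Thanks for your help!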
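How about something like the code below, which creates a "long" data frame for joining? We use the tidyverse pipe operator to chain operations. strsplit returns a list, and unnest stacks the list values (the county names corresponding to each combination of state.name and place.name) into a long data frame in which each county.name has its own row.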
library(tigris)
library(acs)
library(tidyverse)
dat = geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
dat = dat %>%
  group_by(state.name, place.name) %>%
  mutate(county.name = strsplit(county.name, ", ")) %>%  # one list element per place
  unnest(cols = county.name)                              # one row per county
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York NA <NA> <NA>
2 36 New York 51000 New York city Bronx County
3 36 New York 51000 New York city Kings County
4 36 New York 51000 New York city New York County
5 36 New York 51000 New York city Queens County
6 36 New York 51000 New York city Richmond County
7 36 New York 51011 New York Mills village Oneida County
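The long format is what makes the join from the question work, since each row now carries exactly one county name. Here is a minimal sketch of that last step, assuming hand-typed stand-ins for both tables: `lookup` mimics the geo.lookup() result above (no call to the acs package is made), and `fips` is a small excerpt playing the role of the renamed fips_codes table. Both names are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Stand-in for the geo.lookup() output (assumption: same column names)
lookup <- data.frame(
  state       = 36,
  state.name  = "New York",
  county.name = c("Bronx County, Kings County, New York County, Queens County, Richmond County",
                  "Oneida County"),
  place       = c(51000, 51011),
  place.name  = c("New York city", "New York Mills village"),
  stringsAsFactors = FALSE
)

# Stand-in for the renamed FIPS table (assumption: only the columns needed here)
fips <- data.frame(
  state.name  = "New York",
  county.name = c("Bronx County", "Kings County", "New York County",
                  "Queens County", "Richmond County", "Oneida County"),
  countyfips  = c("005", "047", "061", "081", "085", "065"),
  stringsAsFactors = FALSE
)

# Split the comma-separated string, then give every county its own row
long <- lookup %>%
  mutate(county.name = strsplit(county.name, ", ")) %>%
  unnest(cols = county.name)

# Every row now has a single county name, so the join finds a match
joined <- long %>% left_join(fips, by = c("state.name", "county.name"))
joined
```

With the real tables, the same left_join call from the question should apply once county.name has been unnested.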
UPDATE: Regarding the second question in your comment, assuming you already have a vector of metro areas, how about this:
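dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
# For each "place, state" pair: look it up, drop the first (all-NA) row,
# then expand the comma-separated counties into one row per county
df <- map_df(strsplit(dat, ", "), function(x) {
  geo.lookup(state = x[2], place = x[1])[-1, ] %>%
    group_by(state.name, place.name) %>%
    mutate(county.name = strsplit(county.name, ", ")) %>%
    unnest(cols = county.name)
})
df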
state state.name place place.name county.name
1 36 New York 51000 New York city Bronx County
2 36 New York 51000 New York city Kings County
3 36 New York 51000 New York city New York County
4 36 New York 51000 New York city Queens County
5 36 New York 51000 New York city Richmond County
6 36 New York 51011 New York Mills village Oneida County
7 25 Massachusetts 7000 Boston city Suffolk County
8 25 Massachusetts 7000 Boston city Suffolk County
9 6 California 20802 East Los Angeles CDP Los Angeles County
10 6 California 39612 Lake Los Angeles CDP Los Angeles County
11 6 California 44000 Los Angeles city Los Angeles County
12 48 Texas 19000 Dallas city Collin County
13 48 Texas 19000 Dallas city Dallas County
14 48 Texas 19000 Dallas city Denton County
15 48 Texas 19000 Dallas city Kaufman County
16 48 Texas 19000 Dallas city Rockwall County
17 48 Texas 40516 Lake Dallas city Denton County
18 6 California 20956 East Palo Alto city San Mateo County
19 6 California 55282 Palo Alto city Santa Clara County
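UPDATE 2: If I understand your comments, for cities (actually place names in the example) with more than one county, we want only the county that has the same name as the city (for example, New York County in the case of New York City), or otherwise the first county in the list. The following code selects a county with the same name as the city or, if there isn't one, the first county for that city. You might need to tweak it a bit to make it work for the entire U.S. For example, to make it work for Louisiana, you might need gsub(" County| Parish"... instead of gsub(" County"....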
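map_df(strsplit(dat, ", "), function(x) {
  geo.lookup(state = x[2], place = x[1])[-1, ] %>%
    group_by(state.name, place.name) %>%
    mutate(county.name = strsplit(county.name, ", ")) %>%
    unnest(cols = county.name) %>%
    # Keep the county whose name contains the place name (minus its last word,
    # e.g. "city"); fall back to row 1. place.name[1] gives grepl a single pattern.
    slice(max(1, which(grepl(sub(" [A-Za-z]*$","", place.name[1]), gsub(" County", "", county.name))), na.rm=TRUE))
})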
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York 51000 New York city New York County
2 36 New York 51011 New York Mills village Oneida County
3 25 Massachusetts 7000 Boston city Suffolk County
4 6 California 20802 East Los Angeles CDP Los Angeles County
5 6 California 39612 Lake Los Angeles CDP Los Angeles County
6 6 California 44000 Los Angeles city Los Angeles County
7 48 Texas 19000 Dallas city Dallas County
8 48 Texas 40516 Lake Dallas city Denton County
9 6 California 20956 East Palo Alto city San Mateo County
10 6 California 55282 Palo Alto city Santa Clara County
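The selection rule can also be pulled out into a small helper, which makes the " County| Parish" tweak easier to apply. This is only a sketch under stated assumptions: `pick_county` is a hypothetical name, the input is a hand-typed excerpt of the long data frame shown earlier, and the match is an exact comparison of the stripped names rather than the grepl "contains" test used above.

```r
library(dplyr)

# Hypothetical helper: index of the county to keep within one place
pick_county <- function(place.name, county.name) {
  place_stem  <- sub(" [A-Za-z]*$", "", place.name[1])       # drop "city", "village", "CDP"
  county_stem <- sub(" (County|Parish)$", "", county.name)   # drop the county-type suffix
  hit <- which(county_stem == place_stem)
  if (length(hit) > 0) hit[1] else 1L                        # same-named county, else the first
}

# Hand-typed excerpt of the long data frame (assumption: same column names)
long <- data.frame(
  state.name  = c(rep("New York", 6), rep("Texas", 5)),
  place.name  = c(rep("New York city", 5), "New York Mills village",
                  rep("Dallas city", 5)),
  county.name = c("Bronx County", "Kings County", "New York County",
                  "Queens County", "Richmond County", "Oneida County",
                  "Collin County", "Dallas County", "Denton County",
                  "Kaufman County", "Rockwall County"),
  stringsAsFactors = FALSE
)

# One row per place: the same-named county if there is one, else the first
result <- long %>%
  group_by(state.name, place.name) %>%
  slice(pick_county(place.name, county.name)) %>%
  ungroup()
result
```

Other county equivalents (boroughs, census areas, independent cities) would need their own suffixes added to the pattern.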
Could you prepare the data with something like the code below?
library(acs)      # geo.lookup()
library(stringr)  # str_split(), str_trim()
new_york_data <- geo.lookup(state = "NY", place = "New York")
# Expand every row of the lookup result into one row per county
prep_data <- function(full_data){
  output <- data.frame()
  for(row in seq_len(nrow(full_data))){
    new_rows <- replicateCounty(full_data[row, ])
    output <- plyr::rbind.fill(output, new_rows)
  }
  return(output)
}
# Split the comma-separated county.name of a single row and repeat
# the other columns once per county
replicateCounty <- function(row){
  counties <- str_trim(unlist(str_split(row$county.name, ",")))
  output <- data.frame(state = row$state,
                       state.name = row$state.name,
                       county.name = counties,
                       place = row$place,
                       place.name = row$place.name)
  return(output)
}
prep_data(new_york_data)
It's a bit messy, and you'll need the plyr and stringr packages. Once the data is prepared, you should be able to join on it.
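For comparison, the same preparation can be done without a loop or extra packages by repeating each row once per county. This is a rough base R sketch, assuming a hand-typed stand-in for the Dallas lookup result (`lookup`) and a small excerpt standing in for the renamed FIPS table (`fips`); both names are made up for this illustration.

```r
# Stand-in for the geo.lookup() result for Dallas, TX (assumption: same column names)
lookup <- data.frame(
  state       = 48,
  state.name  = "Texas",
  county.name = c("Collin County, Dallas County, Denton County, Kaufman County, Rockwall County",
                  "Denton County"),
  place       = c(19000, 40516),
  place.name  = c("Dallas city", "Lake Dallas city"),
  stringsAsFactors = FALSE
)

# Stand-in for the renamed FIPS table (assumption: only the columns needed here)
fips <- data.frame(
  state.name  = "Texas",
  county.name = c("Collin County", "Dallas County", "Denton County",
                  "Kaufman County", "Rockwall County"),
  countyfips  = c("085", "113", "121", "257", "397"),
  stringsAsFactors = FALSE
)

# Split on commas, repeat each row once per county, then overwrite county.name
counties <- strsplit(lookup$county.name, ",\\s*")
long <- lookup[rep(seq_len(nrow(lookup)), lengths(counties)), ]
long$county.name <- unlist(counties)
rownames(long) <- NULL

# Left join against the FIPS table
joined <- merge(long, fips, by = c("state.name", "county.name"), all.x = TRUE)
joined
```

Note that merge() sorts the result by the join keys, so the row order differs from the input.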