Adding city names and geolocation data to dataframe
I have a dataset with more than 20,000 observations that basically looks like this:
df <- data.frame(
user = c("ABC", "DEF", "GHI"),
location = c("Chicago, the windy city", "Oxford University", "Paris")
)
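I would like to add three more columns, city, long and lat, and fill them with the city name and its geolocation (longitude and latitude).
For this I would like to use the maps package and its world.cities database: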
library(maps)
data(world.cities)
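If the city names in location were written cleanly, adding the city name and geolocation would not be difficult. However, most of them carry extra text (e.g. "Chicago, the windy city").
How can I extract the city names based on the world.cities database, write the real city name into the column city, and write the geolocation into long and lat?
As @Heroka mentioned in the comments, if the city name is always the first string in location, you can extract the first word with stringi, left_join the world.cities data, and keep the match with the largest population.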
library(stringi)
library(dplyr)
df %>%
mutate(city = stri_extract_first_words(location)) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
Which gives:
#Source: local data frame [3 x 8]
#Groups: city [3]
#
# user location city country.etc pop lat long capital
# (fctr) (fctr) (chr) (chr) (int) (dbl) (dbl) (int)
#1 ABC Chicago, the windy city Chicago USA 2830144 41.84 -87.68 0
#2 DEF Oxford University Oxford UK 157568 51.76 -1.26 0
#3 GHI Paris Paris France 2141839 48.86 2.34 1
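The question asks for only three extra columns (city, long, lat), so the joined result can be trimmed afterwards. The following is a minimal sketch of that step, not part of the original answer. It assumes a small stand-in table named cities instead of maps::world.cities (values copied from the output above, plus one placeholder duplicate "Paris" row to exercise the population tie-break), uses base R sub() in place of stri_extract_first_words, and groups by user rather than city so that several users in the same city would each keep their own row.

```r
library(dplyr)

# Hypothetical stand-in for maps::world.cities. The second "Paris" row is a
# placeholder (not real data) that only exists to exercise the tie-break.
cities <- data.frame(
  name        = c("Chicago", "Oxford", "Paris", "Paris"),
  country.etc = c("USA", "UK", "France", "Placeholder"),
  pop         = c(2830144, 157568, 2141839, 1),
  lat         = c(41.84, 51.76, 48.86, 0),
  long        = c(-87.68, -1.26, 2.34, 0),
  stringsAsFactors = FALSE
)

df <- data.frame(
  user = c("ABC", "DEF", "GHI"),
  location = c("Chicago, the windy city", "Oxford University", "Paris"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # first run of word characters in location
  mutate(city = sub("^\\W*(\\w+).*$", "\\1", location)) %>%
  left_join(cities, by = c("city" = "name")) %>%
  # one row per user: keep the candidate with the largest population
  group_by(user) %>%
  filter(row_number(desc(pop)) == 1) %>%
  ungroup() %>%
  # keep only the columns the question asks for
  select(user, location, city, long, lat)

print(result)
```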
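Update
If the city name is not always the first string in location, you can first try to match the words in location against a dictionary (here the name column of world.cities) and then use the word for which the match returns TRUE as your location name. Here is a quick implementation (I added a "University College London" case to your data.frame):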
> df
# user location
#1 ABC Chicago, the windy city
#2 DEF Oxford University
#3 GHI Paris
#4 JKL University College London
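For each row, we extract all words in location and store them in the list lst, loop over it to find the position of the word that matches a name in world.cities and store it in p, and finally extract the element of lst at position p and store it in city: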
df %>%
mutate(lst = stri_extract_all_words(location),
p = sapply(lst, function (x) which(x %in% world.cities$name), simplify=TRUE)) %>%
mutate(city = sapply(1:length(lst), function(x) .$lst[[x]][.$p[x]])) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
You can also remove the temporary columns p and lst by adding ... %>% select(-lst, -p).
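The per-row lookup described above assumes exactly one word of location matches a city name. As a hedged illustration, the same idea can be written as a small base R helper that returns NA when nothing matches and the first hit when several words match. The helper name first_city_word and the short city_names vector are hypothetical. In practice city_names would be world.cities$name.

```r
# Return the first word of `location` that appears in `city_names`,
# or NA if no word matches. Hypothetical helper, for illustration only.
first_city_word <- function(location, city_names) {
  # split on anything that is not a letter or digit
  words <- strsplit(location, "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  hits <- words[words %in% city_names]
  if (length(hits) == 0) NA_character_ else hits[1]
}

# Small stand-in for world.cities$name
city_names <- c("Chicago", "Oxford", "Paris", "London")

locations <- c("Chicago, the windy city", "Oxford University",
               "Paris", "University College London")
city <- vapply(locations, first_city_word, character(1),
               city_names = city_names, USE.NAMES = FALSE)
print(city)
```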
Update 2
This should not break on malformed words, but it does not work for the "New York" case:
df %>%
mutate(
city = lapply(stri_extract_all_words(location),
function (x) { world.cities$name[match(x, world.cities$name)] })) %>%
tidyr::unnest(city) %>%
filter(!is.na(city)) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
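To see why word-by-word matching misses "New York", it helps to look at what the word split produces. The sketch below uses a hypothetical three-name subset in place of world.cities$name. Neither "New" nor "York" equals "New York", so the only word that can match is the shorter name "York".

```r
# Hypothetical subset standing in for world.cities$name
city_names <- c("New York", "York", "London")

# Word-level split, comparable to what stri_extract_all_words yields here
words <- strsplit("New York", "[^[:alnum:]]+")[[1]]

# Only single words are compared, so the two-word name is never tested
matched <- words[words %in% city_names]
print(words)
print(matched)
```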
Update 3
This should work for all cases:
> df
# user location
#1 ABC Chicago, the windy city
#2 DEF Oxford University
#3 GHI Paris
#4 JKL New York
#5 MNO m0ntr3al
#6 PQR University College London
df$l <- gsub("[^[:alnum:]]+", " ", df$location)
lst <- lapply(world.cities$name, function (x) { grep(x, df$l, value = TRUE) })
m <- reshape2::melt(lst)  # list method: column "value" holds the matched location, "L1" the index into world.cities
df %>%
left_join(m, by = c("l" = "value")) %>%
left_join(world.cities %>%
add_rownames %>%
mutate(rowname = as.numeric(rowname)),
by = c("L1" = "rowname")) %>%
tidyr::replace_na(list(pop = 0)) %>%
group_by(location) %>%
filter(row_number(desc(pop)) == 1) %>%
select(-(l:L1))
Which gives:
#Source: local data frame [6 x 8]
#Groups: location [6]
#
# user location name country.etc pop lat long capital
# (fctr) (fctr) (chr) (chr) (dbl) (dbl) (dbl) (int)
#1 ABC Chicago, the windy city Chicago USA 2830144 41.84 -87.68 0
#2 DEF Oxford University Oxford UK 157568 51.76 -1.26 0
#3 GHI Paris Paris France 2141839 48.86 2.34 1
#4 JKL New York New York USA 8124427 40.67 -73.94 0
#5 MNO m0ntr3al NA NA 0 NA NA NA
#6 PQR University College London London UK 7489022 51.52 -0.10 1
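Because grep(x, df$l) searches for each city name anywhere inside the cleaned location, "New York" is found by both "New York" and "York", and the population ranking is what picks the intended one. Below is a minimal base R sketch of that idea for a single location. The helper match_city and the small cities table are hypothetical. The population for "York" is a placeholder rather than the real figure, and fixed = TRUE is a choice made here so that each name is treated as a literal string rather than a regular expression.

```r
# Hypothetical stand-in for maps::world.cities; the "York" population is a
# placeholder, the other values are copied from the output above.
cities <- data.frame(
  name = c("New York", "York", "London", "Paris"),
  pop  = c(8124427, 1, 7489022, 2141839),
  stringsAsFactors = FALSE
)

# Return the most populous city whose name occurs in `location`, or NA.
match_city <- function(location, cities) {
  # replace runs of punctuation with a single space, as in the code above
  clean <- gsub("[^[:alnum:]]+", " ", location)
  # literal substring test of every city name against the cleaned location
  hit <- vapply(cities$name,
                function(nm) grepl(nm, clean, fixed = TRUE),
                logical(1))
  if (!any(hit)) return(NA_character_)
  cand <- cities[hit, ]
  cand$name[which.max(cand$pop)]
}

print(match_city("New York", cities))
print(match_city("University College London", cities))
print(match_city("m0ntr3al", cities))
```

Substring matching also means a short city name can match inside a longer unrelated word, so the result is still a best guess rather than a guaranteed match.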