Adding city names and geolocation data to dataframe

I have a dataset with more than 20,000 observations that basically looks like this:

df <- data.frame(
    user = c("ABC", "DEF", "GHI"),
    location = c("Chicago, the windy city", "Oxford University", "Paris")
)

I want to add three more columns, city, long and lat, and fill them with the city name and the geolocation (longitude and latitude).

For this I want to use the maps package and its world.cities database:

library(maps)
data(world.cities)

If the city names in location were written cleanly, adding the city name and geolocation would not be difficult. However, most of them contain extra strings (e.g. "Chicago, the windy city").

How can I extract the city names based on the world.cities database, write the real city name into the column city, and write the geolocation into long and lat?

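As @Heroka mentioned in the comments, if the city name is always the first string in location, you can extract the first word using stringi, left_join the world.cities data, and filter for the match with the largest population.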

library(stringi)
library(dplyr)

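df %>%
  # take the first word of each location as the candidate city name
  mutate(city = stri_extract_first_words(location)) %>%
  # attach every world.cities row that has this name
  left_join(world.cities, by = c("city" = "name")) %>%
  group_by(city) %>%
  # several cities can share a name; keep the most populous one
  filter(row_number(desc(pop)) == 1)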

Which gives:

#Source: local data frame [3 x 8]
#Groups: city [3]
#
#    user                location    city country.etc     pop   lat   long capital
#  (fctr)                  (fctr)   (chr)       (chr)   (int) (dbl)  (dbl)   (int)
#1    ABC Chicago, the windy city Chicago         USA 2830144 41.84 -87.68       0
#2    DEF       Oxford University  Oxford          UK  157568 51.76  -1.26       0
#3    GHI                   Paris   Paris      France 2141839 48.86   2.34       1
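
To see what the first step does in isolation, here is a minimal sketch that only needs stringi. The vector name location simply mirrors the column in the question; nothing here touches world.cities.

```r
library(stringi)

# Same strings as the location column in the question
location <- c("Chicago, the windy city", "Oxford University", "Paris")

# Take the first word of each string; the comma after "Chicago" is not part of the word
first_word <- stri_extract_first_words(location)
print(first_word)
```

The join and the population filter then work on these extracted words rather than on the raw location strings.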

Update

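If the city name is not always the first string in location, you can first try to match the words in location against a dictionary (here, the name column of world.cities) and then use the word for which the match returns TRUE as your location name. Here is a quick implementation (I added a "University College London" case to your data.frame):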

> df
#  user                  location
#1  ABC   Chicago, the windy city
#2  DEF         Oxford University
#3  GHI                     Paris
#4  JKL University College London

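For each row, we extract all the words in location and store them in a list lst, loop over it to find the position of the word that matches a name in world.cities and store it in p, and finally extract the element of lst at position p and store it in city.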

df %>%
  mutate(lst = stri_extract_all_words(location),
         p = sapply(lst, function (x) which(x %in% world.cities$name), simplify=TRUE)) %>%
  mutate(city = sapply(1:length(lst), function(x) .$lst[[x]][.$p[x]])) %>%
  left_join(world.cities, by = c("city" = "name")) %>%
  group_by(city) %>%
  filter(row_number(desc(pop)) == 1) 

You can also remove the temporary columns p and lst by adding ... %>% select(-lst, -p)
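
The per-row dictionary lookup can also be illustrated outside the dplyr pipeline. The sketch below is a simplification under stated assumptions: dict is a tiny hypothetical stand-in for world.cities$name, and first_match is a helper name introduced only for this illustration.

```r
library(stringi)

# Hypothetical stand-in for world.cities$name, kept tiny so the example is self-contained
dict <- c("Chicago", "Oxford", "Paris", "London")

location <- c("Chicago, the windy city", "Oxford University",
              "Paris", "University College London")

# Split each location into words; the result is a list with one character vector per row
lst <- stri_extract_all_words(location)

# Illustrative helper: return the first word found in the dictionary, or NA if there is none
first_match <- function(words, dictionary) {
  hits <- words[words %in% dictionary]
  if (length(hits) == 0) NA_character_ else hits[1]
}

city <- vapply(lst, first_match, character(1), dictionary = dict)
print(city)
```

Returning exactly one value per row (the first hit, or NA) is a deliberate choice in this sketch: the sapply/which approach above can return zero or several positions for a row, which makes the later indexing fragile.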

Update 2

This should not break on malformed words, but it does not work for the "New York" case:

df %>%
  mutate(
    city = lapply(stri_extract_all_words(location), 
                  function (x) { world.cities$name[match(x, world.cities$name)] })) %>%
  tidyr::unnest(city) %>%
  filter(!is.na(city)) %>%
  left_join(world.cities, by = c("city" = "name")) %>%
  group_by(city) %>%
  filter(row_number(desc(pop)) == 1)
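
Why "New York" fails here can be shown with a small sketch. The two-entry dict is a hypothetical stand-in for world.cities$name, chosen only to make the effect visible.

```r
library(stringi)

# Hypothetical two-entry dictionary standing in for world.cities$name
dict <- c("New York", "York")

# Word extraction splits the two-word name into separate words
words <- stri_extract_all_words("New York")[[1]]
print(words)

# match() compares each single word against the dictionary,
# so the two-word entry "New York" can never be hit
matched <- dict[match(words, dict)]
print(matched)
```

Because the lookup is done word by word, only single-word names can match, and in this sketch "York" is picked up instead of "New York".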

Update 3

This should work for all cases:

> df
#  user                  location
#1  ABC   Chicago, the windy city
#2  DEF         Oxford University
#3  GHI                     Paris
#4  JKL                  New York
#5  MNO                  m0ntr3al
#6  PQR University College London

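# replace every run of non-alphanumeric characters with a single space
df$l <- gsub("[^[:alnum:]]+", " ", df$location)
# for each city name, collect the cleaned locations that contain it
lst  <- lapply(world.cities$name, function (x) { grep(x, df$l, value = TRUE) })
# long format: value = matched location, L1 = index of the city in world.cities
m    <- reshape2::melt(lst)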

df %>% 
  left_join(m, by = c("l" = "value")) %>%
  left_join(world.cities %>% 
              add_rownames %>% 
              mutate(rowname = as.numeric(rowname)), 
            by = c("L1" = "rowname")) %>% 
  tidyr::replace_na(list(pop = 0)) %>%
  group_by(location) %>%
  filter(row_number(desc(pop)) == 1) %>%
  select(-(l:L1))

Which gives:

#Source: local data frame [6 x 8]
#Groups: location [6]
#
#    user                  location     name country.etc     pop   lat   long capital
#  (fctr)                    (fctr)    (chr)       (chr)   (dbl) (dbl)  (dbl)   (int)
#1    ABC   Chicago, the windy city  Chicago         USA 2830144 41.84 -87.68       0
#2    DEF         Oxford University   Oxford          UK  157568 51.76  -1.26       0
#3    GHI                     Paris    Paris      France 2141839 48.86   2.34       1
#4    JKL                  New York New York         USA 8124427 40.67 -73.94       0
#5    MNO                  m0ntr3al       NA          NA       0    NA     NA      NA
#6    PQR University College London   London          UK 7489022 51.52  -0.10       1
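
The idea behind this last approach, searching for whole city names inside the cleaned location strings instead of matching word by word, can be sketched in base R. Assumptions: dict is a small hypothetical stand-in for world.cities$name, and fixed = TRUE is added here so each name is treated as a literal string (the code above uses grep with its default regular-expression matching).

```r
# Hypothetical stand-in for world.cities$name
dict <- c("Chicago", "Oxford", "Paris", "New York", "York", "London")

location <- c("Chicago, the windy city", "New York", "m0ntr3al",
              "University College London")

# Collapse every run of non-alphanumeric characters into a single space
cleaned <- gsub("[^[:alnum:]]+", " ", location)

# For each dictionary name, list the cleaned locations that contain it;
# fixed = TRUE treats the name as a literal string, not a regular expression
hits <- lapply(dict, function(nm) grep(nm, cleaned, fixed = TRUE, value = TRUE))
names(hits) <- dict

# Names that matched at least one location
found <- names(hits)[lengths(hits) > 0]
print(found)
```

Both "New York" and "York" match the location "New York" in this sketch, which is why a tie-break such as the population filter is still needed, while "m0ntr3al" matches nothing and ends up with NA values.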