How to group values in a column (R)

Question

I am creating a summary table that groups my records by destination country/region:

SummarybyLocation <- PSTNRecords %>%
  group_by(Destination) %>%
  summarize(
    Calls = n(),
    Minutes = sum(durationMinutes),
    MaxDuration = max(durationMinutes),
    AverageDuration = mean(durationMinutes),
    Charges = sum(charge),
    Fees = sum(connectionCharge)
  )

SummarybyLocation

The resulting table looks like this:

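I realized that the Destination values are inconsistent (for example, "France" and "FR" both refer to the same region), and I also have a "North America" value, which I assume aggregates the United States and Canada.

I would like to know whether there is a way to create custom groups for these values so that the aggregation makes more sense. I tried using the countrycode package to add an iso2c column, but that does not solve the problem of handling other regional aggregates such as "North America".

I would really appreciate some advice on how to approach this.

Thanks in advance!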

Answer 1

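Here is one possibility for cleaning up the data, shown with a very small example. First, I get a list of country names along with their 2- and 3-letter abbreviations and put it in a data frame, countries. Then I left_join countries to df on the 2-letter code, which matches FR in this example. I repeat the left_join with the 3-letter code, which has no matches in this case. Next, I coalesce the two new columns, Country.x and Country.y, into a single Country column. Then I use case_when to handle the multiple if-else conditions. First, if Country is not NA, I replace Destination with the full country name. If you have other items (e.g., Europe) that also need fixing, you can add further conditions here. Next, I replace North America with "United States-Canada-Mexico". Finally, I drop the columns that start with "Country".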

library(XML)
library(RCurl)
library(rlist)
library(tidyverse)

theurl <-
  getURL("https://www.iban.com/country-codes",
         .opts = list(ssl.verifypeer = FALSE))
countries <- readHTMLTable(theurl)
countries <-
  list.clean(countries, fun = is.null, recursive = FALSE)[[1]]


df %>%
  left_join(.,
            countries %>% select(Country, `Alpha-2 code`),
            by = c("Destination" = "Alpha-2 code")) %>%
  left_join(.,
            countries %>% select(Country, `Alpha-3 code`),
            by = c("Destination" = "Alpha-3 code")) %>%
  mutate(
    Country = coalesce(Country.x, Country.y),
    Destination = case_when(
      !is.na(Country) ~ Country,
      Destination == "North America" ~ "United States-Canada-Mexico",
      TRUE ~ Destination
    )
  ) %>%
  select(-c(starts_with("Country")))

Output

                  Destination durationMinutes charge connectionCharge
1                      France            6.57   0.00                0
2                      France            3.34   1.94                0
3               United States          234.40   3.00                0
4 United States-Canada-Mexico           23.40   2.00                0

However, if you have a lot of different variants, you may just want to create a simple data frame of replacements, so that a single left_join is enough.
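The replacement-table idea could be sketched roughly as follows. The `replacements` data frame and its `Clean` column are hypothetical names chosen for illustration, and the two rows in it are only the variants that appear in the example data; a real table would list every variant you find in your records.

```r
library(dplyr)

# Example records, same values as the `df` in the Data section of this answer
df <- data.frame(
  Destination = c("FR", "France", "United States", "North America"),
  durationMinutes = c(6.57, 3.34, 234.4, 23.4),
  charge = c(0, 1.94, 3, 2),
  connectionCharge = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per variant that needs replacing
replacements <- data.frame(
  Destination = c("FR", "North America"),
  Clean = c("France", "United States-Canada-Mexico"),
  stringsAsFactors = FALSE
)

cleaned <- df %>%
  left_join(replacements, by = "Destination") %>%
  # Use the replacement where one exists, otherwise keep the original value
  mutate(Destination = coalesce(Clean, Destination)) %>%
  select(-Clean)

cleaned
```

Values without an entry in the lookup table get NA in `Clean` after the join, so `coalesce` leaves them unchanged.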

Another option is to also add a continent column, which you can get from countrycode.

library(countrycode)

countrycode(sourcevar = df$Destination,
            origin = "country.name",
            destination = "continent")

[1] NA         "Europe"   "Americas" NA
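The two NA values above come from "FR" and "North America", which are not recognized as country names. One way to fill them, sketched here under the assumption that you are happy to assign those two values to "Europe" and "Americas" yourself, is the `custom_match` argument of `countrycode()`. The `continent_matches` vector is a name made up for this example.

```r
library(countrycode)

destinations <- c("FR", "France", "United States", "North America")

# Assumed manual assignments for the values countrycode cannot parse
continent_matches <- c("FR" = "Europe", "North America" = "Americas")

continents <- countrycode(
  sourcevar = destinations,
  origin = "country.name",
  destination = "continent",
  custom_match = continent_matches
)

continents
```

The resulting vector can be added to the data frame with `mutate()` and used as the grouping column in place of Destination.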

Data

df <- structure(list(Destination = c("FR", "France", "United States", 
"North America"), durationMinutes = c(6.57, 3.34, 234.4, 23.4
), charge = c(0, 1.94, 3, 2), connectionCharge = c(0, 0, 0, 0
)), class = "data.frame", row.names = c(NA, -4L))

Answer 2

The {countrycode} package can easily handle custom names/codes...

library(tidyverse)
library(countrycode)

PSTNRecords <- tibble::tribble(
  ~Destination,    ~durationMinutes, ~charge, ~connectionCharge,
  "FR",            1,                2.5,     0.3,
  "France",        1,                2.5,     0.3,
  "United States", 1,                2.5,     0.3,
  "USA",           1,                2.5,     0.3,
  "North America", 1,                2.5,     0.3
)

# see what special codes/country names you have to deal with
iso3cs <- countrycode(PSTNRecords$Destination, "country.name", "iso3c", warn = FALSE)
unique(PSTNRecords$Destination[is.na(iso3cs)])
#> [1] "FR"            "North America"

# decide how to deal with them
custom_matches <- c("FR" = "FRA", "North America" = "USA")

# use your custom codes
PSTNRecords %>%
  mutate(iso3c = countrycode(Destination, "country.name", "iso3c", custom_match = custom_matches))
#> # A tibble: 5 × 5
#>   Destination   durationMinutes charge connectionCharge iso3c
#>   <chr>                   <dbl>  <dbl>            <dbl> <chr>
#> 1 FR                          1    2.5              0.3 FRA  
#> 2 France                      1    2.5              0.3 FRA  
#> 3 United States               1    2.5              0.3 USA  
#> 4 USA                         1    2.5              0.3 USA  
#> 5 North America               1    2.5              0.3 USA
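
To connect this back to the summary in the question, the cleaned iso3c column can be used as the grouping key. This is only a sketch: it reuses the example data and the custom matches above, including the choice to count "North America" as USA, which may not be the right call for your records.

```r
library(dplyr)
library(countrycode)

PSTNRecords <- data.frame(
  Destination = c("FR", "France", "United States", "USA", "North America"),
  durationMinutes = c(1, 1, 1, 1, 1),
  charge = c(2.5, 2.5, 2.5, 2.5, 2.5),
  connectionCharge = c(0.3, 0.3, 0.3, 0.3, 0.3),
  stringsAsFactors = FALSE
)

# Same custom matches as above; mapping "North America" to USA is an assumption
custom_matches <- c("FR" = "FRA", "North America" = "USA")

SummarybyCountry <- PSTNRecords %>%
  mutate(iso3c = countrycode(Destination, "country.name", "iso3c",
                             custom_match = custom_matches)) %>%
  # Group on the normalized code instead of the raw Destination text
  group_by(iso3c) %>%
  summarize(
    Calls = n(),
    Minutes = sum(durationMinutes),
    MaxDuration = max(durationMinutes),
    AverageDuration = mean(durationMinutes),
    Charges = sum(charge),
    Fees = sum(connectionCharge)
  )

SummarybyCountry
```

With the example data this gives one row for FRA and one for USA, so "FR" and "France" are no longer counted separately.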
