How to group values in a column (R)

I am creating a summary table that groups my records by destination country/region:

SummarybyLocation <- PSTNRecords %>%
  group_by(Destination) %>%
  summarize(
    Calls = n(),
    Minutes = sum(durationMinutes),
    MaxDuration = max(durationMinutes),
    AverageDuration = mean(durationMinutes),
    Charges = sum(charge),
    Fees = sum(connectionCharge)
  )

SummarybyLocation

The resulting table looks like this:

I have realized the Destination values are inconsistent (for example, "France" and "FR" both refer to the same region), and there is also a "North America" entry, which I think lumps together the US and Canada.

I am wondering whether there is a way to create custom groups for these values so the aggregation is more meaningful. I tried using the countrycode package to add an iso2c column, but that did not solve the problem of handling other regional aggregates like "North America".

I would really appreciate some advice on how to approach this.

Thanks in advance!

Here is one possibility for cleaning the data, using a very small example. First, I get a list of country names along with the 2- and 3-letter abbreviations and put them into a data frame, countries. Then, I left_join countries to df by the two-letter code, which in this case matches FR. Then, I repeat the left_join but with the 3-letter code; in this case there are no matches. Next, I coalesce the two new columns, Country.x and Country.y, into one. Then, I use case_when for multiple if-else statements. First, if Country is not NA, I replace Destination with the full country name. If you have other items that may also need fixing (e.g., Europe), you could add additional arguments here. Next, I replace North America with "United States-Canada-Mexico". Finally, I drop the columns that start with "Country".

library(XML)
library(RCurl)
library(rlist)
library(tidyverse)

theurl <-
  getURL("https://www.iban.com/country-codes",
         .opts = list(ssl.verifypeer = FALSE))
countries <- readHTMLTable(theurl)
countries <-
  list.clean(countries, fun = is.null, recursive = FALSE)[[1]]


df %>%
  left_join(.,
            countries %>% select(Country, `Alpha-2 code`),
            by = c("Destination" = "Alpha-2 code")) %>%
  left_join(.,
            countries %>% select(Country, `Alpha-3 code`),
            by = c("Destination" = "Alpha-3 code")) %>%
  mutate(
    Country = coalesce(Country.x, Country.y),
    Destination = case_when(
      !is.na(Country) ~ Country,
      Destination == "North America" ~ "United States-Canada-Mexico",
      TRUE ~ Destination
    )
  ) %>%
  select(-starts_with("Country"))

Output

                  Destination durationMinutes charge connectionCharge
1                      France            6.57   0.00                0
2                      France            3.34   1.94                0
3               United States          234.40   3.00                0
4 United States-Canada-Mexico           23.40   2.00                0

However, if you have many different variants, then you may just want to create a simple data frame with the replacements, so that you only need a single left_join.
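For instance, a single left_join against a hand-maintained lookup table might look like this (the lookup tibble and its CleanDestination column are illustrative, not part of the original answer):

```r
library(tidyverse)

# same example data as in the "Data" section below
df <- tibble(
  Destination = c("FR", "France", "United States", "North America"),
  durationMinutes = c(6.57, 3.34, 234.4, 23.4)
)

# hand-maintained lookup of raw values -> cleaned names (illustrative)
lookup <- tribble(
  ~Destination,    ~CleanDestination,
  "FR",            "France",
  "North America", "United States-Canada-Mexico"
)

df %>%
  left_join(lookup, by = "Destination") %>%
  mutate(Destination = coalesce(CleanDestination, Destination)) %>%
  select(-CleanDestination)
```

Any raw value not present in the lookup simply passes through unchanged, so the table only needs rows for the variants you actually want to remap.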

Another option is to also add a continent column, which you can get from countrycode.

library(countrycode)

countrycode(sourcevar = df$Destination,
            origin = "country.name",
            destination = "continent")

[1] NA         "Europe"   "Americas" NA   
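That vector could be attached as a column and used as a coarser grouping level, falling back to the raw Destination wherever no continent is found. A sketch of that idea (the Region column name is my own choice, not from the original answer):

```r
library(tidyverse)
library(countrycode)

df <- tibble(
  Destination = c("FR", "France", "United States", "North America"),
  charge = c(0, 1.94, 3, 2)
)

df %>%
  mutate(
    Continent = countrycode(Destination, "country.name", "continent", warn = FALSE),
    # fall back to the raw value where no continent was matched
    Region = coalesce(Continent, Destination)
  ) %>%
  group_by(Region) %>%
  summarize(Charges = sum(charge))
```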

Data

df <- structure(list(Destination = c("FR", "France", "United States", 
"North America"), durationMinutes = c(6.57, 3.34, 234.4, 23.4
), charge = c(0, 1.94, 3, 2), connectionCharge = c(0, 0, 0, 0
)), class = "data.frame", row.names = c(NA, -4L))

The {countrycode} package can easily handle custom names/codes...

library(tidyverse)
library(countrycode)

PSTNRecords <- tibble::tribble(
  ~Destination,    ~durationMinutes, ~charge, ~connectionCharge,
  "FR",            1,                2.5,     0.3,
  "France",        1,                2.5,     0.3,
  "United States", 1,                2.5,     0.3,
  "USA",           1,                2.5,     0.3,
  "North America", 1,                2.5,     0.3
)

# see what special codes/country names you have to deal with
iso3cs <- countrycode(PSTNRecords$Destination, "country.name", "iso3c", warn = FALSE)
unique(PSTNRecords$Destination[is.na(iso3cs)])
#> [1] "FR"            "North America"

# decide how to deal with them
custom_matches <- c("FR" = "FRA", "North America" = "USA")

# use your custom codes
PSTNRecords %>%
  mutate(iso3c = countrycode(Destination, "country.name", "iso3c", custom_match = custom_matches))
#> # A tibble: 5 × 5
#>   Destination   durationMinutes charge connectionCharge iso3c
#>   <chr>                   <dbl>  <dbl>            <dbl> <chr>
#> 1 FR                          1    2.5              0.3 FRA  
#> 2 France                      1    2.5              0.3 FRA  
#> 3 United States               1    2.5              0.3 USA  
#> 4 USA                         1    2.5              0.3 USA  
#> 5 North America               1    2.5              0.3 USA
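With the iso3c column in place, the original summary can then group on the cleaned code instead of the raw Destination. A sketch combining the two (the summarize() columns mirror the question; the exact set you keep is up to you):

```r
library(tidyverse)
library(countrycode)

PSTNRecords <- tibble::tribble(
  ~Destination,    ~durationMinutes, ~charge, ~connectionCharge,
  "FR",            1,                2.5,     0.3,
  "France",        1,                2.5,     0.3,
  "United States", 1,                2.5,     0.3,
  "USA",           1,                2.5,     0.3,
  "North America", 1,                2.5,     0.3
)

custom_matches <- c("FR" = "FRA", "North America" = "USA")

PSTNRecords %>%
  mutate(iso3c = countrycode(Destination, "country.name", "iso3c",
                             custom_match = custom_matches)) %>%
  group_by(iso3c) %>%
  summarize(
    Calls = n(),
    Minutes = sum(durationMinutes),
    Charges = sum(charge),
    Fees = sum(connectionCharge)
  )
```

This collapses the five raw Destination variants into two groups, FRA and USA, so the per-country totals are no longer split across spelling variants.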