How to group values in a column (R)
I'm creating a summary table that groups my records by destination country/region:
SummarybyLocation <- PSTNRecords %>%
  group_by(Destination) %>%
  summarize(
    Calls = n(),
    Minutes = sum(durationMinutes),
    MaxDuration = max(durationMinutes),
    AverageDuration = mean(durationMinutes),
    Charges = sum(charge),
    Fees = sum(connectionCharge)
  )
SummarybyLocation
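The resulting table looks like this:
I realized that the destination values are inconsistent: for example, "France" and "FR" both refer to the same region, and there is also a "North America" value, which I assume aggregates the United States and Canada.
I'd like to know whether there is a way to create custom groups for these values so that the aggregation makes more sense. I tried using the countrycode package to add an iso2c column, but that doesn't solve the problem of handling other regional aggregates such as "North America".
I would really appreciate some advice on how to handle this.
Thanks in advance!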
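Here is one possibility for cleaning the data, shown on a very small example. First, I get a list of country names together with their 2-letter and 3-letter abbreviations and put it into a data frame, countries. Then I left_join countries to df on the 2-letter code, which in this example matches FR. Next, I repeat the left_join using the 3-letter code, which has no matches in this case. Then I coalesce the two new columns, Country.x and Country.y, into one. After that, I use case_when for the multiple if-else conditions. First, if Country is not NA, I replace Destination with the full country name. If you have other items that may also need fixing (e.g., Europe), you can add further conditions here. Next, I replace North America with "United States-Canada-Mexico". Finally, I drop the columns that start with "Country".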
library(XML)
library(RCurl)
library(rlist)
library(tidyverse)

# Download the country-code page and parse its HTML tables
theurl <-
  getURL("https://www.iban.com/country-codes",
         .opts = list(ssl.verifypeer = FALSE))
countries <- readHTMLTable(theurl)
# Drop NULL entries and keep the first table
countries <-
  list.clean(countries, fun = is.null, recursive = FALSE)[[1]]

df %>%
  # Match Destination against the 2-letter codes
  left_join(.,
            countries %>% select(Country, `Alpha-2 code`),
            by = c("Destination" = "Alpha-2 code")) %>%
  # Match Destination against the 3-letter codes
  left_join(.,
            countries %>% select(Country, `Alpha-3 code`),
            by = c("Destination" = "Alpha-3 code")) %>%
  mutate(
    # Take whichever of the two joins found a country name
    Country = coalesce(Country.x, Country.y),
    Destination = case_when(
      !is.na(Country) ~ Country,
      Destination == "North America" ~ "United States-Canada-Mexico",
      TRUE ~ Destination
    )) %>%
  # Remove the helper columns Country, Country.x and Country.y
  select(-c(starts_with("Country")))
Output
                  Destination durationMinutes charge connectionCharge
1                      France            6.57   0.00                0
2                      France            3.34   1.94                0
3               United States          234.40   3.00                0
4 United States-Canada-Mexico           23.40   2.00                0
However, if you have many different variants, you may simply want to create a plain data frame of replacements, so that a single left_join is enough.
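The replacement-table idea can be sketched roughly as follows. This is only an illustration: the replacements data frame, its Clean column, and the cleaned result name are hypothetical and hand-written here, and the data is the same four-row example used in this answer.

```r
library(dplyr)

# Same small example data as used elsewhere in this answer
df <- data.frame(
  Destination = c("FR", "France", "United States", "North America"),
  durationMinutes = c(6.57, 3.34, 234.4, 23.4),
  charge = c(0, 1.94, 3, 2),
  connectionCharge = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per raw value that needs fixing
replacements <- data.frame(
  Destination = c("FR", "North America"),
  Clean = c("France", "United States-Canada-Mexico"),
  stringsAsFactors = FALSE
)

cleaned <- df %>%
  left_join(replacements, by = "Destination") %>%
  # Keep the original value when the lookup table has no entry for it
  mutate(Destination = coalesce(Clean, Destination)) %>%
  select(-Clean)

cleaned
```

New variants then only require a new row in replacements, rather than another branch in case_when.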
Another option is to also add a continent column, which you can get from countrycode.
library(countrycode)
countrycode(sourcevar = df$Destination,
            origin = "country.name",
            destination = "continent")
[1] NA         "Europe"   "Americas" NA
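The two NA results above come from "FR" and "North America", which are not recognized as country names. One way to fill them, sketched here under the assumption that "FR" belongs to Europe and "North America" to the Americas, is to pass a named vector to the custom_match argument of countrycode and then group by the new column. The names continent_fixes and by_continent are made up for this illustration.

```r
library(dplyr)
library(countrycode)

df <- data.frame(
  Destination = c("FR", "France", "United States", "North America"),
  durationMinutes = c(6.57, 3.34, 234.4, 23.4),
  charge = c(0, 1.94, 3, 2),
  connectionCharge = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Assumed mapping for the values that are not matched by country name
continent_fixes <- c("FR" = "Europe", "North America" = "Americas")

by_continent <- df %>%
  mutate(Continent = countrycode(Destination,
                                 origin = "country.name",
                                 destination = "continent",
                                 custom_match = continent_fixes)) %>%
  group_by(Continent) %>%
  summarize(Calls = n(), Minutes = sum(durationMinutes))

by_continent
```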
Data
df <- structure(list(Destination = c("FR", "France", "United States",
"North America"), durationMinutes = c(6.57, 3.34, 234.4, 23.4
), charge = c(0, 1.94, 3, 2), connectionCharge = c(0, 0, 0, 0
)), class = "data.frame", row.names = c(NA, -4L))
The {countrycode} package can easily handle custom names/codes...
library(tidyverse)
library(countrycode)

PSTNRecords <- tibble::tribble(
  ~Destination,    ~durationMinutes, ~charge, ~connectionCharge,
  "FR",            1,                2.5,     0.3,
  "France",        1,                2.5,     0.3,
  "United States", 1,                2.5,     0.3,
  "USA",           1,                2.5,     0.3,
  "North America", 1,                2.5,     0.3
)

# see what special codes/country names you have to deal with
iso3cs <- countrycode(PSTNRecords$Destination, "country.name", "iso3c", warn = FALSE)
unique(PSTNRecords$Destination[is.na(iso3cs)])
#> [1] "FR"            "North America"

# decide how to deal with them
custom_matches <- c("FR" = "FRA", "North America" = "USA")

# use your custom codes
PSTNRecords %>%
  mutate(iso3c = countrycode(Destination, "country.name", "iso3c", custom_match = custom_matches))
#> # A tibble: 5 × 5
#>   Destination   durationMinutes charge connectionCharge iso3c
#>   <chr>                   <dbl>  <dbl>            <dbl> <chr>
#> 1 FR                          1    2.5              0.3 FRA
#> 2 France                      1    2.5              0.3 FRA
#> 3 United States               1    2.5              0.3 USA
#> 4 USA                         1    2.5              0.3 USA
#> 5 North America               1    2.5              0.3 USA
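To tie this back to the summary in the question, the cleaned iso3c column can be used as the grouping key instead of Destination. The following is a minimal sketch that reuses the example data and the custom_matches vector from above. Mapping "North America" to "USA" is the choice made in this answer, not something the data dictates.

```r
library(dplyr)
library(countrycode)

PSTNRecords <- tibble::tribble(
  ~Destination,    ~durationMinutes, ~charge, ~connectionCharge,
  "FR",            1,                2.5,     0.3,
  "France",        1,                2.5,     0.3,
  "United States", 1,                2.5,     0.3,
  "USA",           1,                2.5,     0.3,
  "North America", 1,                2.5,     0.3
)

custom_matches <- c("FR" = "FRA", "North America" = "USA")

SummarybyLocation <- PSTNRecords %>%
  mutate(iso3c = countrycode(Destination, "country.name", "iso3c",
                             custom_match = custom_matches)) %>%
  # Group on the cleaned code rather than the raw Destination text
  group_by(iso3c) %>%
  summarize(
    Calls = n(),
    Minutes = sum(durationMinutes),
    MaxDuration = max(durationMinutes),
    AverageDuration = mean(durationMinutes),
    Charges = sum(charge),
    Fees = sum(connectionCharge)
  )

SummarybyLocation
```

With the example data this collapses the five raw destinations into two groups, FRA and USA.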