How can I generalize a lot of categorical variables in R?

Question

I have the following df in R:

ID      GENDER        COUNTRY
1         M             US
2         M             UK
3         F             JPN
4         F             NED

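There are more than 50 different countries, and I want to summarize this information as follows. If the person is from one of the 10 most popular countries (the popular countries being those with the most records), COUNTRY_POPULAR should be 1, otherwise 0. For example, suppose US and UK happen to be among the 10 most frequent countries in this df, while JPN and NED are not: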

ID      GENDER        COUNTRY         COUNTRY_POPULAR 
1         M             US                   1
2         M             UK                   1
3         F             JPN                  0
4         F             NED                  0

Answer 1

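Assuming "popular" means appearing most often in the data, one approach is to create a temporary column holding each country's rank.

Assuming your data.frame is called df:

# count how many times each country appears (note: ddply returns the rows grouped by COUNTRY)
df <- plyr::ddply(df, "COUNTRY", plyr::mutate, n = length(COUNTRY))

# rank the counts so that 1 = most frequent; tied countries share a rank
df$rank <- match(df$n, sort(unique(df$n), decreasing = TRUE))

# create the popular column
df$COUNTRY_POPULAR <- ifelse(df$rank <= 10, 1, 0)

You can then drop the helper columns n and rank.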

Edit:

There is no need to summarise and then merge; you can use mutate instead.
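
As a rough illustration of the count-then-rank idea above, here is a minimal base R sketch on made-up data. The toy data frame, the cutoff top_k and the helper vectors n_records and count_rank are assumptions for illustration only and do not come from the original answer.

```r
# Hypothetical toy data: US x3, UK x2, JPN x1, NED x1
df <- data.frame(
  ID = 1:7,
  GENDER = c("M", "M", "F", "F", "M", "F", "M"),
  COUNTRY = c("US", "UK", "JPN", "NED", "US", "UK", "US")
)

top_k <- 2  # assumed cutoff for the toy data; use 10 for the real data

# Per-row count of the row's country; ave() keeps the original row order
n_records <- ave(seq_along(df$COUNTRY), df$COUNTRY, FUN = length)

# Dense rank of the counts: 1 = most frequent, tied countries share a rank
count_rank <- match(n_records, sort(unique(n_records), decreasing = TRUE))

# Flag rows whose country is within the top_k ranks
df$COUNTRY_POPULAR <- ifelse(count_rank <= top_k, 1, 0)
```

Because the helpers are kept as standalone vectors rather than columns, there is nothing to drop afterwards. Note that with a dense rank, ties can let more than top_k countries through.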

Answer 2

In base R, we can use table to count the occurrences of each country, sort them, select the top 10 countries with tail, and assign 1/0 values based on their presence/absence.

df$COUNTRY_POPULAR <- +(df$COUNTRY %in% names(tail(sort(table(df$COUNTRY)), 10)))

The leading + converts the logical values TRUE/FALSE to 1/0 respectively.
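
To make the one-liner easier to follow, the sketch below splits it into its intermediate steps on made-up data. The toy data frame and the cutoff of 2 (instead of 10) are assumptions for illustration.

```r
# Hypothetical toy data: US x3, UK x2, JPN x1, NED x1
df <- data.frame(
  COUNTRY = c("US", "UK", "JPN", "NED", "US", "UK", "US")
)

counts <- table(df$COUNTRY)           # frequency of each country
sorted_counts <- sort(counts)         # ascending, so the most frequent come last
top <- names(tail(sorted_counts, 2))  # 2 instead of 10 for the toy data

is_popular <- df$COUNTRY %in% top     # logical vector, one value per row
df$COUNTRY_POPULAR <- +is_popular     # unary + coerces TRUE/FALSE to integer 1/0
```

tail always keeps exactly the requested number of names, so if several countries are tied at the cutoff, which of them is kept depends on the sort order.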

Answer 3

If you want to do this with dplyr, here is another option, which also lets you deal with countries that may be tied in rank:

library(dplyr)

# Get the top 10 countries. count(sort = TRUE) orders them by frequency;
# slice(1:10) keeps exactly 10, so ties at the cutoff are broken by row order.
top_10 <-
  df %>%
  count(COUNTRY, sort = TRUE) %>%
  slice(1:10) %>%
  pull(COUNTRY)


# If the country is in the top 10, assign 1, otherwise 0.
df <- df %>%
  mutate(COUNTRY_POPULAR = if_else(COUNTRY %in% top_10, 1, 0))
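
If you would rather keep every country tied at the cutoff instead of truncating with slice(1:10), one possible variant uses slice_max with with_ties = TRUE (available in dplyr 1.0.0 and later). This is a sketch on made-up data; the toy data frame, the cutoff of 2 and the names n_records and top_with_ties are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data: US x3, UK x2, JPN x2, NED x1
df <- data.frame(
  COUNTRY = c("US", "UK", "JPN", "NED", "US", "UK", "US", "JPN")
)

# Keep the 2 most frequent countries, plus any country tied with the last one kept
top_with_ties <-
  df %>%
  count(COUNTRY, name = "n_records") %>%
  slice_max(n_records, n = 2, with_ties = TRUE) %>%
  pull(COUNTRY)

# Flag rows whose country made the cut
df <- df %>%
  mutate(COUNTRY_POPULAR = if_else(COUNTRY %in% top_with_ties, 1, 0))
```

Here UK and JPN are tied for second place, so both are kept and three countries end up flagged. Setting with_ties = FALSE would return exactly two.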

Tags: r, dataframe, dplyr, feature-engineering