How to find frequencies of multiple IDs from one column by year and plot?

Question

I have a df that looks like this:

ID	Year
Nation, Nation - NA, Economy, Economy - Asia	2008
Economy, Economy - EU, State, Nation	2009

I want to extract the frequency of each ID so that the result looks like this:

Nation	Economy	State	Year
2	2	0	2008
1	2	1	2009

For IDs with a hyphen, such as "Economy - EU", I only want to count them toward the frequency of "Economy".

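My end goal is to plot this df by year, with the frequency counts of the different IDs in the same plot. For example, "Nation" in 2008 is a green dot, "State" in 2008 is a red dot, and "Economy" in 2008 is a blue dot.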

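If the second df is not a good approach, I am open to suggestions! It was just my first thought on how to start.

I will post this as a separate question if it does not fit here, but my next question is: how do I plot the frequencies from the second df by year, as described above?

Thanks!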

Answer 1

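You can use separate_rows to split the data into separate rows on the comma (,). Then move the part after the - into its own column, count the occurrences of each ID value within each Year, and reshape the result to wide format.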

library(dplyr)
library(tidyr)

df %>%
  # backslashes must be doubled inside R string literals
  separate_rows(ID, sep = ',\\s*') %>%
  separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*', fill = 'right') %>%
  count(Year, ID) %>%
  pivot_wider(names_from = ID, values_from = n, values_fill = 0)

#   Year Economy Nation State
#  <int>   <int>  <int> <int>
#1  2008       2      2     0
#2  2009       2      1     1

You can also use janitor::tabyl to shorten the code.

df %>%
  separate_rows(ID, sep = ',\\s*') %>%
  separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*', fill = 'right') %>%
  janitor::tabyl(Year, ID)

Data

df <- structure(list(ID = c("Nation, Nation - NA, Economy, Economy - Asia", 
"Economy, Economy - EU, State, Nation"), Year = 2008:2009), 
class = "data.frame", row.names = c(NA, -2L))
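
If the counts are only needed for plotting, the long form may already be enough. The following is a minimal sketch, not part of the pipeline above: it assumes the same df, strips everything from the hyphen onward with base sub() instead of separate(), and uses tidyr::complete to add the missing Year/ID combinations with a count of zero. The name long_counts is made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c("Nation, Nation - NA, Economy, Economy - Asia",
         "Economy, Economy - EU, State, Nation"),
  Year = 2008:2009
)

long_counts <- df %>%
  # one row per comma-separated entry
  separate_rows(ID, sep = ",\\s*") %>%
  # drop the " - suffix" part so "Economy - EU" becomes "Economy"
  mutate(ID = sub("\\s*-.*$", "", ID)) %>%
  # count each ID within each Year
  count(Year, ID) %>%
  # add Year/ID pairs that never occur, with a count of 0
  complete(Year, ID, fill = list(n = 0L))

long_counts
```

Each Year then has one row per ID, including IDs that do not occur in that Year, which keeps the groups aligned when the result is passed to ggplot.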

Answer 2

We can use str_count to count the strings, grouped by Year
Use pivot_longer to reshape the data for ggplot
Use ggplot for a bar chart (a basic version is shown)

library(tidyverse)

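# table: count occurrences of each ID per Year
# (sum() keeps one row per Year even if a Year spans several rows)
df <- df %>%
  group_by(Year) %>%
  summarise(Nation = sum(str_count(ID, "Nation")),
            Economy = sum(str_count(ID, "Economy")),
            State = sum(str_count(ID, "State")))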

df
# preparation for plotting
df1 <- df %>% 
  pivot_longer(
    cols = -Year,
    names_to = "names",
    values_to = "values"
  ) 

# plot
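# use the same dodge width for bars and labels so the labels sit above their bars
ggplot(df1, aes(x = factor(names), y = values, fill = factor(Year), label = values)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_text(size = 4, position = position_dodge(width = 0.9), vjust = -0.5)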

Output:

   Year Nation Economy State
* <dbl>  <int>   <int> <int>
1  2008      2       2     0
2  2009      1       2     1

Answer 3

I think this has already been fully solved, but since your question says the end goal is a plot, I don't think pivot_wider is necessary.

library(tidyverse)
df <- structure(list(ID = c("Nation, Nation - NA, Economy, Economy - Asia", 
                            "Economy, Economy - EU, State, Nation"), Year = 2008:2009), 
                class = "data.frame", row.names = c(NA, -2L))

df %>%
  separate_rows(ID, sep = ',\\s*') %>%
  separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*', fill = 'right') %>%
  count(Year, ID) %>%
  # fill colors the bars themselves; color alone only colors their outlines
  ggplot(aes(x = as.factor(Year), y = n, color = ID, fill = ID)) +
  geom_col(position = 'dodge') +
  coord_flip()

Or

df %>%
  separate_rows(ID, sep = ',\\s*') %>%
  separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*', fill = 'right') %>%
  count(Year, ID) %>%
  # fill colors the bars; color also applies to the text labels
  ggplot(aes(x = as.factor(Year), y = n, color = ID, fill = ID, label = paste(ID, n, sep = '-'))) +
  geom_col(position = 'dodge') +
  geom_text(size = 2, position =position_dodge(0.9), vjust = -0.5)

Created on 2021-05-27 by the reprex package (v2.0.0)
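
Since the question describes dots rather than bars, the same counts could also be drawn with geom_point. This is only a sketch under a few assumptions: it uses the example df from the question, the object names counts and p are made up, and the colors passed to scale_color_manual are one possible assignment chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(
  ID = c("Nation, Nation - NA, Economy, Economy - Asia",
         "Economy, Economy - EU, State, Nation"),
  Year = 2008:2009
)

# long-form counts: one row per Year/ID combination that occurs
counts <- df %>%
  separate_rows(ID, sep = ",\\s*") %>%
  separate(ID, c("ID", "Value"), sep = "\\s*-\\s*", fill = "right") %>%
  count(Year, ID)

# one dot per ID and Year; a small dodge keeps equal counts from overlapping
p <- ggplot(counts, aes(x = factor(Year), y = n, color = ID)) +
  geom_point(size = 3, position = position_dodge(width = 0.3)) +
  # illustrative colors only; adjust to taste
  scale_color_manual(values = c(Nation = "green", State = "red", Economy = "blue")) +
  labs(x = "Year", y = "Frequency", color = "ID")

p
```

IDs that do not occur in a given Year simply have no dot; if a dot at zero is wanted, the missing combinations would have to be added to counts first.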
