Find frequencies of multiple strings combined and plot?

Question

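I have a df that looks like this:

tags	Year
North America , Economy - Goods , Economy - Spending - Federal	2008
Europe , Economy - Goods , Trading History & Goods	2009

The tags are not in any particular order, and there is a space after the last word before each comma.

"North America" and "Europe" would be considered master tags, and the rest are subtags.

I'm trying to find the frequency of each master tag + subtag combination, so that I get:

North America, Economy - Goods	Europe, Economy - Goods	Year
1	0	2008
0	1	2009

I then want to plot the frequencies of both columns in the same chart.

Based on a previous question, I used this code to find the frequencies of individual tags and plot them: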

library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(tags = c("North America , Economy - Goods , Economy - Spending - Federal", "Europe , Economy - Goods , Trading History & Goods"), Year = c(2008, 2009))

# One tag per row, keep the part before the first " - ", then count per year.
# Backslashes must be doubled inside R string literals.
df1 <- df %>%
  separate_rows(tags, sep = ',\\s*') %>%
  separate(tags, c('tags', 'Value'), sep = '\\s*-\\s*', fill = 'right', extra = 'merge') %>%
  mutate(tags = trimws(tags)) %>%
  count(Year, tags) %>%
  pivot_wider(names_from = tags, values_from = n, values_fill = 0)

# Subset for specific tags
df2 <- subset(df1, select = c("Year", "North America", "Europe"))

# Reshape data frame for ggplot
df3 <- data.frame(x = df2$Year,
                  y = c(df2$"North America", df2$"Europe"),
                  group = c(rep("North America", nrow(df2)),
                            rep("Europe", nrow(df2))))

# Plot
ggplot(df3, aes(x, y, col = group)) +
  geom_line()

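But I'm not sure how to change this code to combine multiple tags.

Thanks!

Edit: not sure if this will help anyone, but I'm posting the solution I used, which worked for my df: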

library(reshape2)  # melt() comes from reshape2; dplyr and ggplot2 are also required

# Rows tagged with both "North America" and "Economy - Goods", counted per year
df_1 <- df %>%
  filter(grepl("\\bNorth America\\b", tags)) %>%
  filter(grepl("Economy - Goods", tags)) %>%
  group_by(Year) %>%
  tally()

# Rows tagged with both "Europe" and "Economy - Goods", counted per year
df_2 <- df %>%
  filter(grepl("Europe", tags)) %>%
  filter(grepl("Economy - Goods", tags)) %>%
  group_by(Year) %>%
  tally()

# merge table (the two count columns come out as n.x and n.y)
df3 <- full_join(df_1, df_2, by = "Year")
df3[is.na(df3)] <- 0

# plot
dfm <- melt(df3, id.vars = "Year")

p <- ggplot(dfm, aes(x = Year, y = value, colour = variable))
p + geom_line() +
  scale_x_continuous(limits = c(1999, 2020), breaks = seq(1999, 2020, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
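
The two near-identical filter blocks above could also be folded into one function. What follows is only a minimal sketch: `count_combo()` is a hypothetical helper (it is not part of any package), and it assumes the columns are named `tags` and `Year` as in the example df. It matches the tags literally with `stringr::fixed()`, so a subtag that is a prefix of a longer subtag would also be counted.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  tags = c("North America , Economy - Goods , Economy - Spending - Federal",
           "Europe , Economy - Goods , Trading History & Goods"),
  Year = c(2008, 2009)
)

# Hypothetical helper: count rows per Year whose tags contain both
# the master tag and the subtag (literal substring match).
count_combo <- function(data, master, subtag) {
  data %>%
    filter(str_detect(tags, fixed(master)),
           str_detect(tags, fixed(subtag))) %>%
    count(Year, name = paste(master, subtag, sep = ", "))
}

# Master tag / subtag pairs of interest
combos <- list(c("North America", "Economy - Goods"),
               c("Europe", "Economy - Goods"))

counts <- lapply(combos, function(x) count_combo(df, x[1], x[2]))

# Join all per-combination counts on Year and fill missing years with 0
result <- Reduce(function(a, b) full_join(a, b, by = "Year"), counts)
result[is.na(result)] <- 0
```

Here `result` has one column per combination, named after the pair, so it can be passed to `melt()` and plotted in the same way as `df3` above.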

Answer 1

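Your example dataset is probably too small for our answers to be very useful. Also, the plot you propose is odd, since the years are treated as numbers. If my suggestion is not what you're looking for, you may want to hand-draw the plot you want and post it.

Still, based on my understanding, here is a possible solution to your problem: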

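library(tidyverse)

master_tags = c("North America", "Europe")
# Build one alternation pattern so every row is checked against all master tags;
# passing the vector directly would pair rows and patterns element by element.
master_pattern = paste(master_tags, collapse = "|")
df1 = df %>%
  mutate(master_tag = str_extract(tags, master_pattern)) %>%
  separate_rows(tags, sep = ',\\s*') %>%
  mutate(tags = str_trim(tags)) %>%
  filter(!tags %in% master_tags)

# Count subtags per master tag, one panel per year
ggplot(df1, aes(x=master_tag, fill=tags)) +
  geom_histogram(stat="count", position="dodge") +
  facet_wrap("Year")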

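Created on 2021-06-02 by the reprex package (v2.0.0)

Here, we can use stringr::str_extract() to extract the master tag, and dplyr::filter() to drop it from the rows where it appears as a tag itself. Then we can do the counting directly with geom_histogram().

Obviously, this plot would be more informative with a larger input dataset.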
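
If the wide table from the question is what is wanted, the same extract-then-split steps can be extended to count each master tag + subtag pair per year. This is a rough sketch under a few assumptions: every row contains exactly one master tag, and the names `combo`, `combo_counts`, `wide_counts`, and `wanted` are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

df <- data.frame(
  tags = c("North America , Economy - Goods , Economy - Spending - Federal",
           "Europe , Economy - Goods , Trading History & Goods"),
  Year = c(2008, 2009)
)

master_tags <- c("North America", "Europe")
master_pattern <- paste(master_tags, collapse = "|")

# One row per (Year, "master, subtag") combination with its count
combo_counts <- df %>%
  mutate(master_tag = str_extract(tags, master_pattern)) %>%
  separate_rows(tags, sep = "\\s*,\\s*") %>%
  mutate(tags = str_trim(tags)) %>%
  filter(!tags %in% master_tags) %>%
  mutate(combo = paste(master_tag, tags, sep = ", ")) %>%
  count(Year, combo)

# Wide layout: one column per combination, 0 where a combination is absent
wide_counts <- combo_counts %>%
  pivot_wider(names_from = combo, values_from = n, values_fill = 0)

# Plot only the combinations of interest, zeros included
wanted <- c("North America, Economy - Goods", "Europe, Economy - Goods")
plot_data <- wide_counts %>%
  select(Year, all_of(wanted)) %>%
  pivot_longer(-Year, names_to = "combo", values_to = "n")

p <- ggplot(plot_data, aes(x = Year, y = n, colour = combo)) +
  geom_line() +
  geom_point()
p
```

Going through the wide table before plotting keeps the explicit zeros, so each line is drawn for every year rather than only for the years in which the combination occurs.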
