Trying to make a ggplot with two lines

The data can be found here: https://www.kaggle.com/tovarischsukhov/southparklines

library(dplyr)  # %>%, select(), arrange(), filter(), group_by()
library(tidyr)  # pivot_wider()

SP = read.csv("/Users/michael/Desktop/stat 479 proj data/All-seasons.csv")
SP$Season = as.numeric(SP$Season)
SP$Episode = as.numeric(SP$Episode)

Clean.Boys = SP  %>% select(Season, Episode, Character) %>% 
  arrange(Season, Episode, Character) %>% 
  filter(Character == "Kenny"   | Character == "Cartman") %>% 
  group_by(Season, Episode) 

count = table(Clean.Boys)
count = as.data.frame(count)
Clean = count %>% pivot_wider(names_from = Character, values_from = Freq) %>% group_by(Episode)

  Season Episode Cartman Kenny
  <fct>  <fct>     <int> <int>
1 1      1            85     5
2 2      1             1     0
3 3      1            43    19
4 4      1            83     6
5 5      1            37     3
6 6      1            67     0

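I am trying to use ggplot to make a plot with 2 lines on it, one for the Cartman variable and one for the Kenny variable. My two questions are:

  1. Is my data in the correct format to plot with geom_line()? Or do I have to pivot it longer?

  2. I want to plot the x scale as a continuous variable, similar to a date, but it is season and episode. For example, the first plotted point is season 1 episode 1, then season 1 episode 2, and so on. I am stuck on how to do this with season and episode in separate columns, and even if I combined them I am not sure what the right format would be.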

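The trick is to gather the columns you want to map into variables. I don't know how you want to draw your plot, meaning what should go on the x-axis and the y-axis, so I made a dummy plot. For the continuous variable part, you can use as.integer() or as.numeric() to convert the values to integers or numerics, which can then be used on a continuous scale. You can check the structure of your variables by calling str(df), which shows the class of each variable; if one is a factor or character, convert it to numeric.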
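One caveat when the column is a factor: as.numeric() on a factor returns the internal level codes, not the labels. Here is a minimal sketch of the check and the conversion, using a made-up toy data frame (df and its values are assumptions for illustration, not the South Park data):

```r
# Hypothetical toy data: Episode is a factor, as it is after table() + as.data.frame()
df <- data.frame(
  Episode = factor(c("10", "2", "1")),
  Freq    = c(3L, 5L, 7L)
)

# Show the class of every column
str(df)

# as.numeric() on a factor returns the level codes, not the labels
codes <- as.numeric(df$Episode)

# Going through as.character() first recovers the actual numbers
values <- as.numeric(as.character(df$Episode))
```

The two results agree only when each level's label equals its position among the levels, for example levels 1, 2, 3 in that order.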

#libraries
library(ggplot2)
library(dplyr)
library(tidyr)

#your code
SP <- read.csv("C:/Users/saura/Desktop/All-seasons.csv")
SP$Season = as.numeric(SP$Season)
#> Warning: NAs introduced by coercion
SP$Episode = as.numeric(SP$Episode)
#> Warning: NAs introduced by coercion

Clean.Boys = SP  %>% select(Season, Episode, Character) %>% 
  arrange(Season, Episode, Character) %>% 
  filter(Character == "Kenny"   | Character == "Cartman") %>% 
  group_by(Season, Episode) 
count = table(Clean.Boys)
count = as.data.frame(count)
Clean = count %>% pivot_wider(names_from = Character, values_from = Freq) %>% group_by(Episode)

# Reshape to long format so that one column can be mapped to colour.
# I don't know what you want on your axes, so this is only a dummy plot.
new_df <- Clean %>%
  gather(-Season,-Episode, key = "Views", value = "numbers")

ggplot(data = new_df, aes(
  as.numeric(Episode),
  numbers,
  color = Views,
  group = Views
)) +
  geom_path()

Created on 2022-02-19 by the reprex package (v2.0.1)

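In this example I use readr::read_csv to read the file and set the variable types in the call, which saves doing that in separate lines of code.

The frequency count can be done within the piped workflow using dplyr::summarise.

I'm not sure what you really mean by wanting to keep the season and episode data as a continuous variable; you would have to be more explicit about how you want it to look. The approach I've taken is to show season and episode using as little text as possible. By default, seasons and episodes sort in numeric order, but once they are combined into a character string, factor has to be used to force them into numeric order. Another approach would be to facet by season.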
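For the faceting alternative, a rough sketch could look like this. The toy data frame and its values are made up for illustration; only the column names follow the CSV.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data in the same shape as the CSV (one row per spoken line)
toy <- data.frame(
  Season    = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  Episode   = c(1, 1, 2, 2, 1, 1, 2, 2, 2),
  Character = c("Cartman", "Kenny", "Cartman", "Kenny",
                "Cartman", "Kenny", "Cartman", "Cartman", "Kenny")
)

# One row per season, episode and character, with the number of lines
counts <- toy %>%
  count(Season, Episode, Character, name = "count")

# Episode stays numeric on x; each season gets its own panel
p <- ggplot(counts, aes(Episode, count, colour = Character)) +
  geom_line() +
  facet_wrap(~ Season, labeller = label_both)
```

Printing p draws one panel per season, so the episode number can stay a plain numeric axis and no combined label is needed.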

ggplot likes data in long format, so there is no need to convert the data to wide format.
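As a sketch of how long data maps to two lines, and of one possible reading of "continuous" season and episode: number the episodes consecutively across seasons and put that index on x. The toy counts and the ep_index column are assumptions made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format counts: one row per season, episode and character
counts <- data.frame(
  Season    = c(1, 1, 1, 1, 2, 2, 2, 2),
  Episode   = c(1, 1, 2, 2, 1, 1, 2, 2),
  Character = rep(c("Cartman", "Kenny"), times = 4),
  count     = c(85, 5, 40, 12, 30, 0, 22, 9)
)

# Lookup table with one consecutive index per (Season, Episode) pair
episodes <- counts %>%
  distinct(Season, Episode) %>%
  arrange(Season, Episode) %>%
  mutate(ep_index = row_number())

# Attach the index to every row of the long data
indexed <- counts %>%
  left_join(episodes, by = c("Season", "Episode"))

# Numeric x gives a continuous scale; colour splits the data into one line per character
p <- ggplot(indexed, aes(ep_index, count, colour = Character)) +
  geom_line()
```

The index loses the season boundaries on the axis, so whether it beats a combined label or facets depends on what the plot should emphasise.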

To keep the chart readable, only the first 80 observations are shown.

library(readr)
library(dplyr)
library(ggplot2)

# col_types = "nncc": Season and Episode as numbers, Character and Line as text
SP <- read_csv("...your file path.../All-seasons.csv", col_types = "nncc")

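# Count lines per season, episode and character, staying in long format
Clean.Boys <-
  SP %>%
  select(-Line) %>%
  arrange(Season, Episode, Character) %>%
  filter(Character == "Kenny" | Character == "Cartman") %>%
  group_by(Season, Episode, Character) %>%
  summarise(count = n(), .groups = "drop") %>%
  # Rows are now sorted by Season and Episode, so order of appearance is
  # numeric order; use it explicitly as the factor level order
  mutate(x_lab = paste(Season, Episode, sep = "\n"),
         x_lab = factor(x_lab, levels = unique(x_lab))) %>%
  head(n = 80)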

ggplot(Clean.Boys)+
  geom_line(aes(x_lab, count, group = Character, colour = Character))+
  labs(x = "Season and episode")

Created on 2022-02-20 by the reprex package (v2.0.1)