Summarise: dividing two columns as a percent

Question

I am having trouble finding the percentage of Canada geese killed during migration season using the Airplane Strikes dataset.

#airline stats table
airlines <- sd4 %>% 
group_by(STATE) %>% 
filter(SPECIES == "Canada goose" & total_kills > 1) %>% 
mutate(fall_mig_kills = ifelse(SPECIES=="Canada goose" & INCIDENT_MONTH %in% c(9,10,11),total_kills,0)) %>% 
summarise(
pct_mig_kills = fall_mig_kills/total_kills
) %>% 
select(STATE,SPECIES,INCIDENT_MONTH,total_kills,fall_mig_kills,pct_mig_kills)

This is where I get the error: summarise( pct_mig_kills = fall_mig_kills/total_kills )

The error is:

Error in summarise_impl(.data, dots) : 
Column `pct_mig_kills` must be length 1 (a summary value), not 10

I'm not sure how dividing two integer columns gives me a value longer than length 1.

Any help would be greatly appreciated!

Benjamin

Answer 1

Let's read in the data, document everything, and see where your error comes from.

In general, you should link to your original dataset or provide a shortened version of it, to follow the reproducibility principle. I found an aircraft wildlife strikes, 1990-2015 dataset on Kaggle, which I will be using here. Note: you will need a Kaggle account to download the data. It may also be available at data.gov.

Read in the data

library(dplyr)
df <- read.csv("~/../Downloads/database.csv", stringsAsFactors = F)
> df$Species.Name[grepl("Canada goose", df$Species.Name, ignore.case = T)][1]
[1] "CANADA GOOSE"

> names(df)
 [1] "Record.ID"            "Incident.Year"        "Incident.Month"      
 [4] "Incident.Day"         "Operator.ID"          "Operator"            
 [7] "Aircraft"             "Aircraft.Type"        "Aircraft.Make"       
[10] "Aircraft.Model"       "Aircraft.Mass"        "Engine.Make"         
[13] "Engine.Model"         "Engines"              "Engine.Type"         
[16] "Engine1.Position"     "Engine2.Position"     "Engine3.Position"    
[19] "Engine4.Position"     "Airport.ID"           "Airport"             
[22] "State"                "FAA.Region"           "Warning.Issued"      
[25] "Flight.Phase"         "Visibility"           "Precipitation"       
[28] "Height"               "Speed"                "Distance"            
[31] "Species.ID"           "Species.Name"         "Species.Quantity"    
[34] "Flight.Impact"        "Fatalities"           "Injuries"            
[37] "Aircraft.Damage"      "Radome.Strike"        "Radome.Damage"       
[40] "Windshield.Strike"    "Windshield.Damage"    "Nose.Strike"         
[43] "Nose.Damage"          "Engine1.Strike"       "Engine1.Damage"      
[46] "Engine2.Strike"       "Engine2.Damage"       "Engine3.Strike"      
[49] "Engine3.Damage"       "Engine4.Strike"       "Engine4.Damage"      
[52] "Engine.Ingested"      "Propeller.Strike"     "Propeller.Damage"    
[55] "Wing.or.Rotor.Strike" "Wing.or.Rotor.Damage" "Fuselage.Strike"     
[58] "Fuselage.Damage"      "Landing.Gear.Strike"  "Landing.Gear.Damage" 
[61] "Tail.Strike"          "Tail.Damage"          "Lights.Strike"       
[64] "Lights.Damage"        "Other.Strike"         "Other.Damage"        
[67] "totalKills"

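Notice that the species names are all uppercase. Use grepl rather than == unless you are sure you know the exact spelling of the name.

The raw data has no total_kills variable (the totalKills column at the end of the names() output above is one I create below), and the Fatalities variable counts human deaths, so I will ignore that filter variable. I did find Species.Quantity, which may be what you are looking for: the number of animals of that species involved in a single incident.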

> unique(df$Species.Quantity)
[1] "1"        "2-10"     ""         "11-100"   "Over 100"

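For this example, we can convert these values to numbers, using the lower bound of each range. Note that blank entries come out as NA rather than 0 (see the table below), because an empty-string name never matches when indexing a named vector.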

> dictNames <- unique(df$Species.Quantity)
> dict <- c(1, 2, 0, 11, 100)
> names(dict) <- dictNames
> dict['1']
1 
1 
> dict['2-10']
2-10 
   2 
> df <- df %>% mutate(totalKills = dict[Species.Quantity])
> table(df$totalKills, useNA = "always")

     1      2     11    100   <NA> 
146563  21852   1166     46   4477
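
The dict above depends on the order in which unique() happens to return the values. As a minimal sketch of a less fragile variant, the lookup can be written out by name. The vector quantity below is a made-up stand-in for df$Species.Quantity, and treating blanks as 0 is an assumption, not something the data tells us.

```r
# Toy stand-in for df$Species.Quantity (hypothetical values, same categories as above)
quantity <- c("1", "2-10", "", "11-100", "Over 100", "2-10")

# Explicit lookup by name: the same representative value per range as dict above.
# There is deliberately no entry for "", since a "" name cannot be matched anyway.
lower_bound <- c("1" = 1, "2-10" = 2, "11-100" = 11, "Over 100" = 100)

kills <- unname(lower_bound[quantity])  # unmatched entries (the blank) become NA
kills[is.na(kills)] <- 0                # assumption: count blanks as zero kills
kills
```

Because every category is spelled out, reordering the data or gaining a new category cannot silently shift the mapping; an unexpected category would show up as NA before the last step.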

Good, now let's look at your code.

Try your code and find the problem

> df %>% 
+   group_by(State) %>% 
+   filter(Species.Name == "CANADA GOOSE" & totalKills > 1) %>% 
+   mutate(fall_mig_kills = ifelse(Species.Name == "CANADA GOOSE" & 
+                                    Incident.Month %in% c(9,10,11),
+                                  totalKills,
+                                  0)
+          ) %>% 
+   summarise(
+     pct_mig_kills = fall_mig_kills/totalKills
+   )
Error in summarise_impl(.data, dots) : 
  Column `pct_mig_kills` must be length 1 (a summary value), not 19

Hmm, let's see why this happens. Typing ?summarise in the console brings up the help page, which says:

summarise {dplyr}    R Documentation
Reduces multiple values down to a single value

Description

summarise() is typically used on grouped data created by group_by(). The output will have one row for each group.

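OK, so the output will have one row per group. fall_mig_kills/totalKills divides row by row, so within each State it returns one value per incident rather than a single summary value. Because you have grouped by a variable, we need to sum the kills within each group. You may also want to create a new variable, "inSeason", which will let you summarise the data appropriately.

So, to fix your problem, you just need to add sum: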

+   summarise(
+     pct_mig_kills = sum(fall_mig_kills)/sum(totalKills)
+   )
# A tibble: 49 x 2
   State pct_mig_kills
   <chr>         <dbl>
 1          0.70212766
 2    AK    0.50000000
 3    AL    0.00000000
 4    AR    1.00000000
 5    CA    0.06185567
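
To make the difference concrete, here is a small sketch on invented data. The toy data frame and its values are hypothetical; only the column names follow the ones used above. The error wording in this thread comes from an older dplyr release, so newer versions may report a row-by-row expression inside summarise() differently.

```r
library(dplyr)

# Hypothetical toy data: two incidents per state
toy <- data.frame(
  State          = c("AK", "AK", "AL", "AL"),
  totalKills     = c(2, 2, 11, 2),
  fall_mig_kills = c(2, 0, 0, 0)
)

# Row-by-row division gives one value per incident (two per state here),
# which is why it is not a valid summary value:
toy$fall_mig_kills / toy$totalKills

# Summing first collapses each state to a single ratio
res <- toy %>%
  group_by(State) %>%
  summarise(pct_mig_kills = sum(fall_mig_kills) / sum(totalKills))
res
```

Here AK comes out as 2/4 = 0.5 and AL as 0/13 = 0, one row per state.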

Rewrite the code without the error

Now let's change it so it is easier to read. What you care about is the season, not the state.

> df %>%
+   # inSeason = seasons we care about monitoring
+   # totalKills has NA values, we choose to put deaths at 0
+   mutate(inSeason = ifelse(Incident.Month %in% 9:11, "in", "out"),
+          totalKills = ifelse(is.na(totalKills), 0, totalKills)) %>%
+   # Canada geese only
+   filter(grepl("canada goose", Species.Name, ignore.case = T)) %>%
+   # collect data by inSeason
+   group_by(inSeason) %>%
+   # sum them up
+   summarise(totalDead = sum(totalKills)) %>%
+   # add a ratio value
+   mutate(percentDead = round(100*totalDead/sum(totalDead),0))
# A tibble: 2 x 3
  inSeason totalDead percentDead
     <chr>     <dbl>       <dbl>
1       in       838          34
2      out      1620          66

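Now you have in-season versus out-of-season, the total deaths, and the percentage. If you want to break this down by state, add State to your grouping.

One more note: group_by and summarise automatically drop the other columns, so you don't need select at the end.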

> df %>%
+   mutate(inSeason = ifelse(Incident.Month %in% 9:11, "in", "out"),
+          totalKills = ifelse(is.na(totalKills), 0, totalKills)) %>%
+   filter(grepl("canada goose", Species.Name, ignore.case = T)) %>%
+   group_by(State, inSeason) %>%
+   summarise(totalDead = sum(totalKills)) %>%
+   mutate(percentDead = round(100*totalDead/sum(totalDead),0))
# A tibble: 98 x 4
# Groups:   State [51]
   State inSeason totalDead percentDead
   <chr>    <chr>     <dbl>       <dbl>
 1             in        52          52
 2            out        48          48
 3    AB       in         1          50
 4    AB      out         1          50
 5    AK       in        13          33
 6    AK      out        26          67
 7    AL       in         2          40
 8    AL      out         3          60
 9    AR       in         6         100
10    CA       in        13           8
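
The percentages above add up to 100 within each State because summarise() removes only the last grouping variable (inSeason), so the following mutate() still runs per State. A minimal sketch on invented data, assuming a dplyr version that supports the .groups argument (1.0.0 or later), makes that explicit; the toy data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data with the same column names as above
toy <- data.frame(
  State      = c("AK", "AK", "AK", "AL", "AL"),
  inSeason   = c("in", "out", "out", "in", "out"),
  totalKills = c(1, 2, 1, 2, 2)
)

res <- toy %>%
  group_by(State, inSeason) %>%
  # "drop_last" is the default behaviour for this case; spelled out for clarity
  summarise(totalDead = sum(totalKills), .groups = "drop_last") %>%
  # still grouped by State, so sum(totalDead) is the per-state total
  mutate(percentDead = round(100 * totalDead / sum(totalDead), 0))
res
```

With these values AK splits 25/75 and AL splits 50/50. To get each row's share of the overall total instead, call ungroup() before the mutate().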
