summarise: divide two columns as a percent
I am unable to find the percent of Canada geese killed during migration season using the Airplane Strikes dataset.
#airline stats table
airlines <- sd4 %>%
  group_by(STATE) %>%
  filter(SPECIES == "Canada goose" & total_kills > 1) %>%
  mutate(fall_mig_kills = ifelse(SPECIES == "Canada goose" & INCIDENT_MONTH %in% c(9,10,11), total_kills, 0)) %>%
  summarise(
    pct_mig_kills = fall_mig_kills/total_kills
  ) %>%
  select(STATE, SPECIES, INCIDENT_MONTH, total_kills, fall_mig_kills, pct_mig_kills)
This is where I receive the error:
summarise(
pct_mig_kills = fall_mig_kills/total_kills
)
The error is:
Error in summarise_impl(.data, dots) :
Column `pct_mig_kills` must be length 1 (a summary value), not 10
I am not sure how I am getting a value longer than length 1 when dividing two integer columns.
Any help would be greatly appreciated!
Benjamin
Let's read in the data, document everything, and see where your error occurs.
In general, you should include a link to your original dataset, or provide a shortened version of it, to follow the reproducibility principle. I found an aircraft wildlife strikes, 1990-2015 dataset on Kaggle, which I will be using here. Note: You will need to have a Kaggle account to download the data. It may also be available at data.gov.
Read in the data
library(dplyr)
df <- read.csv("~/../Downloads/database.csv", stringsAsFactors = FALSE)
> df$Species.Name[grepl("Canada goose", df$Species.Name, ignore.case = T)][1]
[1] "CANADA GOOSE"
> names(df)
[1] "Record.ID" "Incident.Year" "Incident.Month"
[4] "Incident.Day" "Operator.ID" "Operator"
[7] "Aircraft" "Aircraft.Type" "Aircraft.Make"
[10] "Aircraft.Model" "Aircraft.Mass" "Engine.Make"
[13] "Engine.Model" "Engines" "Engine.Type"
[16] "Engine1.Position" "Engine2.Position" "Engine3.Position"
[19] "Engine4.Position" "Airport.ID" "Airport"
[22] "State" "FAA.Region" "Warning.Issued"
[25] "Flight.Phase" "Visibility" "Precipitation"
[28] "Height" "Speed" "Distance"
[31] "Species.ID" "Species.Name" "Species.Quantity"
[34] "Flight.Impact" "Fatalities" "Injuries"
[37] "Aircraft.Damage" "Radome.Strike" "Radome.Damage"
[40] "Windshield.Strike" "Windshield.Damage" "Nose.Strike"
[43] "Nose.Damage" "Engine1.Strike" "Engine1.Damage"
[46] "Engine2.Strike" "Engine2.Damage" "Engine3.Strike"
[49] "Engine3.Damage" "Engine4.Strike" "Engine4.Damage"
[52] "Engine.Ingested" "Propeller.Strike" "Propeller.Damage"
[55] "Wing.or.Rotor.Strike" "Wing.or.Rotor.Damage" "Fuselage.Strike"
[58] "Fuselage.Damage" "Landing.Gear.Strike" "Landing.Gear.Damage"
[61] "Tail.Strike" "Tail.Damage" "Lights.Strike"
[64] "Lights.Damage" "Other.Strike" "Other.Damage"
[67] "totalKills"
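Note that the species names are all in capital letters. Use grepl rather than == unless you are sure you know the exact name, character for character.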
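To make the difference concrete, here is a minimal sketch on a made-up vector (the values are illustrative and not taken from the dataset). It compares an exact `==` test with a case-insensitive `grepl` match. Keep in mind that `grepl` matches substrings, so it may also pick up longer names that merely contain the pattern; anchoring the pattern as `"^canada goose$"` is one way to avoid that.

```r
# Toy vector standing in for df$Species.Name (illustrative values only)
species <- c("CANADA GOOSE", "Canada goose", "MALLARD", "canada goose")

# Exact comparison is case-sensitive: only the identical string matches
exact_match <- species == "Canada goose"

# grepl() with ignore.case = TRUE matches every capitalisation
pattern_match <- grepl("canada goose", species, ignore.case = TRUE)

print(exact_match)
print(pattern_match)
```

With `==`, the all-caps spelling used in this dataset would silently produce zero rows, which is easy to miss inside a longer pipeline.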
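There is no total_kills variable, and the Fatalities variable counts human deaths, so I will leave that variable out of the filter. I did find Species.Quantity, which may be what you are looking for: the number of animals of the species killed in a single incident.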
> unique(df$Species.Quantity)
[1] "1" "2-10" "" "11-100" "Over 100"
For the sake of this example, we can convert these ranges to numbers, using the lower bound of each range.
> dictNames <- unique(df$Species.Quantity)
> dict <- c(1, 2, 0, 11, 100)
> names(dict) <- dictNames
> dict['1']
1
1
> dict['2-10']
2-10
2
> df <- df %>% mutate(totalKills = dict[Species.Quantity])
> table(df$totalKills, useNA = "always")
1 2 11 100 <NA>
146563 21852 1166 46 4477
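The table above shows NA where one might expect 0 for the blank quantity. That is because, when a named vector is indexed by character, an empty string never matches any name, so the lookup for "" returns NA. A minimal sketch on made-up values, assuming the same lower-bound mapping as above, with the handling of blanks made explicit:

```r
# Toy vector standing in for df$Species.Quantity (illustrative values only)
quantity <- c("1", "2-10", "", "11-100", "Over 100", "2-10")

# Lookup table: lower bound of each bucket, as in the answer above
dict <- c("1" = 1, "2-10" = 2, "11-100" = 11, "Over 100" = 100)

# unname() drops the names carried over from the lookup
kills <- unname(dict[quantity])
print(kills)   # the "" entry has no match and becomes NA

# Make the choice for missing quantities explicit (here: treat as 0)
kills[is.na(kills)] <- 0
print(kills)
```

Whether a blank quantity should count as 0 or be excluded is a judgment call about the data, not something the lookup decides for you.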
Great, now let's look at your code.
Trying your code and finding the problem
> df %>%
+ group_by(State) %>%
+ filter(Species.Name == "CANADA GOOSE" & totalKills > 1) %>%
+ mutate(fall_mig_kills = ifelse(Species.Name == "CANADA GOOSE" &
+ Incident.Month %in% c(9,10,11),
+ totalKills,
+ 0)
+ ) %>%
+ summarise(
+ pct_mig_kills = fall_mig_kills/totalKills
+ )
Error in summarise_impl(.data, dots) :
Column `pct_mig_kills` must be length 1 (a summary value), not 19
Hmm, let's see why that is. Typing ?summarise in the console brings up the help page, which says:
summarise {dplyr} R Documentation
Reduces multiple values down to a single value
Description
summarise() is typically used on grouped data created by group_by().
The output will have one row for each group.
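OK, so the output will have one row for each group. Your expression fall_mig_kills/totalKills divides row by row, so it returns one ratio per row in the group (19 values in the group that triggered the error here) instead of a single value. Since you have grouped by one variable, we need to sum the kills within each group before dividing. In addition, you may want to create a new variable "inSeason", which will let you summarise the data appropriately.
So, to fix your issue, you just need to add sum: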
+ summarise(
+ pct_mig_kills = sum(fall_mig_kills)/sum(totalKills)
+ )
# A tibble: 49 x 2
State pct_mig_kills
<chr> <dbl>
1 0.70212766
2 AK 0.50000000
3 AL 0.00000000
4 AR 1.00000000
5 CA 0.06185567
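The same fix can be checked on a small made-up data frame (the column `Month` and all values are hypothetical stand-ins). The sketch below wraps both columns in `sum()` so that each group collapses to one number. Note that the exact failure without `sum()` depends on the dplyr version: the release used in this answer raises the error shown, while, as far as I know, later releases return several rows per group or emit a deprecation warning instead.

```r
library(dplyr)

# Hypothetical toy data: one row per incident
toy <- data.frame(
  State      = c("AK", "AK", "CA", "CA", "CA"),
  Month      = c(9, 3, 10, 11, 6),
  totalKills = c(2, 2, 11, 2, 2)
)

result <- toy %>%
  mutate(fall_mig_kills = ifelse(Month %in% 9:11, totalKills, 0)) %>%
  group_by(State) %>%
  # sum() collapses each group's column to a single number
  summarise(pct_mig_kills = sum(fall_mig_kills) / sum(totalKills))

print(result)
```

Each state ends up with exactly one row, which is what summarise expects.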
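Rewriting the code without errors
Now let's change this so it is a bit easier to read. What you care about is the season, not the state.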
> df %>%
+ # inSeason = seasons we care about monitoring
+ # totalKills has NA values, we choose to put deaths at 0
+ mutate(inSeason = ifelse(Incident.Month %in% 9:11, "in", "out"),
+ totalKills = ifelse(is.na(totalKills), 0, totalKills)) %>%
+ # canadian geese only
+ filter(grepl("canada goose", Species.Name, ignore.case = T)) %>%
+ # collect data by inSeason
+ group_by(inSeason) %>%
+ # sum them up
+ summarise(totalDead = sum(totalKills)) %>%
+ # add a ratio value
+ mutate(percentDead = round(100*totalDead/sum(totalDead),0))
# A tibble: 2 x 3
inSeason totalDead percentDead
<chr> <dbl> <dbl>
1 in 838 34
2 out 1620 66
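Now you have in-season versus out-of-season rows, total deaths, and percentages. If you want to break this down by State, add that variable to your grouping.
Another note: group_by and summarise keep only the grouping columns and the summaries you define, so every other column is dropped automatically. You do not need the select at the end; selecting columns that no longer exist would itself raise an error.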
> df %>%
+ mutate(inSeason = ifelse(Incident.Month %in% 9:11, "in", "out"),
+ totalKills = ifelse(is.na(totalKills), 0, totalKills)) %>%
+ filter(grepl("canada goose", Species.Name, ignore.case = T)) %>%
+ group_by(State, inSeason) %>%
+ summarise(totalDead = sum(totalKills)) %>%
+ mutate(percentDead = round(100*totalDead/sum(totalDead),0))
# A tibble: 98 x 4
# Groups: State [51]
State inSeason totalDead percentDead
<chr> <chr> <dbl> <dbl>
1 in 52 52
2 out 48 48
3 AB in 1 50
4 AB out 1 50
5 AK in 13 33
6 AK out 26 67
7 AL in 2 40
8 AL out 3 60
9 AR in 6 100
10 CA in 13 8
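One detail worth spelling out in the output above: the "# Groups: State" line shows that summarise removed only the last grouping variable (inSeason), so the result is still grouped by State. That is why sum(totalDead) inside the final mutate is a per-state total and the percentages add up to 100 within each state. A minimal sketch on made-up data (the `Airport` column and all values are hypothetical); newer dplyr versions may also print a message about the resulting grouping:

```r
library(dplyr)

# Hypothetical toy data; Airport is an extra column that summarise will drop
toy <- data.frame(
  State      = c("AK", "AK", "AK", "CA", "CA"),
  Airport    = c("A1", "A2", "A3", "C1", "C2"),
  inSeason   = c("in", "out", "out", "in", "out"),
  totalKills = c(1, 2, 1, 11, 2)
)

by_state <- toy %>%
  group_by(State, inSeason) %>%
  # one row per State/inSeason pair; result stays grouped by State
  summarise(totalDead = sum(totalKills)) %>%
  # sum(totalDead) is therefore computed within each State
  mutate(percentDead = round(100 * totalDead / sum(totalDead), 0))

print(by_state)
print(group_vars(by_state))
```

If you wanted each row as a percentage of the overall total instead, you would call ungroup() before the final mutate.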