How to plot a histogram from a group?
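I am working in RStudio. I have a dataset with 1000k records. It has columns named FINAL_CLASSIFICATION and AGE. In the FINAL_RANKING column the values range from 1 to 7. In this column, people with 1, 2 or 3 are infected with SARS_COVID, while people with 4, 5, 6 or 7 are healthy. I need to make a histogram of the ages of the infected people. As I understand it, I have to make a group containing the ages that match 1, 2 and 3 in the CLASIFICACION_FINAL column, since those are the infected people, and build the histogram from there, but I can't find a way to create or obtain that group.
Can you help me?
I have the following code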
#1)
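# import the data into R
# RECOMMENDATION: use read_csv
library(readr)  # read_csv
library(dplyr)  # sample_n, select, filter, group_by, summarise, %>%
library(purrr)  # map
covid_dataset <- read_csv("Desktop/Course in R/Examples/covid_dataset.csv")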
View(covid_dataset)
#------------------------------------------------------------------------------------------
#2) Extract a random sample of 100k records and assign it into a new variable. From now on work with this dataset
# HINT: use dplyr's sample_n function
sample <- sample_n(covid_dataset, 100000)
# The syntax is sample_n(x, n), where x is the dataset we want to sample from
# and n is the sample size we want.
nrow(sample)
# With nrow() we can confirm that we extracted a sample of 100k rows.
#------------------------------------------------------------------------------------------
#3)Make a statistical summary of the dataset and also show the data types by column.
summary(sample)
#The summary function is the one that gives us the summary statistics.
map(sample, class)
#The map() function gives us the data type of each column, and we can see that
#most columns are numeric.
#-------------------------------------------------------------------------------------------
#4)Filter the rows that are positive for SARS-COVID and calculate the number of records.
## Positive cases are those that in the FINAL_CLASSIFICATION column have 1, 2 or 3.
## To filter the rows, we will make use of the PIPE operator and dplyr's select and filter functions.
#select keeps only the column we need, and filter keeps the rows where
#the FINAL_CLASSIFICATION column is 1, 2 or 3, i.e. SARS-COVID positive results.
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 1
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 2
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 3
# I do them separately to have a better view of the records.
#Now if we want to get them all together we simply do the following
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3)
#This gives us the rows with a value less than or equal to 3, which is the same as
#the rows where the FINAL_CLASSIFICATION column has 1, 2 or 3.
#Now, if we want the number of records, doing it separately, we simply add
#another PIPE operator with the nrow() function, which gives the number of
#rows for each filter.
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1) %>% nrow()
#gives us a result of 1471
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2) %>% nrow()
#gives us a result of 46
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3) %>% nrow()
#Gives us a result of 37703
#If we add the 3 results, we have that the total number of records is
1471+46+37703
#Which gives us 39220
#But it can be simplified by doing it in a straightforward way as follows
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3) %>% nrow()
#And we notice that we get the same result as the previous code.
#In conclusion, we have a total of 39220 positive SARS-COVID cases.
#---------------------------------------------------------------------------------------------
#5)Count the number of null records per column (HINT: Use sapply or map, and is.na)
sapply(sample, function(x) sum(is.na(x)))
#This shows us the number of NA's per column. We notice that the only column
#that has NA's is the DATE_DEF with a total of 95044, this tells us that out of the
#100K data, only approximately 5k data are known for DATE_DEF.
#------------------------------------------------------------------------------------------
#6)
##a)Calculate the mean age of covid infectees.
##b)Make a histogram of the ages of the infected persons.
##c)Make a density plot of the ages of the infected persons
sample %>%
  group_by(FINAL_CLASSIFICATION <= 3) %>%
  summarise(average = mean(AGE))
#Then the average age of the infected (the TRUE row) is 43.9
#Now we make a histogram of the ages of the infected persons
sample %>% group_by(FINAL_CLASSIFICATION <=3, AGE) %>% summarise(count = n())
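My doubt is about the last part. I want to find the average age of the infected people, and I used the group_by code shown there, but I don't know whether it is correct. My doubts also cover the other two questions in #6: I want to know about the histograms and how to plot them.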
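What I gather is that you want to 1. create a variable 'FINAL_CLASSIFICATION' based on the values of 'FINAL_RANKING', 2. summarize the average age for the groups in FINAL_CLASSIFICATION, and 3. create a histogram of the positive cases in FINAL_CLASSIFICATION.
I created a random sample of 100 cases with randomly assumed values for AGE and FINAL_RANKING: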
library(dplyr)
library(ggplot2)
sample <- tibble(FINAL_RANKING = sample(1:7, 100, replace = TRUE), AGE = sample(10:100, 100, replace = TRUE))
sample <- sample %>%
  mutate(
    FINAL_CLASSIFICATION = case_when(
      FINAL_RANKING %in% 1:3 ~ "SARS_COVID_POSITIVE",
      FINAL_RANKING %in% 4:7 ~ "SARS_COVID_NEGATIVE")
  )
sample %>%
  group_by(FINAL_CLASSIFICATION) %>%
  summarize(average_age = mean(AGE))
sample %>%
  filter(FINAL_CLASSIFICATION == "SARS_COVID_POSITIVE") %>%
  ggplot(., aes(x = AGE)) +
  geom_histogram()
This gives the summary output:
# A tibble: 2 x 2
  FINAL_CLASSIFICATION average_age
  <chr>                      <dbl>
1 SARS_COVID_NEGATIVE         51.8
2 SARS_COVID_POSITIVE         58.6
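If the FINAL_CLASSIFICATION column in the real data already holds the numeric codes 1 to 7, as the code in the question suggests, the mean age of the infected can probably be computed without building a label column first. The following is a minimal sketch under that assumption. The `toy` data frame and the name `mean_age_infected` are made up for illustration and stand in for the real sample.

```r
library(dplyr)

# Toy data standing in for the real sample (values are invented).
# Assumption: FINAL_CLASSIFICATION holds numeric codes 1 to 7.
set.seed(1)
toy <- tibble(
  FINAL_CLASSIFICATION = sample(1:7, 200, replace = TRUE),
  AGE = sample(0:100, 200, replace = TRUE)
)

# Keep only the positive cases (codes 1, 2 or 3), then average their ages.
mean_age_infected <- toy %>%
  filter(FINAL_CLASSIFICATION %in% 1:3) %>%
  summarise(average = mean(AGE, na.rm = TRUE))

mean_age_infected
```

Grouping by `FINAL_CLASSIFICATION <= 3`, as in the question, should give the same number in its TRUE row. Filtering first simply drops the FALSE row that is not needed here.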
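And the plot:
[Figure: histogram of AGE for the SARS_COVID_POSITIVE cases]
As stated in the output, you should adjust the bins.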
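Adjusting the bins could look like the sketch below, which also covers the density plot asked for in item 6c of the question. It assumes numeric codes in FINAL_CLASSIFICATION. The `toy` data, the object names and the 5-year bin width are arbitrary choices for illustration, not values taken from the real dataset.

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for the real sample (values are invented).
set.seed(1)
toy <- tibble(
  FINAL_CLASSIFICATION = sample(1:7, 200, replace = TRUE),
  AGE = sample(0:100, 200, replace = TRUE)
)

# Positive cases only (codes 1, 2 or 3).
infected <- toy %>% filter(FINAL_CLASSIFICATION %in% 1:3)

# Histogram with an explicit bin width of 5 years, bins starting at 0.
p_hist <- ggplot(infected, aes(x = AGE)) +
  geom_histogram(binwidth = 5, boundary = 0)

# Density plot of the same ages.
p_density <- ggplot(infected, aes(x = AGE)) +
  geom_density()
```

Typing `p_hist` or `p_density` at the console draws the plot. Setting `binwidth` (or `bins`) explicitly replaces the default of 30 bins that `geom_histogram()` otherwise falls back on.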