How to plot a histogram from a group?

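I am working in RStudio. I have a dataset with 1000k records. It has columns named FINAL_CLASSIFICATION and AGE. The FINAL_RANKING column holds values from 1 to 7: people with 1, 2 or 3 are infected with SARS_COVID, while people with 4, 5, 6 or 7 are healthy. I need to make a histogram of the ages of the infected people. As I understand it, I have to create a group to get the ages that correspond to values 1, 2 and 3 in the CLASIFICACION_FINAL column, since those are the infected people, and build the histogram from that group, but I can't find a way to create the group or to extract it.

Can you help me?

I have the following code: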

#1) 
# import the data into R
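# RECOMMENDATION: use read_csv

library(readr)  # provides read_csv
library(dplyr)  # provides sample_n, select, filter, group_by, summarise and %>%
library(purrr)  # provides map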

covid_dataset <- read_csv("Desktop/Course in R/Examples/covid_dataset.csv")
View(covid_dataset)


#------------------------------------------------------------------------------------------


#2) Extract a random sample of 100k records and assign it into a new variable. From now on work with this dataset
# HINT: use dplyr's sample_n function

sample <- sample_n(covid_dataset, 100000)

# sample_n(x, n) takes the dataset x to sample from and the sample size n
# that we want.

nrow(sample)

# nrow() confirms that we have extracted a 100K sample.


#------------------------------------------------------------------------------------------


#3)Make a statistical summary of the dataset and also show the data types by column.

summary(sample)

#The summary function is the one that gives us the summary statistics. 

map(sample, class)

#The map() function gives us the data type of each column, and we can see that
#most columns are numeric.

#-------------------------------------------------------------------------------------------

#4)Filter the rows that are positive for SARS-COVID and calculate the number of records.
## Positive cases are those that in the FINAL_CLASSIFICATION column have 1, 2 or 3.


## To filter the rows, we use the pipe operator together with dplyr's select and filter functions.
#select picks the column, and filter keeps the rows where
#the FINAL_CLASSIFICATION column is 1, 2 or 3, i.e. SARS-COVID positive results.



sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1) 

# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 1

sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2)

# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 2

sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3)

# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 3



# I do them separately to have a better view of the records.




#Now if we want to get them all together we simply do the following

sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3)

#This gives us the rows with a value less than or equal to 3, which is the same as
#the rows where the FINAL_CLASSIFICATION column has 1, 2 or 3.


#Now, if we want the number of records for each value separately, we simply add
#another pipe step with the nrow() function, which gives the number of rows.

sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1) %>% nrow()

#gives us a result of 1471

sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2) %>% nrow()

#gives us a result of 46

sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3) %>% nrow()

#Gives us a result of 37703


#If we add the 3 results, we have that the total number of records is

1471+46+37703

#Which gives us 39220


#But it can be simplified by doing it in a straightforward way as follows 

sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3) %>% nrow()

#And we notice that we get the same result as the previous code. 

#In conclusion, we have a total of 39220 positive SARS-COVID cases.
#---------------------------------------------------------------------------------------------


#5)Count the number of null records per column (HINT: Use sapply or map, and is.na)


apply(sample, MARGIN = 2, function(x) sum(is.na(x)))

#This shows us the number of NA's per column. We notice that the only column
#that has NA's is DATE_DEF, with a total of 95044. This tells us that out of the
#100K records, only about 5k have a known DATE_DEF.



#------------------------------------------------------------------------------------------

#6)
##a)Calculate the mean age of covid infectees.
##b)Make a histogram of the ages of the infected persons. 
##c)Make a density plot of the ages of the infected persons


sample %>%
  group_by(FINAL_CLASSIFICATION <= 3) %>%
  summarise(average = mean(AGE))


#So the mean age of the infected is 43.9


#Now we make a histogram of the ages of the infected persons

sample %>% group_by(FINAL_CLASSIFICATION <=3, AGE) %>% summarise(count = n())

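The last part is where I have doubts. I want to find the mean age of the infected people, and I used the group_by code shown there, but I don't know whether it is correct. My doubts also cover the other two items in #6: I would like to know about the histograms and how to plot them.

What I gather is that you want to 1. create a variable 'FINAL_CLASSIFICATION' based on the values of 'FINAL_RANKING', 2. summarize the mean age by group in FINAL_CLASSIFICATION, and 3. create a histogram of the positive cases in FINAL_CLASSIFICATION.

I created a random sample of 100 cases with randomly generated AGE and FINAL_RANKING values: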

library(dplyr)
library(ggplot2)

sample <- tibble(FINAL_RANKING = sample(1:7, 100, replace = TRUE), AGE = sample(10:100, 100, replace = TRUE))

sample <- sample %>% 
    mutate(
        FINAL_CLASSIFICATION = case_when(
            FINAL_RANKING %in% 1:3 ~ "SARS_COVID_POSITIVE", 
            FINAL_RANKING %in% 4:7 ~ "SARS_COVID_NEGATIVE")
        ) 
sample %>% 
    group_by(FINAL_CLASSIFICATION) %>% 
    summarize(average_age = mean(AGE))

sample %>% 
    filter(FINAL_CLASSIFICATION == "SARS_COVID_POSITIVE") %>% 
    ggplot(., aes(x = AGE)) + 
    geom_histogram()

This gives the summary output:

# A tibble: 2 x 2
  FINAL_CLASSIFICATION average_age
  <chr>                      <dbl>
1 SARS_COVID_NEGATIVE         51.8
2 SARS_COVID_POSITIVE         58.6

and the plot:

As the output message says, you should adjust the bins.
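
To make the bin adjustment concrete, and to cover the density plot asked for in part 6c, here is a minimal sketch. It assumes the same kind of randomly generated data as above. The names `demo_data`, `positives`, `age_hist` and `age_density` are hypothetical, and the 5-year bin width is an arbitrary choice, not a recommendation for the real dataset. By default `geom_histogram()` uses 30 bins and prints a message suggesting that you pick a better value with `binwidth`; passing `binwidth` explicitly avoids that.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data, generated like the random sample above.
set.seed(1)  # assumption: a fixed seed, only so the sketch is repeatable
demo_data <- tibble(
  FINAL_RANKING = sample(1:7, 100, replace = TRUE),
  AGE = sample(10:100, 100, replace = TRUE)
)

# Keep only the positive cases (codes 1, 2 or 3).
positives <- demo_data %>%
  filter(FINAL_RANKING %in% 1:3)

# Histogram with an explicit bin width of 5 years (arbitrary choice).
age_hist <- ggplot(positives, aes(x = AGE)) +
  geom_histogram(binwidth = 5, boundary = 0, colour = "white") +
  labs(x = "Age", y = "Count")

# Density plot of the same ages.
age_density <- ggplot(positives, aes(x = AGE)) +
  geom_density() +
  labs(x = "Age", y = "Density")

print(age_hist)
print(age_density)
```

With the real data, the same two plots should work by replacing `positives` with the filtered rows of your own sample. If the 1 to 7 codes are stored in FINAL_CLASSIFICATION, as in the question's code, filter on that column instead.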