在数据框中每月计算观察值

Counting observations per month in a data frame

我目前有一个包含两列的数据框:逮捕日期和逮捕人数。日期列几乎包含 2006 年至 2020 年的每一天;我不想知道每天的逮捕人数,而是每个月、每年的逮捕人数。

数据框将被转换为 xts 对象以进行时间序列分析,因此我需要一个包含年份和月份的结果日期列。

以下是数据集前 6 个月的数据:

    structure(list(ARREST_DATE = structure(c(13149, 13150, 13151, 
13152, 13153, 13154, 13155, 13156, 13157, 13158, 13159, 13160, 
13161, 13162, 13163, 13164, 13165, 13166, 13167, 13168, 13169, 
13170, 13171, 13172, 13173, 13174, 13175, 13176, 13177, 13178, 
13179, 13180, 13181, 13182, 13183, 13184, 13185, 13186, 13187, 
13188, 13189, 13190, 13191, 13192, 13193, 13194, 13195, 13196, 
13197, 13198, 13199, 13200, 13201, 13202, 13203, 13204, 13205, 
13206, 13207, 13208, 13209, 13210, 13211, 13212, 13213, 13214, 
13215, 13216, 13217, 13218, 13219, 13220, 13221, 13222, 13223, 
13224, 13225, 13226, 13227, 13228, 13229, 13230, 13231, 13232, 
13233, 13234, 13235, 13236, 13237, 13238, 13239, 13240, 13241, 
13242, 13243, 13244, 13245, 13246, 13247, 13248, 13249, 13250, 
13251, 13252, 13253, 13254, 13255, 13256, 13257, 13258, 13259, 
13260, 13261, 13262, 13263, 13264, 13265, 13266, 13267, 13268, 
13269, 13270, 13271, 13272, 13273, 13274, 13275, 13276, 13277, 
13278, 13279, 13280, 13281, 13282, 13283, 13284, 13285, 13286, 
13287, 13288, 13289, 13290, 13291, 13292, 13293, 13294, 13295, 
13296, 13297, 13298, 13299, 13300, 13301, 13302, 13303, 13304, 
13305, 13306, 13307, 13308, 13309, 13310, 13311, 13312, 13313, 
13314, 13315, 13316, 13317, 13318, 13319, 13320, 13321, 13322, 
13323, 13324, 13325, 13326, 13327, 13328, 13329), class = "Date"), 
    num_of_arrests = c(550L, 617L, 895L, 1224L, 1379L, 1246L, 
    893L, 635L, 889L, 1316L, 1223L, 1264L, 1258L, 852L, 478L, 
    710L, 1131L, 1190L, 1309L, 1085L, 910L, 704L, 852L, 1278L, 
    1322L, 1250L, 1128L, 967L, 686L, 812L, 998L, 1350L, 1356L, 
    1292L, 1006L, 568L, 867L, 1296L, 1428L, 1327L, 1182L, 821L, 
    233L, 618L, 915L, 1370L, 1391L, 1237L, 992L, 649L, 888L, 
    1167L, 1369L, 1126L, 1071L, 888L, 615L, 831L, 1019L, 1364L, 
    1109L, 1239L, 962L, 720L, 930L, 1233L, 1413L, 1350L, 1258L, 
    1034L, 629L, 954L, 1181L, 1421L, 1332L, 974L, 924L, 680L, 
    958L, 1232L, 1389L, 1289L, 1189L, 931L, 672L, 824L, 1188L, 
    1332L, 1194L, 1005L, 1011L, 653L, 822L, 1252L, 1421L, 1316L, 
    1231L, 902L, 740L, 811L, 1184L, 1362L, 1401L, 1144L, 860L, 
    383L, 775L, 1143L, 1296L, 1271L, 1056L, 729L, 593L, 836L, 
    1264L, 1341L, 1298L, 1127L, 771L, 548L, 908L, 1290L, 1398L, 
    1297L, 1127L, 878L, 663L, 928L, 1258L, 1389L, 1300L, 1135L, 
    937L, 600L, 851L, 1173L, 1366L, 1211L, 958L, 912L, 602L, 
    843L, 1274L, 1368L, 1332L, 1068L, 823L, 589L, 482L, 1076L, 
    1217L, 1194L, 1020L, 822L, 628L, 895L, 1225L, 1116L, 1264L, 
    1254L, 829L, 747L, 911L, 1241L, 1291L, 1267L, 1182L, 924L, 
    438L, 826L, 1228L, 1361L, 1255L, 1095L, 763L, 594L, 860L, 
    1056L, 1157L, 1073L, 898L)), row.names = c(NA, 181L), class = "data.frame")

要获取每个月的逮捕次数,您可以按如下方式进行:使用 lubridate 函数 month()year() 提取月份和年份,将它们分组(年份可以在您的示例中被省略,因为只有 2006 年) 和 summarize() sum().

根据要求,要获得包含年份和月份的列,paste() 将它们放在一起,ungroup(),取消选择辅助列并将 relocate() 年月放在前面。

代码

library(dplyr)
library(lubridate)

result <- data %>% mutate(year = year(ARREST_DATE), month = month(ARREST_DATE)) %>% 
  group_by(year, month) %>% summarise(arrests_per_month = sum(num_of_arrests)) %>% 
  mutate(yearmonth = paste(year, month, sep = "-")) %>%  ungroup() %>% 
  select(-c(year, month)) %>% relocate(yearmonth)

输出

> result
# A tibble: 6 x 2
  yearmonth arrests_per_month
  <chr>                 <int>
1 2006-1                31051
2 2006-2                28872
3 2006-3                33910
4 2006-4                30541
5 2006-5                32253
6 2006-6                30414