dplyr find length of specific row

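I am trying to find the length of each group and the rank of each row within its group. I am using dplyr for both the length and the rank.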

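library(dplyr)

# rank of each row within its retweet_id_str group
g.rank <- sample.df %>%
        group_by(retweet_id_str) %>%
        mutate(rank = row_number())
# number of rows in each retweet_id_str group
g.length <- sample.df %>%
        group_by(retweet_id_str) %>%
        summarise(length = n())
test <- merge(g.rank, g.length, by = "retweet_id_str")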

Result:

retweet_id_str     screen_name retweet_screen_name    tweet_created_at rank length
1    4.478178e+17     eyyupaluclu       GuneseYuruyen 2015-06-07 16:30:34    1      1
2    4.504073e+17     eyyupaluclu     melikemumcuoglu 2015-06-07 16:30:00    1      1
3    5.489578e+17       hadi_elis             dr_capa 2015-06-05 09:23:09    1      2
4    5.489578e+17      BozanHalit             dr_capa 2015-06-05 09:33:56    2      2
5    5.552862e+17    cevatdemiral           haber3com 2015-06-21 00:54:09    1      3
6    5.552862e+17    cevatdemiral           haber3com 2015-06-21 23:59:04    2      3
7    5.552862e+17    cevatdemiral           haber3com 2015-06-22 21:54:55    3      3

In my dataset, users in screen_name can repeat, so if a user appears in several different groups, I need to calculate this:

user = [group1(length/rank) + group2(length/rank)] / total number of groups
* Each unique "retweet_id_str" is one group.

Example:

- One user, two different groups:
eyyupaluclu = [group1(1/1) + group2(1/1)] / 2 = 1

How can I do this?

Thanks in advance for any help.

PS: If my question is unclear, please let me know and I will clarify.

Sample data:

sample.df <-structure(list(screen_name = c("eyyupaluclu", "eyyupaluclu", 
                                           "hadi_elis", "BozanHalit", "cevatdemiral", "cevatdemiral", "cevatdemiral", 
                                           "hadi_elis", "xtutunusx", "hadi_elis", "umutsu15", "OkanBoyner", 
                                           "BayarTun", "Ayhan34Isikli", "JindaAxin", "JindaAxin", "JindaAxin", 
                                           "JindaAxin", "OtoTeski", "b8f767a3022b4ee", "OtoTeski", "b8f767a3022b4ee", 
                                           "OtoTeski", "b8f767a3022b4ee", "OtoTeski", "OtoTeski", "OtoTeski", 
                                           "ankakusu1963", "ankakusu1963", "ankakusu1963", "cengizbayel", 
                                           "tarlaci5334", "sehven55", "cengizbayel", "ErdogduKezban", "Ayhan34Isikli", 
                                           "Ayhan34Isikli", "melekaydinkocak", "IrtegunUgur", "IrtegunUgur", 
                                           "melekaydinkocak", "AKinonuatasehir", "RTESLM", "vardar_filiz", 
                                           "IrtegunUgur", "AksemsettinMh", "glcihansnmezer", "esesli_murat", 
                                           "huseyinvarlik26", "ahmetkaraman001"), 
                           retweet_screen_name = c("GuneseYuruyen","melikemumcuoglu", "dr_capa", "dr_capa", "haber3com", "haber3com",
                                                   "haber3com", "GeorgetownDG", "19811923_", "meforum", "BasarKurtan", 
                                                   "SBELBULUT", "tuncaybayar52", "Akkadinistanbul", "medyayakurdi",
                                                   "medyayakurdi", "medyayakurdi", "medyayakurdi", "twit_komedyeni", 
                                                   "twit_komedyeni", "twit_komedyeni", "twit_komedyeni", "twit_komedyeni",
                                                   "twit_komedyeni", "twit_komedyeni", "twit_komedyeni", "twit_komedyeni", 
                                                   "mr_dogan", "mr_dogan", "mr_dogan", "memetsimsek", "DevletBaskanRTE", 
                                                   "ErkanGuven", "memetsimsek", "Akkadinistanbul", "Akkadinistanbul", 
                                                   "Akkadinistanbul", "_aliuzun", "akgencistanbul", "akgenc_kadikoy", 
                                                   "Hocazade_", "atasehirakparti", "oguz__kaya", "oguz__kaya", "AkTanitimMedya", 
                                                   "AkTanitimMedya", "AkTanitimMedya", "GencAkparti26", "GencAkparti26", 
                                                   "GencAkparti26"),
                           tweet_created_at = c("2015-06-07 16:30:34", "2015-06-07 16:30:00", "2015-06-05 09:23:09", "2015-06-05 09:33:56", 
                                                "2015-06-21 00:54:09", "2015-06-21 23:59:04", "2015-06-22 21:54:55", 
                                                "2015-05-18 23:05:59", "2015-06-03 06:17:24", "2015-05-31 13:48:10", 
                                                "2015-05-28 12:18:45", "2015-05-28 17:01:07", "2015-06-03 16:48:57", 
                                                "2015-05-09 07:19:29", "2015-05-09 07:36:41", "2015-05-09 07:36:46", 
                                                "2015-05-09 07:36:50", "2015-05-09 07:36:52", "2015-05-14 09:43:00", 
                                                "2015-06-13 05:19:03", "2015-05-14 09:42:39", "2015-06-13 05:18:48", 
                                                "2015-05-14 09:42:42", "2015-06-13 05:18:54", "2015-05-14 09:42:50", 
                                                "2015-05-14 09:42:47", "2015-05-14 09:42:53", "2015-05-17 23:06:16", 
                                                "2015-05-17 23:05:08", "2015-05-17 23:04:56", "2015-05-09 16:32:06", 
                                                "2015-05-09 17:35:28", "2015-05-08 03:14:29", "2015-05-09 16:31:50", 
                                                "2015-05-08 00:24:57", "2015-05-09 07:17:42", "2015-05-09 07:17:38", 
                                                "2015-05-16 19:29:58", "2015-05-08 07:15:22", "2015-05-08 07:15:18", 
                                                "2015-05-16 19:29:25", "2015-05-08 03:21:30", "2015-05-14 06:50:50", 
                                                "2015-05-14 06:54:07", "2015-05-08 07:14:13", "2015-05-09 17:41:35", 
                                                "2015-05-09 17:58:56", "2015-05-08 04:59:54", "2015-05-09 02:34:12", 
                                                "2015-05-10 07:38:01"),
                           retweet_id_str = c(447817783829860352, 450407343604629504, 548957776895303680, 548957776895303680, 555286212916035584, 
                                              555286212916035584, 555286212916035584, 561187125438451712, 573054097726840832, 
                                              584809040380760064, 587723931919986688, 588382883733176320, 592766311387832320, 
                                              593336106013347840, 593453258716409856, 593453420641652736, 593453598975119360, 
                                              593453994799935488, 594907386885836800, 594907386885836800, 594907487125577728, 
                                              594907487125577728, 594907617866166272, 594907617866166272, 594907731506667520, 
                                              594907807331254272, 594907964017881088, 594957981340532736, 594961968521420800, 
                                              594964791598387200, 595130224523743232, 595160203596865536, 595176402967777280, 
                                              595183002243719168, 595211840055078912, 595211840055078912, 595211943088009216, 
                                              595212869400186880, 595212974190678016, 595213691026591744, 595213757216858112, 
                                              595213790863568896, 595214683541544960, 595214683541544960, 595214727321677824, 
                                              595214727321677824, 595214727321677824, 595214737861804032, 595214737861804032, 
                                              595214737861804032)), 
                      .Names = c("screen_name", "retweet_screen_name", 
                                 "tweet_created_at", "retweet_id_str"), class = c("tbl_df", "data.frame"), 
                      row.names = c(NA, -50L))

Pretty simple. It looks like you want, for each screen_name, the mean of length/rank:

test %>% mutate(l_over_r = length / rank) %>%
    group_by(screen_name) %>%
    summarize(user = mean(l_over_r))

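If a user is in only one group, the mean is of course that same value. If the user is in 2 groups, then (group1(l_over_r) + group2(l_over_r)) / 2 is the average (mean), and this generalizes nicely to any number of groups. If you really want to calculate this only for users in exactly two groups, you can pre-filter: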

test %>% mutate(l_over_r = length / rank) %>%
    group_by(screen_name) %>%
    filter(n() == 2) %>%
    summarize(user = mean(l_over_r))

# Source: local data frame [3 x 2]
#
#       screen_name  user
#             (chr) (dbl)
# 1     cengizbayel     1
# 2     eyyupaluclu     1
# 3 melekaydinkocak     1
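
One caveat on the pre-filter: n() counts rows, so filter(n() == 2) matches "exactly two groups" only when a user never appears more than once in the same group. The following is a minimal sketch of a stricter variant that counts distinct groups with n_distinct() instead. The data frame `toy` and its values are hypothetical (they are not the asker's data), and whether the mean should be taken over rows or over groups is an assumption here, since the question does not say how repeated appearances within one group should be treated.

```r
library(dplyr)

# Hypothetical toy data: user "a" appears twice in group 1 and once in group 2,
# user "b" appears once in groups 1 and 3, user "c" appears only in group 4.
toy <- data.frame(
  screen_name    = c("a", "a", "a", "b", "b", "c"),
  retweet_id_str = c(1, 1, 2, 1, 3, 4),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(retweet_id_str) %>%
  mutate(rank = row_number(), length = n()) %>%  # rank and size within each group
  ungroup() %>%
  mutate(l_over_r = length / rank) %>%
  group_by(screen_name) %>%
  filter(n_distinct(retweet_id_str) == 2) %>%    # users in exactly two distinct groups
  summarize(user = mean(l_over_r))               # mean over that user's rows

print(res)
```

With this toy data, "a" has three rows but only two distinct groups, so it is kept here while filter(n() == 2) would drop it; "c" is dropped by both versions.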

As a side note, since you are using dplyr, you should get in the habit of using left_join instead of merge, e.g. test <- left_join(g.rank, g.length, by = "retweet_id_str").
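
The join can also be avoided altogether, because n() works inside a grouped mutate just as row_number() does. The sketch below compares the two routes on a small hypothetical data frame (`toy` is made up for illustration, not taken from the question); it is one possible simplification, not the only way to do it.

```r
library(dplyr)

# Hypothetical toy data with two groups of sizes 2 and 1
toy <- data.frame(
  retweet_id_str = c(1, 1, 2),
  screen_name    = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

# Two-step route from the question, joined with left_join instead of merge
g.rank <- toy %>%
  group_by(retweet_id_str) %>%
  mutate(rank = row_number())
g.length <- toy %>%
  group_by(retweet_id_str) %>%
  summarise(length = n())
joined <- left_join(g.rank, g.length, by = "retweet_id_str")

# Single-pass route: rank and group size computed in one grouped mutate
single <- toy %>%
  group_by(retweet_id_str) %>%
  mutate(rank = row_number(), length = n()) %>%
  ungroup()

print(single)
```

Both routes should give the same rank and length columns; the single-pass version simply skips the intermediate objects.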