dplyr find length of specific row
I am trying to find the group length and the rank of each row within its group. I use dplyr for both the length and the rank.
g.rank <- sample.df %>%
group_by(retweet_id_str) %>%
mutate(rank=row_number())
g.length <- sample.df %>%
group_by(retweet_id_str) %>%
summarise(length = n())
test <- merge(g.rank,g.length, by="retweet_id_str")
Result:
retweet_id_str screen_name retweet_screen_name tweet_created_at rank length
1 4.478178e+17 eyyupaluclu GuneseYuruyen 2015-06-07 16:30:34 1 1
2 4.504073e+17 eyyupaluclu melikemumcuoglu 2015-06-07 16:30:00 1 1
3 5.489578e+17 hadi_elis dr_capa 2015-06-05 09:23:09 1 2
4 5.489578e+17 BozanHalit dr_capa 2015-06-05 09:33:56 2 2
5 5.552862e+17 cevatdemiral haber3com 2015-06-21 00:54:09 1 3
6 5.552862e+17 cevatdemiral haber3com 2015-06-21 23:59:04 2 3
7 5.552862e+17 cevatdemiral haber3com 2015-06-22 21:54:55 3 3
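In my data set, users in screen_name can repeat, so if a user appears in different groups, I need to calculate this:
user = [group1(length/rank) + group2(length/rank)] / total group number
* Each unique "retweet_id_str" is one group.
Example:
- One user, two different groups:
eyyupaluclu =[group1(1/1)+group2(1/1)]/2 = 1
How can I do this?
Thanks in advance for any help.
PS: If my question is unclear, please let me know and I will clarify it.
Data sample: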
sample.df <-structure(list(screen_name = c("eyyupaluclu", "eyyupaluclu",
"hadi_elis", "BozanHalit", "cevatdemiral", "cevatdemiral", "cevatdemiral",
"hadi_elis", "xtutunusx", "hadi_elis", "umutsu15", "OkanBoyner",
"BayarTun", "Ayhan34Isikli", "JindaAxin", "JindaAxin", "JindaAxin",
"JindaAxin", "OtoTeski", "b8f767a3022b4ee", "OtoTeski", "b8f767a3022b4ee",
"OtoTeski", "b8f767a3022b4ee", "OtoTeski", "OtoTeski", "OtoTeski",
"ankakusu1963", "ankakusu1963", "ankakusu1963", "cengizbayel",
"tarlaci5334", "sehven55", "cengizbayel", "ErdogduKezban", "Ayhan34Isikli",
"Ayhan34Isikli", "melekaydinkocak", "IrtegunUgur", "IrtegunUgur",
"melekaydinkocak", "AKinonuatasehir", "RTESLM", "vardar_filiz",
"IrtegunUgur", "AksemsettinMh", "glcihansnmezer", "esesli_murat",
"huseyinvarlik26", "ahmetkaraman001"),
retweet_screen_name = c("GuneseYuruyen","melikemumcuoglu", "dr_capa", "dr_capa", "haber3com", "haber3com",
"haber3com", "GeorgetownDG", "19811923_", "meforum", "BasarKurtan",
"SBELBULUT", "tuncaybayar52", "Akkadinistanbul", "medyayakurdi",
"medyayakurdi", "medyayakurdi", "medyayakurdi", "twit_komedyeni",
"twit_komedyeni", "twit_komedyeni", "twit_komedyeni", "twit_komedyeni",
"twit_komedyeni", "twit_komedyeni", "twit_komedyeni", "twit_komedyeni",
"mr_dogan", "mr_dogan", "mr_dogan", "memetsimsek", "DevletBaskanRTE",
"ErkanGuven", "memetsimsek", "Akkadinistanbul", "Akkadinistanbul",
"Akkadinistanbul", "_aliuzun", "akgencistanbul", "akgenc_kadikoy",
"Hocazade_", "atasehirakparti", "oguz__kaya", "oguz__kaya", "AkTanitimMedya",
"AkTanitimMedya", "AkTanitimMedya", "GencAkparti26", "GencAkparti26",
"GencAkparti26"),
tweet_created_at = c("2015-06-07 16:30:34", "2015-06-07 16:30:00", "2015-06-05 09:23:09", "2015-06-05 09:33:56",
"2015-06-21 00:54:09", "2015-06-21 23:59:04", "2015-06-22 21:54:55",
"2015-05-18 23:05:59", "2015-06-03 06:17:24", "2015-05-31 13:48:10",
"2015-05-28 12:18:45", "2015-05-28 17:01:07", "2015-06-03 16:48:57",
"2015-05-09 07:19:29", "2015-05-09 07:36:41", "2015-05-09 07:36:46",
"2015-05-09 07:36:50", "2015-05-09 07:36:52", "2015-05-14 09:43:00",
"2015-06-13 05:19:03", "2015-05-14 09:42:39", "2015-06-13 05:18:48",
"2015-05-14 09:42:42", "2015-06-13 05:18:54", "2015-05-14 09:42:50",
"2015-05-14 09:42:47", "2015-05-14 09:42:53", "2015-05-17 23:06:16",
"2015-05-17 23:05:08", "2015-05-17 23:04:56", "2015-05-09 16:32:06",
"2015-05-09 17:35:28", "2015-05-08 03:14:29", "2015-05-09 16:31:50",
"2015-05-08 00:24:57", "2015-05-09 07:17:42", "2015-05-09 07:17:38",
"2015-05-16 19:29:58", "2015-05-08 07:15:22", "2015-05-08 07:15:18",
"2015-05-16 19:29:25", "2015-05-08 03:21:30", "2015-05-14 06:50:50",
"2015-05-14 06:54:07", "2015-05-08 07:14:13", "2015-05-09 17:41:35",
"2015-05-09 17:58:56", "2015-05-08 04:59:54", "2015-05-09 02:34:12",
"2015-05-10 07:38:01"),
retweet_id_str = c(447817783829860352, 450407343604629504, 548957776895303680, 548957776895303680, 555286212916035584,
555286212916035584, 555286212916035584, 561187125438451712, 573054097726840832,
584809040380760064, 587723931919986688, 588382883733176320, 592766311387832320,
593336106013347840, 593453258716409856, 593453420641652736, 593453598975119360,
593453994799935488, 594907386885836800, 594907386885836800, 594907487125577728,
594907487125577728, 594907617866166272, 594907617866166272, 594907731506667520,
594907807331254272, 594907964017881088, 594957981340532736, 594961968521420800,
594964791598387200, 595130224523743232, 595160203596865536, 595176402967777280,
595183002243719168, 595211840055078912, 595211840055078912, 595211943088009216,
595212869400186880, 595212974190678016, 595213691026591744, 595213757216858112,
595213790863568896, 595214683541544960, 595214683541544960, 595214727321677824,
595214727321677824, 595214727321677824, 595214737861804032, 595214737861804032,
595214737861804032)),
.Names = c("screen_name", "retweet_screen_name",
"tweet_created_at", "retweet_id_str"), class = c("tbl_df", "data.frame"),
row.names = c(NA, -50L))
It's simple. It looks like you want the mean of length/rank for each screen_name:
test %>% mutate(l_over_r = length / rank) %>%
group_by(screen_name) %>%
summarize(user = mean(l_over_r))
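To see what this pipeline computes, here is a minimal sketch on a small made-up table. The data frame `toy_test` and its values are assumptions for illustration only; it just mimics the columns of `test`. Note that the mean is taken over a user's rows.

```r
library(dplyr)

# Hypothetical stand-in for `test`: group g1 has one row, group g2 has three.
toy_test <- data.frame(
  retweet_id_str = c("g1", "g2", "g2", "g2"),
  screen_name    = c("a",  "b",  "a",  "c"),
  rank           = c(1, 1, 2, 3),
  length         = c(1, 3, 3, 3),
  stringsAsFactors = FALSE
)

toy_result <- toy_test %>%
  mutate(l_over_r = length / rank) %>%   # per-row ratio
  group_by(screen_name) %>%
  summarize(user = mean(l_over_r))       # average over each user's rows

# User "a": (1/1 + 3/2) / 2 = 1.25; user "b": 3/1 = 3; user "c": 3/3 = 1
print(toy_result)
```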
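If a user is in only one group, the mean is of course that same value. If a user is in 2 groups, then (group1(l_over_r) + group2(l_over_r)) / 2 is the mean, so this generalizes nicely. If you really want to compute this only for users in exactly two groups, you can pre-filter: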
test %>% mutate(l_over_r = length / rank) %>%
group_by(screen_name) %>%
filter(n() == 2) %>%
summarize(user = mean(l_over_r))
# Source: local data frame [3 x 2]
#
# screen_name user
# (chr) (dbl)
# 1 cengizbayel 1
# 2 eyyupaluclu 1
# 3 melekaydinkocak 1
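One caveat worth checking on your own data: `n() == 2` counts rows per user, so a user with two rows in the same group would also pass the filter. If "exactly two groups" should mean two distinct `retweet_id_str` values, a possible variant uses `n_distinct()`. The sketch below runs on a made-up table (`toy_test` is hypothetical), not on the data from the question.

```r
library(dplyr)

# Hypothetical stand-in for `test`:
# "a" is in two groups, "b" has two rows in ONE group, "c" has a single row.
toy_test <- data.frame(
  retweet_id_str = c("g1", "g2", "g3", "g3", "g4"),
  screen_name    = c("a",  "a",  "b",  "b",  "c"),
  rank           = c(1, 1, 1, 2, 1),
  length         = c(1, 1, 2, 2, 1),
  stringsAsFactors = FALSE
)

two_groups <- toy_test %>%
  mutate(l_over_r = length / rank) %>%
  group_by(screen_name) %>%
  filter(n_distinct(retweet_id_str) == 2) %>%  # distinct groups, not rows
  summarize(user = mean(l_over_r))

# Only "a" remains; filter(n() == 2) would also have kept "b".
print(two_groups)
```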
As a side note, since you are using dplyr, you should get in the habit of using left_join instead of merge, e.g. test <- left_join(g.rank, g.length, by="retweet_id_str")
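As a further sketch, the join can probably be avoided altogether, because `n()` also works inside a grouped `mutate()`. The example below uses a made-up table (`toy` is hypothetical) to compare the two-step approach with a single grouped `mutate()`.

```r
library(dplyr)

# Hypothetical input with the two relevant columns
toy <- data.frame(
  retweet_id_str = c("g1", "g1", "g2"),
  screen_name    = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

# Two-step version, as in the question, joined with left_join()
g_rank   <- toy %>% group_by(retweet_id_str) %>% mutate(rank = row_number())
g_length <- toy %>% group_by(retweet_id_str) %>% summarise(length = n())
joined   <- left_join(g_rank, g_length, by = "retweet_id_str")

# One-step version: rank and group length in the same grouped mutate()
one_step <- toy %>%
  group_by(retweet_id_str) %>%
  mutate(rank = row_number(), length = n()) %>%
  ungroup()

print(one_step)
```

Both versions give the same rank and length columns here. Unlike `merge`, which sorts by the key by default, `left_join` keeps the row order of its first argument.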