Compute pairwise similarity over time

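I am trying to compute pairwise similarity over time between accounts that use similar hashtags.

I have code (below) that gives me the pairwise similarity between accounts based on the 300 most recent tweets sent by each account. However, I would like to compute the pairwise similarity between accounts for specific time periods (day, week, month). How can I do that?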

library(rtweet)
library(widyr)
library(tidyverse)

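# Find 10 accounts matching "rstats"
rstats <- search_users("rstats", n = 10)

# Fetch the 300 most recent tweets from each account
rstats_tmls <- get_timeline(rstats$user_id, n = 300)

# Count hashtag use per account, then compute pairwise similarity
rstats_tmls %>%
   unnest(hashtags) %>%
   count(user_id, hashtags) %>%
   pairwise_similarity(user_id, hashtags, n, sort = TRUE, upper = FALSE)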


# A tibble: 45 x 3
   item1               item2              similarity
   <chr>               <chr>                   <dbl>
 1 2170413740          792007388358410240      1.00 
 2 2170413740          961691888939126784      1.00 
 3 792007388358410240  961691888939126784      1.00 
 4 1153678152838852614 2170413740              1.00 
 5 1153678152838852614 792007388358410240      1.00 
 6 1153678152838852614 961691888939126784      1.00 
 7 2170413740          824037040996098049      0.998
 8 792007388358410240  824037040996098049      0.998
 9 824037040996098049  961691888939126784      0.998
10 1153678152838852614 824037040996098049      0.998

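Using group_by() should work. Grouping by year and week before counting makes the similarity get computed separately within each week: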

rstats_tmls %>%
  # Derive the time period of each tweet
  mutate(year = lubridate::year(created_at), 
         week = lubridate::week(created_at)) %>% 
  unnest(hashtags) %>%
  # Count and compare hashtags within each year-week
  group_by(year, week) %>% 
  count(user_id, hashtags) %>%
  pairwise_similarity(user_id, hashtags, n, sort = TRUE, upper = FALSE)


# # A tibble: 204 × 5
# # Groups:   year, week [112]
#     year  week item1      item2              similarity
#    <dbl> <dbl> <chr>      <chr>                   <dbl>
#  1  2014     3 2170413740 559211484               0.5  
#  2  2014    11 2170413740 559211484               0.707
#  3  2017    28 2170413740 824037040996098049      1    
#  4  2017    29 2170413740 824037040996098049      0.986
#  5  2017    30 2170413740 824037040996098049      1    
#  6  2017    32 2170413740 824037040996098049      0.949
#  7  2017    33 2170413740 824037040996098049      0.962
#  8  2017    34 2170413740 824037040996098049      0.981
#  9  2017    36 2170413740 824037040996098049      0.707
# 10  2017    37 2170413740 824037040996098049      0.943
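
The same grouping idea can probably be generalized to days and months by binning created_at with lubridate::floor_date() instead of extracting year and week separately. The following is only a sketch under stated assumptions: toy_tmls is made-up stand-in data (not real tweets), similarity_by_period() is a hypothetical helper name, and the NA filter assumes that tweets without hashtags show up as NA after unnest(), which may depend on the rtweet version. Note also that lubridate::week() counts 7-day blocks from January 1 rather than calendar weeks, whereas floor_date(unit = "week") snaps to the start of the calendar week (Sunday by default).

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(widyr)

# Hypothetical stand-in for rstats_tmls: two accounts, a handful of tweets
toy_tmls <- tibble(
  user_id    = c("a", "a", "b", "b", "a", "b"),
  created_at = as.POSIXct(
    c("2021-01-04 10:00:00", "2021-01-05 11:00:00",
      "2021-01-04 12:00:00", "2021-01-06 09:00:00",
      "2021-02-01 08:00:00", "2021-02-02 08:00:00"),
    tz = "UTC"
  ),
  hashtags   = list("rstats", c("rstats", "dataviz"),
                    "rstats", "dataviz",
                    "tidyverse", c("tidyverse", "rstats"))
)

# Hypothetical helper: pairwise similarity of accounts within each period.
# `unit` is passed to floor_date(), e.g. "day", "week" or "month".
similarity_by_period <- function(tmls, unit = "week") {
  tmls %>%
    # Snap each timestamp to the start of its period
    mutate(period = floor_date(created_at, unit = unit)) %>%
    unnest(hashtags) %>%
    # Assumption: tweets without hashtags appear as NA; drop them so that
    # "no hashtag" is not treated as a shared hashtag
    filter(!is.na(hashtags)) %>%
    group_by(period) %>%
    count(user_id, hashtags) %>%
    pairwise_similarity(user_id, hashtags, n, upper = FALSE) %>%
    ungroup()
}

weekly  <- similarity_by_period(toy_tmls, unit = "week")
monthly <- similarity_by_period(toy_tmls, unit = "month")
```

With real data, similarity_by_period(rstats_tmls, unit = "day") should give the daily version in the same way. Keeping the period as a single date column also makes the result easier to plot or join later than separate year and week columns.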