Data manipulation: Select users based on variables

Question

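I am currently working on a machine learning project. I have a large dataset scraped from the forum www.stormfront.com. The dataset has 7 columns: stormfront_self_content (the forum posts), stormfront_lang_id, stormfront_publication_date, stormfront_topic, stormfront_docid, stormfront_category, and stormfront_user.

I want to select the set of users who have been registered on the forum for more than one year and have written more than 500 posts, but I don't know how to do it.

Any help would be greatly appreciated.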

Answer 1

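Assuming you have some id column that identifies each user, we can group_by each id and keep the groups that have more than 500 rows and where the time between the max and min publication date is greater than 365 days.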

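library(dplyr)
library(lubridate)

df %>%
  # Parse the timestamps; ymd_hms() expects year-month-day hour-minute-second input
  mutate(stormfront_publication_date = ymd_hms(stormfront_publication_date)) %>%
  # id stands for the user column (presumably stormfront_user in the question's data)
  group_by(id) %>%
  # Keep users with more than 500 posts spanning more than 365 days
  filter(n() > 500,
         difftime(max(stormfront_publication_date),
                  min(stormfront_publication_date), units = "days") > 365) %>%
  ungroup()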
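
To check the logic before running it on the full dataset, here is a minimal sketch on made-up toy data. It assumes stormfront_user is the column that identifies each user, and it uses the span between a user's first and last post as a proxy for how long they have been registered, since the dataset has no registration date. The names posts, min_posts, min_days, active_users and selected_posts are only illustrative, and the post threshold is lowered so the toy data can pass it.

```r
library(dplyr)

# Toy data, invented for illustration: user "a" has 3 posts spanning two years,
# user "b" has 3 posts within one week, user "c" has a single post.
posts <- tibble(
  stormfront_user = c("a", "a", "a", "b", "b", "b", "c"),
  stormfront_publication_date = as.POSIXct(c(
    "2010-01-01 10:00:00", "2011-06-01 12:00:00", "2012-01-01 09:00:00",
    "2010-01-01 10:00:00", "2010-01-03 10:00:00", "2010-01-07 10:00:00",
    "2010-05-05 08:00:00"
  ), tz = "UTC")
)

min_posts <- 2    # the question asks for 500; lowered for the toy data
min_days  <- 365

# One row per user: post count and days between first and last post
active_users <- posts %>%
  group_by(stormfront_user) %>%
  summarise(
    n_posts = n(),
    span_days = as.numeric(difftime(max(stormfront_publication_date),
                                    min(stormfront_publication_date),
                                    units = "days")),
    .groups = "drop"
  ) %>%
  filter(n_posts > min_posts, span_days > min_days)

# Keep only the posts written by the qualifying users
selected_posts <- posts %>%
  semi_join(active_users, by = "stormfront_user")
```

Summarising first gives a small per-user table that is easy to inspect, and the semi_join then recovers the matching posts. With the real data, min_posts would be set to 500.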

nlp

analysis

r

data-manipulation