Combine nest() and aggregate() in R?
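Looking for help and advice:
I collected tweets with the rtweet package. This gives me a data frame with observations (i.e. tweets) in rows and variables in columns. The variables are at both the tweet level (e.g. text, likes, hashtags) and the account level (follower count, bio, etc.). I ran a sentiment analysis on the tweets, which added a tweet-level sentiment score variable to the data frame.
A mock-up of what my data looks like now (in reality I have more than 100,000 observations and 115 variables):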
df <- data.frame(users = c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1'),
text = c('this is u1 first tweet',
'this is another tweet',
'hello hello',
'hashtag tweettext',
'tweet text',
'this is u1 second tweet',
'this is u6 first tzeet',
'this is u6 second tweet',
'this is u6 third tweet',
'this is u1 third tweet'),
likes= sample(1:10, 10),
sentiment= rnorm(10, mean=0, sd=1),
followers = c(111, 200, 300, 400, 500, 111, 666, 666, 666, 111),
bio = paste0(rep('lorem ipsum', 10), " ", c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1')))
users text likes sentiment followers bio
1 u1 this is u1 first tweet 1 0.96445407 111 lorem ipsum u1
2 u2 this is another tweet 10 1.03840459 200 lorem ipsum u2
3 u3 hello hello 7 1.76887362 300 lorem ipsum u3
4 u4 hashtag tweettext 5 -0.57165015 400 lorem ipsum u4
5 u5 tweet text 4 -1.47028289 500 lorem ipsum u5
6 u1 this is u1 second tweet 2 -1.11036644 111 lorem ipsum u1
7 u6 this is u6 first tzeet 3 0.25440339 666 lorem ipsum u6
8 u6 this is u6 second tweet 8 0.02334468 666 lorem ipsum u6
9 u6 this is u6 third tweet 9 -2.71592529 666 lorem ipsum u6
10 u1 this is u1 third tweet 6 1.18528925 111 lorem ipsum u1
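Now, what I want to do is work at the user-account level. For that, I want to aggregate the mean likes and mean sentiment score per user, while combining all of each user's tweet texts into one vector (one long string would also be fine). The bio should not be merged.
In general, the aggregation is no problem: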
df%>%
group_by(users)%>%
summarise(meanlikes = mean(likes),
meansentiment = mean(sentiment))
For nesting the data, I got this far:
df %>%
select(-likes, -sentiment) %>%
nest(-users, -followers, -bio)
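I have not managed to combine the two in a single piece of code. I ran the two operations separately and then joined the results with inner_join(), which seems to work, but this approach is very cumbersome because I have 115 variables.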
d1<- df %>%
select(-likes, -sentiment) %>%
nest(-users, -followers, -bio)
d2 <- df %>%
group_by(users)%>%
summarise(meanlikes = mean(likes),
meansentiment = mean(sentiment))
d1 <- d1 %>%
inner_join(d2)
Any suggestions?
So, to be clear, what I am looking for is a method/code that gives me this data frame:
users text followers
1 u1 this is u1 first tweet, this is u1 second tweet, this is u1 third tweet 111
2 u2 this is another tweet 200
3 u3 hello hello 300
4 u4 hashtag tweettext 400
5 u5 tweet text 500
6 u6 this is u6 first tzeet, this is u6 second tweet, this is u6 third tweet 666
bio meanlikes meansentiment
1 lorem ipsum u1 4.333333 -0.2846824
2 lorem ipsum u2 6.000000 -0.5443194
3 lorem ipsum u3 2.000000 1.8001123
4 lorem ipsum u4 4.000000 1.0114402
5 lorem ipsum u5 9.000000 -0.5637166
6 lorem ipsum u6 7.000000 1.2346833
Hope you can help me!
You can try this:
library(dplyr)
library(stringr)  # for str_c()

# set seed to make df reproducible
set.seed(1234)
df <- data.frame(users = c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1'),
text = c('this is u1 first tweet',
'this is another tweet',
'hello hello',
'hashtag tweettext',
'tweet text',
'this is u1 second tweet',
'this is u6 first tzeet',
'this is u6 second tweet',
'this is u6 third tweet',
'this is u1 third tweet'),
likes= sample(1:10, 10),
sentiment= rnorm(10, mean=0, sd=1),
followers = c(111, 200, 300, 400, 500, 111, 666, 666, 666, 111),
bio = paste0(rep('lorem ipsum', 10), " ", c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1')))
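df %>% group_by(users) %>%
  # collapse each user's tweets into one comma-separated string
  mutate(tweets = str_c(text, collapse = ", "),
         meanlikes = mean(likes),
         meansentiment = mean(sentiment)) %>%
  # drop the tweet-level columns, then keep one row per user
  select(-text, -likes, -sentiment) %>%
  distinct()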
You can group_by users and keep the first value of bio and followers, since they are the same within each user. Take the mean of likes and sentiment, and use toString to collapse text into a comma-separated string.
library(dplyr)
df %>%
group_by(users) %>%
summarise(across(c(bio, followers), first),
across(c(likes, sentiment), mean),
text = toString(text))
# users bio followers likes sentiment text
# <chr> <chr> <dbl> <dbl> <dbl> <chr>
#1 u1 lorem i… 111 6.67 0.0870 this is u1 first…
#2 u2 lorem i… 200 8 -0.945 this is another …
#3 u3 lorem i… 300 6 0.225 hello hello
#4 u4 lorem i… 400 3 0.359 hashtag tweettext
#5 u5 lorem i… 500 5 -0.664 tweet text
#6 u6 lorem i… 666 4.33 0.206 this is u6 first…
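Since the question mentions 115 variables and asks for the tweets as a vector, here is a rough sketch of how the same summarise() pattern might be scaled up. It assumes the account-level columns are known and listed by hand in account_cols (a name made up for this example), that every other numeric column is a tweet-level score to be averaged, and it uses a small made-up data frame rather than the real data. The tweets are kept as a list-column instead of one long string.

```r
library(dplyr)

# Small stand-in for the question's data (values are made up for illustration)
df <- data.frame(
  users     = c("u1", "u2", "u1"),
  text      = c("first tweet", "another tweet", "second tweet"),
  likes     = c(1, 10, 3),
  sentiment = c(0.5, -0.2, 0.1),
  followers = c(111, 200, 111),
  bio       = c("lorem ipsum u1", "lorem ipsum u2", "lorem ipsum u1"),
  stringsAsFactors = FALSE
)

# Assumption: account-level columns are constant within a user and listed by hand
account_cols <- c("bio", "followers")

by_user <- df %>%
  group_by(users) %>%
  summarise(
    # keep one copy of each account-level column
    across(all_of(account_cols), first),
    # average every remaining numeric (tweet-level) column
    across(where(is.numeric) & !all_of(account_cols), mean,
           .names = "mean_{.col}"),
    # list-column: one character vector of tweets per user
    tweets = list(text)
  )

by_user$tweets[[1]]  # all tweets of the first user as a character vector
```

If a single string is preferred, tweets = toString(text) can be used in place of list(text). Any tweet-level column that is neither numeric nor handled explicitly is simply dropped by summarise().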