Finding repeated sentences/words/phrases by group over time
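I have a dataset in which each column is a variable and each row is an observation (like time series data). It looks like this (I apologize for the formatting, but I cannot show the data):
I want to know whether a person or a group keeps saying the same thing over time. I am familiar with n-grams, but that is not quite what I need. Any help would be appreciated.
Here is the output I would like:
Sorry for all the edits; I am still getting used to this site.
Something like this?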
library(magrittr)  # provides the %>% pipe used below

df <- data.frame(date = Sys.Date() - sample(10),
                 Group = c("Cars", "Trucks") %>% sample(10, replace = TRUE),
                 Reporting_person = c("A", "B", "C") %>% sample(10, replace = TRUE),
                 Comments = c("Awesome", "Meh", "NC") %>% sample(10, replace = TRUE))
#          date  Group Reporting_person Comments
# 1  2017-06-08 Trucks                B  Awesome
# 2  2017-06-05 Trucks                A  Awesome
# 3  2017-06-14   Cars                B      Meh
# 4  2017-06-06   Cars                B  Awesome
# 5  2017-06-11   Cars                A      Meh
# 6  2017-06-07   Cars                B       NC
# 7  2017-06-09   Cars                A       NC
# 8  2017-06-10   Cars                A       NC
# 9  2017-06-13 Trucks                C  Awesome
# 10 2017-06-12 Trucks                B       NC
aggregate(date ~ ., df, length)
#    Group Reporting_person Comments date
# 1 Trucks                A  Awesome    1
# 2   Cars                B  Awesome    1
# 3 Trucks                B  Awesome    1
# 4 Trucks                C  Awesome    1
# 5   Cars                A      Meh    1
# 6   Cars                B      Meh    1
# 7   Cars                A       NC    2
# 8   Cars                B       NC    1
# 9 Trucks                B       NC    1
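In the counts above, the date column holds the number of rows for each Group / Reporting_person / Comments combination, so anything greater than 1 is a repeated comment. As a minimal sketch (the toy data and the names counts, n and repeated are my own assumptions, not part of the question), you could keep only the repeated ones like this:

```r
# Hypothetical toy data; the column names follow the example above.
df <- data.frame(
  date = as.Date("2017-06-05") + 0:5,
  Group = c("Cars", "Cars", "Cars", "Trucks", "Trucks", "Cars"),
  Reporting_person = c("A", "A", "B", "B", "B", "A"),
  Comments = c("NC", "NC", "Meh", "Awesome", "Awesome", "Meh"),
  stringsAsFactors = FALSE
)

# Count how often each Group / person / comment combination occurs.
counts <- aggregate(date ~ Group + Reporting_person + Comments,
                    data = df, FUN = length)
names(counts)[names(counts) == "date"] <- "n"  # "n" is an illustrative name

# Keep only the combinations that occur more than once.
repeated <- counts[counts$n > 1, ]
print(repeated)
```

This only tells you that a comment was repeated, not whether the repeats were consecutive in time.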
If you want to see the frequency of each comment for each person, broken down by the new column Ready, you can do it with the following code:
library(dplyr)  # loaded first: it provides %>%, group_by(), count() and mutate()

set.seed(123456)
### I use the same kind of data as the previous example (thank you for providing it!),
### with 100 rows and an extra Ready column
data <- data.frame(date = Sys.Date() - sample(100),
                   Group = c("Cars", "Trucks") %>% sample(100, replace = TRUE),
                   Reporting_person = c("A", "B", "C") %>% sample(100, replace = TRUE),
                   Comments = c("Awesome", "Meh", "NC") %>% sample(100, replace = TRUE),
                   Ready = as.character(c("Yes", "No") %>% sample(100, replace = TRUE))
)

### Count and proportion of each comment within each person / Ready combination
data %>%
  group_by(Reporting_person, Ready) %>%
  count(Comments) %>%
  mutate(prop = prop.table(n))
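If you want to see whether the comments change over time, and whether that change is associated with an event (such as Ready), you can use the following code: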
library(dplyr)

### Add a column holding the previous comment (lag) of the same person in the same group
new <- data %>%
  arrange(Reporting_person, Group, date) %>%
  group_by(Group, Reporting_person) %>%
  mutate(comments_plusone = lag(Comments))
### Drop the first observation of each group, which has no previous comment
new <- na.omit(new)

### Create the Change column: 1 is a change, 0 is no change
new$Change <- as.numeric(new$Comments != new$comments_plusone)

### Get the association between Change and the event...
### Chi-squared test of association between the event (Ready) and the change
### Note that using Pearson correlation is not pertinent here:
tbl <- table(new$Ready, new$Change)
chi2 <- chisq.test(tbl, correct = FALSE)
c(chi2$statistic, chi2$p.value)
### Effect size: square root of the chi-squared statistic over the total count
sqrt(chi2$statistic / sum(tbl))
You should not find any significant correlation in this example, as you can clearly see when plotting the table:
plot(tbl)
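To make the last quantity in the code above more concrete: the square root of the chi-squared statistic divided by the total count is commonly called the phi coefficient for a 2x2 table. Here is a small sketch on a made-up table (the counts are purely hypothetical and are not taken from the simulated data):

```r
# Hypothetical 2x2 table of counts: rows = Ready (No/Yes), columns = Change (0/1).
tbl <- matrix(c(10, 5, 5, 10), nrow = 2,
              dimnames = list(Ready = c("No", "Yes"), Change = c("0", "1")))

# Chi-squared test without the continuity correction, as in the code above.
chi2 <- chisq.test(tbl, correct = FALSE)

# Phi coefficient: square root of the statistic divided by the total count.
phi <- sqrt(unname(chi2$statistic) / sum(tbl))
print(phi)
```

With these made-up counts every expected cell count is 7.5, so the statistic is 10/3 and phi comes out at about 0.33.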
Note that the cor function is not appropriate for two binary variables.
Here is a post on this topic.... Correlation between two binary
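Since the original question was whether someone keeps saying the same thing, another angle is the longest streak of identical consecutive comments per person. This is only a sketch under my own assumptions (toy data, and the helper name longest_run_length is hypothetical); it uses base R's rle() on comments sorted by date:

```r
# Hypothetical toy data; the column names follow the examples above.
df <- data.frame(
  date = as.Date("2017-06-01") + 0:5,
  Reporting_person = c("A", "A", "A", "B", "B", "B"),
  Comments = c("NC", "NC", "NC", "Meh", "Awesome", "Awesome"),
  stringsAsFactors = FALSE
)

# Sort by person and date so that runs follow the time order.
df <- df[order(df$Reporting_person, df$date), ]

# Length of the longest run of identical consecutive values in a vector.
longest_run_length <- function(x) max(rle(x)$lengths)

# Apply the helper to each person's comments.
run_lengths <- tapply(df$Comments, df$Reporting_person, longest_run_length)
print(run_lengths)
```

A long run relative to the number of observations suggests that the person rarely changes their comment.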
Frequency of change by state change
Following your comment, I added this code:
newR <- data %>%
  arrange(Reporting_person, Group, date) %>%
  group_by(Group, Reporting_person) %>%
  mutate(Ready_plusone = lag(Ready))
newR <- na.omit(newR)

###------------------------ Add the column to the new data frame
### Create the state change as "current_previous" Ready values (e.g. "Yes_No").
### I paste the two values together because you seem to have more than 2 levels.
### new and newR are sorted and filtered the same way, so their rows line up.
new$State_change <- paste(newR$Ready, newR$Ready_plusone, sep = "_")

### Get the frequency of Change by state change (Yes_No, No_Yes, ...)
result <- new %>%
  group_by(Reporting_person, State_change) %>%
  count(Change) %>%
  mutate(Frequence = prop.table(n)) %>%
  filter(Change == 1)

### tidyr is a great library for reshaping data; you want the wide format of the previous long
### data frame... However, doing this will generate a lot of NA, so if I were you I would keep
### the result format instead of the following, but this could be helpful for future needs, so here you go.
library(tidyr)
### The value column is Frequence (result has no column named prop).
### Columns 4:7 assume that all four Yes/No state changes occur in the data.
final <- as.data.frame(spread(ungroup(result), key = State_change, value = Frequence))[, c(1, 4:7)]
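As a side note, spread() has been superseded in tidyr (version 1.0.0 and later) by pivot_wider(). A minimal sketch of the same reshaping, on a small hand-made stand-in for result (the values are hypothetical, and I drop the Change and n columns first so that each person collapses to one row):

```r
library(tidyr)

# Hypothetical stand-in for `result`, reduced to the three columns needed.
result <- data.frame(
  Reporting_person = c("A", "A", "B", "B"),
  State_change = c("No_Yes", "Yes_No", "No_Yes", "Yes_Yes"),
  Frequence = c(0.5, 0.75, 0.6, 0.4),
  stringsAsFactors = FALSE
)

# One row per person, one column per state change; missing combinations become NA.
wide <- pivot_wider(result, names_from = State_change, values_from = Frequence)
print(wide)
```

Dropping n before reshaping is a design choice on my part: if it is kept, each distinct count ends up on its own row, which is where most of the NA values come from.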
Hope this helps :)