在推文语料库中构建文本相似性时间序列

Building text-similarity time series in a corpus of tweets

我想测量文本相似度随时间的演变。我的数据框包含一列推文标识符 (id)、一列日期频率 (date) 和一列推文的整洁文本 (clean_text)。

这里是一个简短的repex,其中包含一些实际清理过的推文:

final <- data.frame(id=1:5,Date=c(as.Date("2020-12-26"),as.Date("2020-12-26"),as.Date("2020-12-27"),
                    as.Date("2020-12-27"),as.Date("2020-12-27")),
              clean_text = c("americans died covid nfebruary people couple days",
                                        "cops crush peoples necks death eric garner",  
                                        "video clip tells george floyd resist arrest earlier claimed police officer", 
                                        "black americans terrible daily dangers outdoor spaces subjected unwarranted suspicion", 
                                         "announcement waiting minneapolis police officer derek chauvin charged manslaug"))

因为我想要一些可以让我衡量 similar/disimilar 文本如何随时间变化的东西,所以我考虑使用一些相似性度量(即频率、余弦相似性...),通过 textstat_simil 来自quanteda.

这是我的尝试:

require(quanteda)            
start <- as.Date("2020-12-26",format="%Y-%m-%d")
    end   <- as.Date("2020-12-27",format="%Y-%m-%d")
    
    theDate <- start
    i=1
    
    similarity<-data.frame(matrix(NA, nrow = as.numeric(end+1-start), ncol = 1))
    #value<-vector(mode="numeric")
    colnames(similarity)<-c("value")
    while (theDate <= end){
      if (nrow(subset(final,final$Date==theDate))>1){
        corp <- corpus(subset(final,final$Date==theDate),
                       docid_field = "id",
                       text_field = "clean_text")
        a<-as.matrix(dfm(corp) %>%
                       textstat_simil())
        
            similarity$value[i]<-sum(a[lower.tri(a)])/length(corp)
      }else{
        similarity$value[i]<-0
      } 
      i<-i+1
      theDate <- theDate + 1 
      
    }

在此代码中,我对相关矩阵 a 的下三角元素求和并除以推文数量以获得推文相关性的“平均值”。我认为这是有问题的步骤,因为当我将此函数应用于我的推文语料库时,相似度恰恰在推文非常相关的时刻下降。

此外,我必须说,我的目标是获得一个时间序列,其中一列用于衡量相似性,另一列用于日期,以便可以对其进行绘制。也许有更简单的 npl 方法来跟踪这个?

编辑:刚刚意识到矩阵 a 中的相关项是负数。不确定如何解释或者编码是否有错误?

这个怎么样:

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")

dfmat <- final %>%
  corpus(text_field = "clean_text") %>%
  tokens() %>%
  dfm()

现在我们将按日期创建一系列 dfm 对象:

date_splits <- split(seq_len(ndoc(dfmat)), dfmat$Date)
date_splits
## $`2020-12-26`
## [1] 1 2
## 
## $`2020-12-27`
## [1] 3 4 5

dfmat_split <- lapply(date_splits, function(x) dfmat[x, ])
dfmat_split
## $`2020-12-26`
## Document-feature matrix of: 2 documents, 41 features (82.93% sparse) and 2 docvars.
##        features
## docs    americans died covid nfebruary people couple days cops crush peoples
##   text1         1    1     1         1      1      1    1    0     0       0
##   text2         0    0     0         0      0      0    0    1     1       1
## [ reached max_nfeat ... 31 more features ]
## 
## $`2020-12-27`
## Document-feature matrix of: 3 documents, 41 features (75.61% sparse) and 2 docvars.
##        features
## docs    americans died covid nfebruary people couple days cops crush peoples
##   text3         0    0     0         0      0      0    0    0     0       0
##   text4         1    0     0         0      0      0    0    0     0       0
##   text5         0    0     0         0      0      0    0    0     0       0
## [ reached max_nfeat ... 31 more features ]

然后,通过将每个 textstat_simil() 矩阵输出强制转换为 data.frame,按日期创建一个包含 data.frame 个结果的列表,然后在输出 [=25] 中加入这些结果=] 我们在其中添加日期列。这给了我们 data.frame 的(唯一)成对相似性,按日期:

simil_by_date <- lapply(dfmat_split, function(x) {
  textstat_simil(x) %>%
    as.data.frame()
})
simil_by_date
## $`2020-12-26`
##   document1 document2 correlation
## 1     text1     text2  -0.2058824
## 
## $`2020-12-27`
##   document1 document2 correlation
## 1     text3     text4 -0.34391797
## 2     text3     text5 -0.05514368
## 3     text4     text5 -0.30120725

df <- data.frame(
  date = as.Date(rep(names(simil_by_date), sapply(simil_by_date, nrow))),
  do.call(rbind, simil_by_date),
  row.names = NULL
)
df
##         date document1 document2 correlation
## 1 2020-12-26     text1     text2 -0.20588235
## 2 2020-12-27     text3     text4 -0.34391797
## 3 2020-12-27     text3     text5 -0.05514368
## 4 2020-12-27     text4     text5 -0.30120725

现在可以使用您选择的函数轻松聚合相似性。在这里,我使用 dplyr 通过平均相关性来做到这一点:

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df %>%
  group_by(date) %>%
  summarise(simil = mean(correlation))
## # A tibble: 2 x 2
##   date        simil
##   <date>      <dbl>
## 1 2020-12-26 -0.206
## 2 2020-12-27 -0.233