Measuring co-occurrence patterns in media articles over time with Quanteda

Question

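I'm trying to measure how many times different words co-occur with a particular term in collections of Chinese newspaper articles, one collection per quarter of a year. To do this, I have been using Quanteda and have written several R functions that I run on each group of articles. My steps are:

Group the articles by quarter.
Produce a feature co-occurrence matrix (FCM) for the articles in each quarter (Function 1).
Take the column of this matrix for the 'term' I'm interested in and convert it to a data.frame (Function 2).
Merge the data.frames for each quarter together, then produce one large csv file with a column for each quarter and a row for each co-occurring term.

This seems to work OK. But I wondered whether anyone more skilled in R could check that what I am doing is correct, or suggest a more efficient way of doing it?

Thanks for any help!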

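# Function 1: produce the FCM

library("quanteda")

get_fcm <- function(data) {
  # Chinese stopword list from the "misc" source
  ch_stop <- stopwords("zh", source = "misc")
  corp <- corpus(data)
  # Tokenize, dropping punctuation, then remove stopwords
  toks <- tokens(corp, remove_punct = TRUE)
  toks <- tokens_remove(toks, ch_stop)
  # Count co-occurrences of directly adjacent tokens (window = 1);
  # tri = FALSE keeps the full symmetric matrix rather than one triangle.
  # The result is named fcmat so it does not shadow the fcm() function.
  fcmat <- fcm(toks, context = "window", window = 1, tri = FALSE)
  return(fcmat)
}

fcm_14q4 <- get_fcm(data_14q4)
fcm_15q1 <- get_fcm(data_15q1)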

# Function 2: select the column for the 'term' of interest (such as China 中国) and make a data.frame

convert2df <- function(fcmat, term) {
  # Column of co-occurrence counts for the target term
  mat_term <- fcmat[, term]
  df <- convert(mat_term, to = "data.frame")
  colnames(df)[1] <- "Term"
  colnames(df)[2] <- "Freq"
  # Sort by descending co-occurrence count
  x <- df[order(-df$Freq), ]
  return(x)
}

# Named per quarter so they match the objects used in the merge step
CH14q4 <- convert2df(fcm_14q4, "中国")
CH15q1 <- convert2df(fcm_15q1, "中国")

#Merging the data.frames

df <- merge(x=CH14q4, y=CH15q1, by="Term", all.x=TRUE, all.y=TRUE)
df <- merge(x=df, y=CH15q2, by="Term", all.x=TRUE, all.y=TRUE) #etc for all the dataframes...
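The repeated merge() calls above can be collapsed into one step. The following is only a sketch in base R: the three small data.frames are made-up stand-ins for the output of convert2df(), and the `Freq_<quarter>` column names are an assumption chosen so the merged columns stay distinguishable.

```r
# Hypothetical toy data standing in for the per-quarter output of convert2df()
CH14q4 <- data.frame(Term = c("economy", "government"), Freq = c(5, 3))
CH15q1 <- data.frame(Term = c("economy", "market"), Freq = c(4, 2))
CH15q2 <- data.frame(Term = c("government", "market"), Freq = c(1, 6))

quarters <- list("14q4" = CH14q4, "15q1" = CH15q1, "15q2" = CH15q2)

# Rename each Freq column after its quarter so the merged columns stay distinct
quarters <- Map(function(df, q) {
  names(df)[names(df) == "Freq"] <- paste0("Freq_", q)
  df
}, quarters, names(quarters))

# Full outer join across all quarters (all = TRUE is all.x = TRUE plus all.y = TRUE)
merged <- Reduce(function(x, y) merge(x, y, by = "Term", all = TRUE), quarters)

# A term missing from a quarter comes back as NA; treat it as zero co-occurrences
merged[is.na(merged)] <- 0

print(merged)
```

Adding another quarter then only means adding one more element to the list, and write.csv(merged, ...) would produce the one-column-per-quarter file described in the question.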

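Update: Following Ken's suggestion in the comments below, I tried a different approach, using the window argument of tokens_select() and then a document-feature matrix. After labelling the corpus documents by quarter, the following R function should take the tokenized corpus toks and produce a data.frame of the number of times words co-occur with term within the specified window.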

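COOCdfm <- function(toks, term, window) {
  ch_stop <- stopwords("zh", source = "misc")
  # Keep only the target term and the tokens within `window` positions of it
  cooc_toks <- tokens_select(toks, term, window = window)
  # Clean up after the window selection, so the window is measured on the original text
  cooc_toks <- tokens(cooc_toks, remove_punct = TRUE)
  cooc_toks <- tokens_remove(cooc_toks, ch_stop)
  dfmat <- dfm(cooc_toks)
  # Pass the docvar values themselves as the grouping vector; this does not
  # depend on whether the installed quanteda version accepts the quoted name "quarter"
  dfmat_grouped <- dfm_group(dfmat, groups = docvars(dfmat, "quarter"))
  # Transpose so that features are rows and quarters are columns
  counts <- convert(t(dfmat_grouped), to = "data.frame")
  colnames(counts)[1] <- "Feature"
  return(counts)
}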
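COOCdfm() relies on a docvar called quarter already being attached to the documents. One possible way to set it up is sketched here, assuming the articles sit in a data.frame with `text` and `date` columns; those column names, the toy rows, and the "2014Q4" label format are all hypothetical.

```r
library("quanteda")

# Hypothetical input: one row per article, with `text` and `date` columns
articles <- data.frame(
  text = c("China economy grows", "China market falls", "Trade with China expands"),
  date = as.Date(c("2014-11-03", "2015-01-20", "2015-02-11")),
  stringsAsFactors = FALSE
)

corp <- corpus(articles, text_field = "text")

# Label each document as e.g. "2014Q4"; quarters() is base R and returns "Q1" to "Q4"
docvars(corp, "quarter") <- paste0(format(articles$date, "%Y"), quarters(articles$date))

# Conservative tokenization; punctuation and stopwords are removed later,
# after the window selection
toks <- tokens(corp)

print(docvars(toks, "quarter"))
```

The resulting toks object carries the quarter docvar through tokens_select() and dfm(), so it can be passed to COOCdfm() together with a term and a window size.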

Answer 1

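If you are interested in counting co-occurrences within a window around a specific target term, a better way is to use the window argument of tokens_select(), and then to count occurrences from a dfm built on the window-selected tokens.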

library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

toks <- tokens(data_corpus_inaugural)

dfmat <- toks %>%
  tokens_select("nuclear", window = 5) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()

topfeatures(dfmat)[-1]
##     weapons      threat        work       earth elimination         day 
##           6           3           2           2           2           1 
##         one        free       world 
##           1           1           1

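Here I first did a "conservative" tokenization that keeps everything, and then performed the context selection. After that I processed the tokens further to remove punctuation and stopwords, before tabulating the results in a dfm. This dfm will be large and very sparse, but you can summarise the most frequent co-occurring words using topfeatures() or quanteda.textstats::textstat_frequency().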
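To get counts over time, as the question asks, the same dfm can be grouped by a time variable before summarising. This is only a sketch on the inaugural corpus: grouping by decade is an arbitrary choice made for illustration, and `decade` is a helper vector introduced here, derived from the corpus's Year docvar.

```r
library("quanteda")

toks <- tokens(data_corpus_inaugural)

# Same steps as above, written without the pipe
toks_win <- tokens_select(toks, "nuclear", window = 5)
toks_win <- tokens(toks_win, remove_punct = TRUE)
toks_win <- tokens_remove(toks_win, stopwords("en"))
dfmat <- dfm(toks_win)

# Hypothetical grouping variable: the decade of each speech, e.g. "1980s"
decade <- paste0(floor(docvars(dfmat, "Year") / 10) * 10, "s")

# Sum the counts within each decade, then drop the target term itself
dfmat_decade <- dfm_group(dfmat, groups = decade)
dfmat_decade <- dfm_remove(dfmat_decade, "nuclear")

# One row per decade, one column per co-occurring word
counts <- convert(dfmat_decade, to = "data.frame")
```

For the newspaper data, the quarter of each article would take the place of `decade`.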
