Quanteda: How can I select and examine a specific feature within an FCM?

Question

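I have an 8,347 x 8,347 feature co-occurrence matrix created with tri = FALSE. I would like to select a single feature so that I can see which terms co-occur with it most often. It seems this would require selecting the column for that feature and sorting the associated rows in descending order.

fcm_select does not work for this, because it keeps the matched term in both the columns and the rows: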

SELECT_FROM_FCM = fcm_select(
    MY_FCM,
    pattern = c("FEATURE"),
    selection = c("keep"),
    valuetype = c("glob"),
    case_insensitive = TRUE
)

View(SELECT_FROM_FCM)

---------------------
|         | FEATURE |
---------------------
| FEATURE | 667     |
---------------------

dfm_subset does not seem to work either. Am I going about this the wrong way?

Answer 1

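You can create the fcm and then select from it using ordinary matrix indexing. In this example, I form a document-context feature co-occurrence matrix from the last 10 inaugural addresses and look up the features that co-occur with "war" and "terror".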

library("quanteda")
## Package version: 2.0.1

fcmat <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE) %>%
  fcm()

# select a specific feature
fcmat[, c("war", "terror")]
## Feature co-occurrence matrix of: 3,467 by 2 features.
##            features
## features    war terror
##   Senator    10      2
##   Hatfield    1      1
##   Mr         18      3
##   Chief       7      1
##   Justice     7      1
##   President  32      8
##   Vice        9      2
##   Bush        4      2
##   Mondale     1      1
##   Baker       1      1
## [ reached max_feat ... 3,457 more features ]
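To get the ranking the question asks for, the selected column can be pulled out as a plain named vector and sorted. The following is only a minimal sketch on a made-up toy corpus (the texts and the names txts, fcmat_toy, war_col and top_war are illustrative, not from the original post). It builds the fcm with tri = FALSE, as in the question, so that each column holds the full set of co-occurrence counts rather than only the upper triangle.

```r
library("quanteda")

# Toy corpus (hypothetical data, for illustration only)
txts <- c(d1 = "war and terror abroad",
          d2 = "war and peace",
          d3 = "peace at home")

# tri = FALSE keeps the full symmetric matrix, as in the question
fcmat_toy <- fcm(tokens(txts), context = "document", tri = FALSE)

# Extract one column as a named numeric vector
war_col <- as.matrix(fcmat_toy)[, "war"]

# Drop the feature's co-occurrence with itself, then sort descending
war_col <- war_col[names(war_col) != "war"]
top_war <- sort(war_col, decreasing = TRUE)

# Show the most frequent co-occurring terms
head(top_war, 5)
```

On an 8,347 x 8,347 matrix the dense conversion is still manageable, but converting only the selected column, as.matrix(fcmat_toy[, "war"]), avoids densifying the whole object.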

In the forthcoming version 2.1.0 (available only on GitHub as of June 5, 2020), you can use char_select() to pattern-match the feature names, for example:

# only in forthcoming 2.1.0 (currently on GitHub)
fcmat[, char_select(featnames(fcmat), "terror*")]
## Feature co-occurrence matrix of: 3,467 by 2 features.
##            features
## features    terror terrorism
##   Senator        2         2
##   Hatfield       1         1
##   Mr             3         3
##   Chief          1         2
##   Justice        1         2
##   President      8        10
##   Vice           2         2
##   Bush           2         2
##   Mondale        1         1
##   Baker          1         1
## [ reached max_feat ... 3,457 more features ]
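If char_select() is not available in the installed version, a similar selection can probably be done with base R pattern matching on the column names. This is a sketch under that assumption, again on a made-up toy corpus; the regular expression "^terror" is meant to play the role of the glob "terror*", and colnames() is used here simply because it works on any matrix-like object.

```r
library("quanteda")

# Toy corpus (hypothetical data, for illustration only)
txts <- c(d1 = "terror and terrorism",
          d2 = "war on terror")

fcmat_toy <- fcm(tokens(txts), context = "document", tri = FALSE)

# Regex equivalent of the glob "terror*": names starting with "terror"
matched <- grep("^terror", colnames(fcmat_toy), value = TRUE)

# Ordinary matrix indexing with the matched feature names
fcmat_terror <- fcmat_toy[, matched]
fcmat_terror
```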

Finally, these fcm results are easy to convert to a data.frame or a regular matrix for output or for use in other systems, if that is what you ultimately need.
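One possible way to do that conversion is sketched below, going through a dense matrix first. The toy corpus and the object names (sub, dense, df) are illustrative assumptions, not part of the original answer.

```r
library("quanteda")

# Toy corpus (hypothetical data, for illustration only)
txts <- c(d1 = "war and terror abroad",
          d2 = "war and peace",
          d3 = "peace at home")

fcmat_toy <- fcm(tokens(txts), context = "document", tri = FALSE)

# Select the columns of interest, then convert to a regular (dense) matrix
sub <- fcmat_toy[, c("war", "terror")]
dense <- as.matrix(sub)

# Build a data.frame with the row features as an explicit column
df <- data.frame(feature = rownames(dense), dense, row.names = NULL)

# Sort by co-occurrence with "war", highest first
df <- df[order(-df$war), ]
head(df)
```

Keeping the feature names in their own column, rather than as row names, tends to make the result easier to write out with write.csv() or pass to other tools.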
