
Definition of output of Quanteda findSequence function - R package for text analysis

Quick question:

The R text analysis package Quanteda - findSequences gives the following output, but I can't find documentation for some of the columns:

seqs <- findSequences(tokens, types_upper, count_min=2)
head(seqs, 3)
              sequence len          z         p       mue
     3         first time   2 -0.4159751 0.6612859 -165.7366
     8  political parties   2 -0.4159751 0.6612859 -165.7366
     9   preserve protect   2 -0.4159751 0.6612859 -165.7366

Can anyone help define z, p, and mue? Is p a probability? If so, how is it calculated? The help says "This algorithm is based on Blaheta and Johnson's “Unsupervised Learning of Multi-Word Verbs”." but gives no further detail on the output components.

The function looks interesting, but more information would be helpful.

Looking at the function code and then at the paper, z is calculated as lambda (the log odds ratio) divided by sigma (the asymptotic standard error). It is a z-score and, as Pierre commented, p is a probability: 1 - stats::pnorm(z).
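
As a quick sanity check, the z-to-p relationship can be reproduced in base R. The lambda and sigma values below are made up purely to illustrate how z is formed; only the z value is taken from the output above:

# Hypothetical lambda (log odds ratio) and sigma (asymptotic standard error),
# used only to show how the z-score is constructed:
lambda <- 1.2
sigma  <- 0.5
z_example <- lambda / sigma

# Using the actual z value from the "first time" row of the output above:
z <- -0.4159751
p <- 1 - stats::pnorm(z)   # upper-tail probability of the standard normal
p                          # 0.6612859, matching the "p" column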

mue is explained in the second paragraph of section 2.3 of Blaheta and Johnson's "Unsupervised Learning of Multi-Word Verbs": "µ = λ − 3.29σ ... This corresponds to setting the measures µ and µ1 to the lower bound of a 0.001 confidence interval for λ ..., which is a systematic way of trading recall for precision in the face of noisy data (Johnson, 2001)."

If you go to section 2.3, you can see more details:

We propose two different measures of association µ and µ1, which we call the “all subtuples” and “unigram subtuples” measures below. As we explain below, they seem to identify very different kinds of collocations, so both are useful in certain circumstances. These measures are estimates of λ and λ1 respectively, which are particular parameters of certain log-linear models. In cases where the counts are small the estimates of λ and λ1 may be noisy, and so high values from small count data should be discounted in some way when being compared with values from large count data. We do this by also estimating the asymptotic standard error σ and σ1 of λ and λ1 respectively, and set µ = λ − 3.29σ and µ1 = λ1 − 3.29σ1. This corresponds to setting the measures µ and µ1 to the lower bound of a 0.001 confidence interval for λ and λ1 respectively, which is a systematic way of trading recall for precision in the face of noisy data (Johnson, 2001).

Details on how λ and σ are calculated (and further references) are also in section 2.3.
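
To make the mue column concrete, here is a minimal sketch of that formula in R. The lambda and sigma values are again made up; the point is only the λ − 3.29σ construction and where the 3.29 constant appears to come from:

lambda <- 1.2    # hypothetical log odds ratio
sigma  <- 0.5    # hypothetical asymptotic standard error of lambda

# Lower bound of a two-sided 0.001-level (99.9%) confidence interval:
mue <- lambda - 3.29 * sigma

# The 3.29 constant matches the corresponding standard normal quantile:
qnorm(1 - 0.001 / 2)   # 3.290527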