How to calculate proximity of words to a specific term in a document
I am trying to figure out a way to calculate word proximities to a specific term in a document, as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. So, let's say I have the following text:
song <- "Far over the misty mountains cold To dungeons deep and caverns old We
must away ere break of day To seek the pale enchanted gold. The dwarves of
yore made mighty spells, While hammers fell like ringing bells In places deep,
where dark things sleep, In hollow halls beneath the fells. For ancient king
and elvish lord There many a gleaming golden hoard They shaped and wrought,
and light they caught To hide in gems on hilt of sword. On silver necklaces
they strung The flowering stars, on crowns they hung The dragon-fire, in
twisted wire They meshed the light of moon and sun. Far over the misty
mountains cold To dungeons deep and caverns old We must away, ere break of
day, To claim our long-forgotten gold. Goblets they carved there for
themselves And harps of gold; where no man delves There lay they long, and
many a song Was sung unheard by men or elves. The pines were roaring on the
height, The winds were moaning in the night. The fire was red, it flaming
spread; The trees like torches blazed with light. The bells were ringing in
the dale And men they looked up with faces pale; The dragon’s ire more fierce
than fire Laid low their towers and houses frail. The mountain smoked beneath
the moon; The dwarves they heard the tramp of doom. They fled their hall to
dying fall Beneath his feet, beneath the moon. Far over the misty mountains
grim To dungeons deep and caverns dim We must away, ere break of day,
To win our harps and gold from him!"
I would like to be able to see, for every occurrence of the word "fire" (the term should be interchangeable), which words appear within 15 words on either side of it (15 to the left and 15 to the right; that number should also be interchangeable). For each instance of "fire", I would like to see each such word and how many times it appears within this 15-word range. So, for example, "fire" is used 3 times. Of those 3 times, the word "light" falls within 15 words on either side twice. I would like a table showing the word, the number of times it appears within the specified proximity of 15, the maximum distance (in this case 12), the minimum distance (7), and the average distance (9.5).
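For the "light" example above, the corresponding row of that table would look something like this (illustrative column names only):

word    times_within_15  max_distance  min_distance  avg_distance
light                 2            12             7           9.5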
I figured I would need several steps and packages to make this work. My first thought was to use the "kwic" function from quanteda, since it lets you choose a "window" around a specific term. A frequency count of the terms in the kwic results is then not that hard (with stopwords removed for the frequency count, but not for the proximity measure). My real problem is finding the maximum, minimum, and average distances from the focus term, and then getting the results into a nice, neat table with the terms as rows, sorted in descending order of frequency, and columns giving me the frequency count, maximum distance, minimum distance, and average distance.
Here is what I have so far:
library(quanteda)
library(tm)
mysong <- char_tolower(song)
toks <- tokens(mysong, remove_hyphens = TRUE, remove_punct = TRUE,
               remove_numbers = TRUE, remove_symbols = TRUE)
mykwic <- kwic(toks, "fire", window = 15, valuetype = "fixed")
thekwic <- as.character(mykwic)
thekwic <- removePunctuation(thekwic)
thekwic <- removeNumbers(thekwic)
thekwic <- removeWords(thekwic, stopwords("en"))
kwicFreq <- termFreq(thekwic)
Any help is greatly appreciated.
I'd suggest solving this with a combination of my tidytext and fuzzyjoin packages.
You can start by tokenizing the text into a one-word-per-row data frame, adding a position column, and removing stopwords:
library(tidytext)
library(dplyr)
all_words <- data_frame(text = song) %>%
  unnest_tokens(word, text) %>%
  mutate(position = row_number()) %>%
  filter(!word %in% tm::stopwords("en"))
You can then filter for just the word fire, and use difference_inner_join() from fuzzyjoin to find all rows within 15 words of those rows. After that, group_by() and summarize() give you the statistics you want for each word.
library(fuzzyjoin)
nearby_words <- all_words %>%
  filter(word == "fire") %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))
words_summarized <- nearby_words %>%
  group_by(word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))
The output in this case:
# A tibble: 49 × 5
       word number maximum_distance minimum_distance average_distance
      <chr>  <int>            <dbl>            <dbl>            <dbl>
1      fire      3                0                0              0.0
2     light      2               12                7              9.5
3      moon      2               13                9             11.0
4     bells      1               14               14             14.0
5   beneath      1               11               11             11.0
6    blazed      1               10               10             10.0
7    crowns      1                5                5              5.0
8      dale      1               15               15             15.0
9    dragon      1                1                1              1.0
10 dragon’s      1                5                5              5.0
# ... with 39 more rows
Note that this approach also lets you run the analysis for several focus words at once. All you have to do is change filter(word == "fire") to filter(word %in% c("fire", "otherword")) and change group_by(word) to group_by(focus_term, word).
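Putting that together, a minimal sketch of the two-term version might look like the following, reusing all_words and the packages loaded above ("gold" is only an illustrative second focus term):

# same pipeline as above, but with two focus terms
nearby_words2 <- all_words %>%
  filter(word %in% c("fire", "gold")) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))
words_summarized2 <- nearby_words2 %>%
  group_by(focus_term, word) %>%   # one set of statistics per focus term
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(focus_term, desc(number))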
The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function for counting within a window is not kwic() but rather fcm() (feature co-occurrence matrix).
require(quanteda)
# tokenize so that intra-word hyphens and punctuation are removed
toks <- tokens(song, remove_punct = TRUE, remove_hyphens = TRUE)
# all co-occurrences
head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
## Feature co-occurrence matrix of: 155 by 1 feature.
## (showing first 6 documents and first feature)
## features
## features fire
## Far 1
## over 1
## the 5
## misty 1
## mountains 0
## cold 0
head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
## Feature co-occurrence matrix of: 1 by 1 feature.
## 1 x 1 sparse Matrix of class "fcm"
## features
## features fire
## light 2
Getting the average distance of the words from the target requires a bit of a hack of the weights function for distance. Below, the counts are weighted according to position within the window; summing these weighted counts and dividing by the total frequency within the window gives a weighted mean. For your "light" example, for instance:
# average distance
fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
## 1 x 1 Matrix of class "dgeMatrix"
##          features
## features fire
##    light  9.5
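(Sanity check: the two occurrences of "light" near "fire" are 7 and 12 tokens away, so the weighted sum is 19 and 19 / 2 = 9.5, matching the tidy answer above.)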
Getting the minimum and maximum positions is a bit more complicated. I can think of a way to "hack" it using a combination of weights that place a binary mask at each position and then convert that to a distance, but it is too ungainly, so I recommend the tidy solution unless I come up with something more elegant.
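For what it's worth, a minimal sketch of that binary-mask idea (assuming fcm() accepts an arbitrary weights vector of length window with count = "weighted", as in the weighted-mean call above) could probe one distance at a time and take the smallest and largest distance with a non-zero count:

# sketch only: count co-occurrences of `word` with `target` exactly d tokens apart
probe_distance <- function(d, word = "light", target = "fire", window = 15) {
  mask <- rep(0, window)
  mask[d] <- 1   # binary mask: weight 1 at distance d, 0 elsewhere
  sum(fcm(toks, window = window, context = "window",
          count = "weighted", weights = mask)[word, target])
}
counts_by_distance <- sapply(1:15, probe_distance)
min(which(counts_by_distance > 0))   # minimum distance (7 for "light")
max(which(counts_by_distance > 0))   # maximum distance (12 for "light")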