Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)
This is my first time asking a question here, so I hope I haven't missed anything essential. I want to run sentiment analysis on windows of text around certain keywords in speeches. My dataset is a large csv file containing many speeches, but I am only interested in the sentiment of the words surrounding certain keywords.
I was told that the quanteda package in R is probably my best bet for finding such a function, but so far I have had no luck. If anyone knows how to accomplish such a task, any help would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word Whosebug. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. Whosebug.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
pattern = "Whosebug",
window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | Whosebug | However there are
[2, 24] the very end | Whosebug |
as.data.frame(x)
  docname from to          pre  keyword              post  pattern
1       1   29 29  is the word Whosebug However there are Whosebug
2       2   24 24 the very end Whosebug                   Whosebug
Now read the help for kwic (use ?kwic in the console) to see which kinds of pattern you can use. With tokens you can specify the data cleaning to apply before calling kwic; in my example I removed the punctuation.
The end result is a data frame with a window before and after each keyword, here a window of length 3. After that you can run some form of sentiment analysis on the pre and post results (or paste them together first), for example as sketched below.
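As a minimal sketch of that last step (one possible approach, assuming the kwic object x and its pre/post columns shown above), you can paste the windows together and score them with quanteda's built-in data_dictionary_LSD2015:

kw <- as.data.frame(x)
# combine the words before and after each keyword into one context string
kw$context <- paste(kw$pre, kw$post)

# tokenize the context strings and count sentiment-dictionary hits per match
tokens(kw$context) %>%
  tokens_lookup(dictionary = data_dictionary_LSD2015) %>%
  dfm()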
I suggest using tokens_select() with its window argument set to a span of tokens around the target term.
Using your example, if "Whosebug" is the target term and you want to measure the sentiment of the +/- 10 tokens around it, this will work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
tokens_select("Whosebug", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "Whosebug" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "Whosebug" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, for example:
tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
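If you want a single score per speech from that matrix, one simple follow-up (a sketch; the column name net_sentiment is just illustrative) is to convert the dfm to a data frame and take positive minus negative counts:

sent <- tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm() %>%
  convert(to = "data.frame")

# net sentiment per document: positive hits minus negative hits
sent$net_sentiment <- sent$positive - sent$negative
sent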
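Since the title also mentions tidytext, here is a rough sketch of the same windowing idea with tidytext and dplyr; this is only one possible translation of the approach, and the names pos, kw_pos, and window_df are illustrative. Note that unnest_tokens lowercases tokens by default, so the keyword is matched as "whosebug".

library(dplyr)
library(tidytext)

# one row per token, with each token's position inside its speech
toks_df <- data %>%
  unnest_tokens(word, speechContent) %>%
  group_by(id) %>%
  mutate(pos = row_number()) %>%
  ungroup()

# positions at which the keyword occurs in each speech
kw_pos <- toks_df %>%
  filter(word == "whosebug") %>%
  select(id, kw_pos = pos)

# keep the tokens within +/- 10 positions of a keyword occurrence
window_df <- toks_df %>%
  inner_join(kw_pos, by = "id") %>%
  filter(abs(pos - kw_pos) <= 10, pos != kw_pos)

From there you could join a lexicon such as get_sentiments("bing") and count positive versus negative words per speech.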