Substituting several ngrams in quanteda
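In my text of news articles, I would like to convert several different ngrams that refer to the same political party into an acronym. I want to do this so that sentiment dictionaries do not confuse the words in a party's name (Liberal Party) with the same word in a different context (liberal helping).
I can do this below with str_replace_all, and I know about the tokens_compound() function in quanteda, but it doesn't seem to do exactly what I need.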
library(stringr)
text<-c('a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals')
text1<-str_replace_all(text, '(liberal party)|liberals', 'olp')
text2<-str_replace_all(text1, '(new democrats)|new democratic party', 'ndp')
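Should I somehow preprocess the text before turning it into a corpus? Or is there a way to do this in quanteda after it has been turned into a corpus?
Here is some expanded sample code that specifies the problem a little better: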
`text<-c('a text about some political parties called the new democratic party
the new democrats and the liberal party and the liberals. I would like the
word democratic to be counted in the dfm but not the words new democratic.
The same goes for liberal helpings but not liberal party')
partydict <- dictionary(list(
olp = c("liberal party", "liberals"),
ndp = c("new democrats", "new democratic party"),
sentiment=c('liberal', 'democratic')
))
dfm(text, dictionary=partydict)`
This example counts democratic in both the new democratic and the democratic sense, but I would like those to be counted separately.
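You want the function tokens_lookup(), applied after defining a dictionary whose keys are the canonical party labels and whose values list all the ngram variants of each party name. Setting exclusive = FALSE keeps the tokens that are not matched, so the call in effect substitutes the canonical party label for every variant.
In the example below, I have modified your input text slightly to show that party names are combined differently from phrases that use "liberal" but not "liberal party".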
library("quanteda")
text<-c('a text about some political parties called the new democratic party
which is conservative the new democrats and the liberal party and the
liberals which are liberal helping poor people')
toks <- tokens(text)
partydict <- dictionary(list(
olp = c("liberal party", "the liberals"),
ndp = c("new democrats", "new democratic party")
))
(toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE))
## tokens from 1 document.
## text1 :
## [1] "a" "text" "about" "some" "political" "parties"
## [7] "called" "the" "NDP" "which" "is" "conservative"
## [13] "the" "NDP" "and" "the" "OLP" "and"
## [19] "OLP" "which" "are" "liberal" "helping" "poor"
## [25] "people"
So this has replaced the party name variants with the party keys.
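For contrast, here is a minimal sketch of what the exclusive argument changes. The shortened input text is made up for illustration; the dictionary is the same partydict as above. With the default exclusive = TRUE only the matched keys are returned, while exclusive = FALSE keeps the unmatched tokens around them.

```r
library(quanteda)

# Shortened, made-up input for illustration only
toks <- tokens("the new democratic party the new democrats and the liberal party")

partydict <- dictionary(list(
  olp = c("liberal party", "the liberals"),
  ndp = c("new democrats", "new democratic party")
))

# Default (exclusive = TRUE): only the matched dictionary keys survive
keys_only <- tokens_lookup(toks, partydict)

# exclusive = FALSE: unmatched tokens are kept, variants become keys
substituted <- tokens_lookup(toks, partydict, exclusive = FALSE)

print(keys_only)
print(substituted)
```

The first result is useful for counting party mentions alone; the second is the one that works as a substitution step before further analysis.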
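A dfm constructed from these new tokens preserves the uses of (for example) "liberal" that might be linked to sentiment, while "liberal party" has already been combined and replaced with "OLP". Applying a dictionary to the dfm will now work for your example of "liberal" in "liberal helping" without confusing it with the "liberal" in the party name.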
sentdict <- dictionary(list(
left = c("liberal", "left"),
right = c("conservative", "")
))
dfm(toks2) %>%
dfm_lookup(dictionary = sentdict, exclusive = FALSE)
## Document-feature matrix of: 1 document, 19 features (0% sparse).
## 1 x 19 sparse Matrix of class "dfm"
## features
## docs olp ndp a text about some political parties called the which is RIGHT and LEFT are helping
## text1 2 2 1 1 1 1 1 1 1 3 2 1 1 2 1 1 1
## features
## docs poor people
## text1 1 1
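Applied to the case in the question, where "democratic" should count as a sentiment word but "new democratic party" should not, the same two steps can be sketched as follows. The input sentence is invented for illustration, and the single "sentiment" key mirrors the one in the question's dictionary; treat this as a sketch rather than a definitive recipe.

```r
library(quanteda)

# Invented sentence mixing party names with sentiment uses of the same words
txt <- "the new democratic party and the new democrats want democratic reform while the liberal party offers liberal helpings"

partydict <- dictionary(list(
  olp = c("liberal party", "liberals"),
  ndp = c("new democrats", "new democratic party")
))
sentdict <- dictionary(list(sentiment = c("liberal", "democratic")))

# Step 1: replace party name variants with their keys at the token level
toks_party <- tokens_lookup(tokens(txt), partydict, exclusive = FALSE)

# Step 2: build the dfm, then apply the sentiment dictionary to what is left
dfmat <- dfm_lookup(dfm(toks_party), dictionary = sentdict, exclusive = FALSE)

print(dfmat)
```

Because the party names are collapsed before the sentiment lookup, only the free-standing "democratic" and "liberal" contribute to the sentiment count.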
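Two additional notes:
If you do not want the keys capitalised in the replacement tokens, set capkeys = FALSE.
You can set different matching types with the valuetype argument, including valuetype = "regex". (Note that hand-written regular expressions such as the ones in your example are easy to get wrong, because the scope of the | operator depends on how the alternatives are grouped. With tokens_lookup() you won't need to worry about that!)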
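Both notes can be sketched together. The input text and the regex dictionary below are hypothetical and only illustrate the arguments. Each whitespace-separated piece of a dictionary value is matched against one token, and the ^ and $ anchors are added here as a precaution against partial matches such as "news".

```r
library(quanteda)

# Made-up text with singular, plural and full-name variants
toks <- tokens("a new democrat and the new democrats of the new democratic party like democratic reform")

# Hypothetical regex dictionary: one regular expression per token position
ndp_regex <- dictionary(list(
  ndp = c("^new$ ^democrats?$", "^new$ ^democratic$ ^party$")
))

# valuetype = "regex" switches the matching type;
# capkeys = FALSE keeps the key in lower case ("ndp" rather than "NDP")
toks_ndp <- tokens_lookup(toks, ndp_regex, valuetype = "regex",
                          exclusive = FALSE, capkeys = FALSE)

print(toks_ndp)
```

For plain phrase variants like the ones in the question, the default glob matching is usually enough, and the regex form is only worth the extra care when the variants follow a pattern.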