Joining adjacent words (tokens) in a TDM for tidy analysis
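I have documents containing strings similar to the following:
textForAnalysis <- c("non-ifrs earnings numbers are report to be...")
which I feed into a corpus
textCorpus <- Corpus(VectorSource(textForAnalysis))
and then convert to a TDM
textTDM <- TermDocumentMatrix(textCorpus)
and then convert the TDM to a tidy format for analysis
textTidy <- tidy(textTDM)
When I print textTidy, everything looks fine,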
> textTidy
# A tibble: 6 × 3
  term     document count
  <chr>    <chr>    <dbl>
1 are      1            1
2 earnings 1            1
3 ifrs     1            1
4 non      1            1
5 numbers  1            1
6 report   1            1
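except that I want to keep "non-ifrs" as a single token (word). I don't want the phrase "non-ifrs" split into "non" and "ifrs".
How can I keep adjacent words such as "non-ifrs" as a single "term" in my analysis/TDM?
A passage in the documentation for TermDocumentMatrix may be the key: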
This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp) and takes no custom functions as option arguments.
You have the assignment
textCorpus <- Corpus(VectorSource(textForAnalysis))
As you can see from class(textCorpus), that variable is an instance of SimpleCorpus.
Use VCorpus instead of Corpus:
textCorpus <- VCorpus(VectorSource(textForAnalysis))
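To see the difference between the two constructors, a minimal sketch along these lines checks the class of each corpus. The variable names simpleCorpus and vCorpus are illustrative and do not appear in the question; the sketch assumes a tm version in which Corpus() on a VectorSource returns a SimpleCorpus, as described above.

```r
library(tm)

textForAnalysis <- c("non-ifrs earnings numbers are report to be...")

# Corpus() on a VectorSource gives the SimpleCorpus discussed above
simpleCorpus <- Corpus(VectorSource(textForAnalysis))
class(simpleCorpus)

# VCorpus() builds a volatile corpus that is not a SimpleCorpus
vCorpus <- VCorpus(VectorSource(textForAnalysis))
class(vCorpus)

# inherits() is a convenient programmatic check of the same thing
inherits(simpleCorpus, "SimpleCorpus")
inherits(vCorpus, "SimpleCorpus")
```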
Now you can apply all the necessary control parameters:
textTDM <- TermDocumentMatrix(
  textCorpus,
  control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE))
)
The result is:
(textTidy <- tidy(textTDM))
# A tibble: 5 × 3
  term     document count
  <chr>    <chr>    <dbl>
1 are      1            1
2 earnings 1            1
3 non-ifrs 1            1
4 numbers  1            1
5 report   1            1
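If you would rather confirm this in a script than by reading the printed tibble, the steps above can be put together roughly as follows. This is only a sketch: it assumes the tm and tidytext packages are installed, and the variable hyphenKept is a hypothetical name introduced here for illustration.

```r
library(tm)
library(tidytext)

textForAnalysis <- c("non-ifrs earnings numbers are report to be...")

# VCorpus, so that the list-valued removePunctuation option is applied
textCorpus <- VCorpus(VectorSource(textForAnalysis))
textTDM <- TermDocumentMatrix(
  textCorpus,
  control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE))
)
textTidy <- tidy(textTDM)

# TRUE when the hyphenated term survived as a single token
hyphenKept <- "non-ifrs" %in% textTidy$term
hyphenKept
```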