Separating an Arabic sentence into words results in a different number of words with different functions

Question

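I am trying to split an Arabic sentence, verse 38:1 of the Quran, into words with the tm and tokenizers packages, but they split it into 3 and 4 words respectively. Can someone explain (1) why this happens and (2) what this difference means from an NLP and Arabic-language point of view? Also, is one of them wrong? I am by no means an expert in NLP or Arabic, but I am trying to get the code to run.

Here is the code I tried: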

library(tm)
library(tokenizers)
# Verse 38:1
verse<- "ص والقرآن ذي الذكر"

# The tm package separates the verse into 3 words
a <- colnames(DocumentTermMatrix(Corpus(VectorSource(verse) )))
a
# "الذكر"   "ذي"      "والقرآن"

# The tokenizers package separates the verse into 4 words
b <- tokenizers::tokenize_words(verse)
b
# "ص"       "والقرآن" "ذي"      "الذكر"

I expected the two results to be equal, but they differ. Can anyone explain what is going on here?

Answer 1

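It has nothing to do with NLP or Arabic; you just need to be aware of some default settings. DocumentTermMatrix has a number of default parameters that can be changed via control. Run ?termFreq to see them all.

One of those defaults is wordLengths: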

An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.

So, if we run the following, we get 3 words, because words shorter than the minimum length of 3 are dropped:

dtm <- DocumentTermMatrix(Corpus(VectorSource(verse)))
inspect(dtm)

#### OUTPUT ####

<<DocumentTermMatrix (documents: 1, terms: 3)>>
Non-/sparse entries: 3/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs الذكر ذي والقرآن
   1     1  1       1
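
To see which tokens are candidates for the length filter, the verse can be split on whitespace and each token measured. This is only a base R sketch for illustration, not how tm tokenizes internally; the names tokens and token_lengths are made up here. It reports both character and byte lengths because, in the output above, the two-character word ذي is kept while the one-character ص is dropped. One plausible reading, not confirmed here, is that the length in this code path is counted in bytes rather than characters.

```r
# Verse 38:1, converted to UTF-8 so that byte counts are comparable across platforms
verse <- enc2utf8("ص والقرآن ذي الذكر")

# Naive whitespace split (an assumption for illustration; tm uses its own tokenizer)
tokens <- strsplit(verse, "\\s+")[[1]]

# Length of every token, measured two ways
token_lengths <- data.frame(
  token = tokens,
  chars = nchar(tokens, type = "chars"),
  bytes = nchar(tokens, type = "bytes"),
  stringsAsFactors = FALSE
)
print(token_lengths)

# Tokens that a minimum length of 3 would drop under each way of counting
short_by_chars <- tokens[token_lengths$chars < 3]
short_by_bytes <- tokens[token_lengths$bytes < 3]
print(short_by_chars)
print(short_by_bytes)
```

Counting by characters flags both ص and ذي, while counting by bytes flags only ص, which is the one term missing from the matrix above.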

To return all words regardless of length, we need to change c(3, Inf) to c(1, Inf):

dtm <- DocumentTermMatrix(Corpus(VectorSource(verse)),
                          control = list(wordLengths = c(1, Inf))
                          )
inspect(dtm)

#### OUTPUT ####

<<DocumentTermMatrix (documents: 1, terms: 4)>>
Non-/sparse entries: 4/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs الذكر ذي ص والقرآن
   1     1  1 1       1
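
As a quick check that the two packages now agree, the term sets can be compared directly. This is a small sketch that reuses only calls already shown in this thread (DocumentTermMatrix, Corpus, VectorSource, colnames, tokenize_words); the variable names tm_terms and tok_terms are made up for illustration.

```r
library(tm)
library(tokenizers)

# Verse 38:1
verse <- "ص والقرآن ذي الذكر"

# Terms from tm with the minimum word length lowered to 1
tm_terms <- colnames(
  DocumentTermMatrix(Corpus(VectorSource(verse)),
                     control = list(wordLengths = c(1, Inf)))
)

# tokenize_words() returns a list with one element per input document
tok_terms <- tokenizers::tokenize_words(verse)[[1]]

# Compare as sets, because the two packages order the terms differently
same_terms <- setequal(tm_terms, tok_terms)
same_terms
```

If the outputs match those shown in this thread, same_terms should be TRUE; setdiff(tok_terms, tm_terms) would list any term that tm still drops.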

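The default makes sense because the default language is English, where words shorter than three characters are mostly articles, prepositions, and so on, but it may make less sense for other languages. Be sure to take the time to look into the other parameters related to different tokenizers, language settings, etc. The current results look fine, but as your text gets more complex you may need to adjust some settings.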
