Why does text2vec show more files than actually exist?

Question

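I am testing text2vec. A directory contains only 2 files (1.txt and 2.txt, both very small, about 20 KB each). I want to measure their similarity, but for some reason it reports 54 documents.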

> library(stringr)
>  library(NLP)
>  library(tm)
>  library(text2vec)


>  filedir="F:\\0 R\\similarity test\\corpus"
>  prep_fun = function(x) {
+     x %>% 
+     # make text lower case
+     str_to_lower %>% 
+     # remove non-alphanumeric symbols
+     str_replace_all("[^[:alnum:]]", " ") %>% 
+     # collapse multiple spaces
+     str_replace_all("\\s+", " ")
+  }
>  allfile=idir(filedir)
>  #files=list.files(path=filedir, full.names=T)
>  #allfile=ifiles(files)
>  it=itoken(allfile, preprocessor=prep_fun, progressbar=F)
>  stopwrd=stopwords("en")
>  v=create_vocabulary(it, stopwords=stopwrd)
> v
Number of docs: 54 
174 stopwords: i, me, my, myself, we, our ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
          term term_count doc_count
  1:     house          2         2
  2: 224161072          2         2
  3:  suggests          2         2
  4:   remains          2         2
  5: published          2         2
 ---                               
338:      year         14         6
339:       nep         16        12
340:      will         16        10
341:   chinese         20        12
342:     malay         20        10
>

When I exported the data to CSV, I found that the documents are named:

1.txt_1
1.txt_2
1.txt_3
1.txt_4
...

...

If I use

#files=list.files(path=filedir, full.names=T)
#allfile=ifiles(files)

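it still reports 54 documents.

It also computes similarity measures between them, and most of them are 0.

Please let me know whether this is the expected behavior, or what is actually going on.

All I want is a single similarity measure between 1.txt and 2.txt, and an output matrix that contains only the measure for these two files.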

Answer 1

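text2vec treats each line in each file as a separate document. In your case, I suggest passing a different reader function to idir/ifiles. The reader should simply read the whole file and collapse its lines into a single string (for example, reader = function (x) paste(readLines(x), collapse=' ')).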
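
A minimal sketch of that suggestion follows. The helper name read_whole_file, the demo directory, and the two throwaway files are hypothetical and only stand in for 1.txt and 2.txt. It also assumes that text2vec takes document ids from the names of the vector the reader returns, and that the same iterator can be reused for create_vocabulary and create_dtm, which may not hold in older versions. The text2vec part runs only if the package is installed.

```r
# Hypothetical helper: read a whole file and collapse its lines into one
# string, so that the file counts as a single document.
read_whole_file <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = " ")
  # Name the element after the file so the document id is the file name
  # (assumption: text2vec uses the names of the reader's result as ids).
  setNames(txt, basename(path))
}

# Two small throwaway files standing in for 1.txt and 2.txt.
demo_dir <- file.path(tempdir(), "text2vec_reader_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("the first line", "the second line", "a third line"),
           file.path(demo_dir, "1.txt"))
writeLines(c("another first line", "another second line"),
           file.path(demo_dir, "2.txt"))
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# readLines() yields one element per line; the helper yields one per file.
n_line_docs <- sum(lengths(lapply(files, readLines)))
n_file_docs <- sum(lengths(lapply(files, read_whole_file)))

if (requireNamespace("text2vec", quietly = TRUE)) {
  library(text2vec)
  it <- itoken(ifiles(files, reader = read_whole_file), progressbar = FALSE)
  v <- create_vocabulary(it)
  dtm <- create_dtm(it, vocab_vectorizer(v))
  # Cosine similarity matrix with one row and one column per file.
  print(sim2(dtm, method = "cosine", norm = "l2"))
}
```

With a reader like this, the vocabulary should report 2 documents instead of one per line, and the similarity matrix should be 2 x 2. The preprocessor and stopwords from the question can be passed to itoken and create_vocabulary exactly as before.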

r

similarity

text2vec