How to get a list of all tokens from Lucene 8.6.1 index using PyLucene?

I got some guidance from here. I first build an index as follows.

import os
import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriterConfig, IndexWriter, DirectoryReader
from org.apache.lucene.store import SimpleFSDirectory
from java.nio.file import Paths
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.util import BytesRefIterator

index_path = "./index"

lucene.initVM()

analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
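# Append only if the directory already holds files; os.listdir() raises if the path does not exist.
if os.path.isdir(index_path) and os.listdir(index_path):
    config.setOpenMode(IndexWriterConfig.OpenMode.APPEND)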

store = SimpleFSDirectory(Paths.get(index_path))
writer = IndexWriter(store, config)

doc = Document()
doc.add(Field("docid", "1",  TextField.TYPE_STORED))
doc.add(Field("title", "qwe rty", TextField.TYPE_STORED))
doc.add(Field("description", "uio pas", TextField.TYPE_STORED))
writer.addDocument(doc)

writer.close()
store.close()

Then I try to get all the terms in the index for one field, as follows.

store = SimpleFSDirectory(Paths.get(index_path))
reader = DirectoryReader.open(store)

Attempt 1: I tried calling next(), as used in the example I was following. It appears to be a method of BytesRefIterator, which TermsEnum implements.

for lrc in reader.leaves():
    terms = lrc.reader().terms('title')
    terms_enum = terms.iterator()
    while terms_enum.next():
        term = terms_enum.term()
        print(term.utf8ToString())

However, I can't seem to access that next() method.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-47-6515079843a0> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while terms_enum.next():
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

AttributeError: 'TermsEnum' object has no attribute 'next'

Attempt 2: I tried changing the while loop as suggested in the comments on a related question.

while next(terms_enum):
    term = terms_enum.term()
    print(term.utf8ToString())

However, Python does not seem to treat TermsEnum as an iterator.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-48-d490ad78fb1c> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while next(terms_enum):
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

TypeError: 'TermsEnum' object is not an iterator
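
The TypeError above comes from Python itself: the built-in next() only accepts objects that implement __next__, regardless of what methods the wrapped Java class declares. The sketch below reproduces the same message without PyLucene. JavaStyleEnum and PyIterableEnum are hypothetical stand-ins written for illustration only; they are not PyLucene classes and say nothing about how the real wrapper is built.

```python
class JavaStyleEnum:
    """Hypothetical stand-in for a Java-style enumeration (not a PyLucene class).

    It has a method called next() that returns None when exhausted,
    but it does not implement Python's iterator protocol.
    """

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def next(self):
        if self._pos >= len(self._items):
            return None
        item = self._items[self._pos]
        self._pos += 1
        return item


class PyIterableEnum(JavaStyleEnum):
    """Same data, plus __iter__/__next__, so for-loops and next() accept it."""

    def __iter__(self):
        return self

    def __next__(self):
        item = self.next()
        if item is None:
            raise StopIteration
        return item


# The built-in next() rejects an object that lacks __next__.
try:
    next(JavaStyleEnum(["qwe", "rty"]))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Once the protocol is implemented, ordinary iteration works.
terms = list(PyIterableEnum(["qwe", "rty"]))
```

So the object being looped over has to present itself to Python as an iterator; having a Java-side next() is not enough on its own.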

I know my question can be answered as suggested in this question. So I guess my question really is: how do I get all the terms in a TermsEnum?

I found that the following works. It comes from here and from test_FieldEnumeration() in the test_Pylucene.py file, which is located in pylucene-8.6.1/test3/.

for term in BytesRefIterator.cast_(terms_enum):
    print(term.utf8ToString())
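
For completeness, here is a minimal sketch of how that loop might be wrapped into a helper that walks every segment of the index. It uses only the calls already shown in this post (leaves(), reader(), terms(), iterator(), BytesRefIterator.cast_, utf8ToString()). The function name collect_terms and its cast parameter are my own, hypothetical additions; cast exists only so the helper can be tried with plain Python objects, and it defaults to BytesRefIterator.cast_. I am also assuming terms() returns None for a segment that has no terms in the field, which mirrors the null return of the Java API.

```python
def collect_terms(reader, field, cast=None):
    """Return the sorted, de-duplicated terms of `field` across all segments.

    reader: an open DirectoryReader (as created earlier in this post).
    cast:   hypothetical hook; defaults to BytesRefIterator.cast_.
    """
    if cast is None:
        # Resolved here so the PyLucene import is only needed for real readers.
        from org.apache.lucene.util import BytesRefIterator
        cast = BytesRefIterator.cast_

    seen = set()
    for leaf in reader.leaves():            # one context per index segment
        terms = leaf.reader().terms(field)
        if terms is None:                   # assumed: no terms for this field here
            continue
        for term in cast(terms.iterator()):
            # Convert right away instead of keeping the term object around.
            seen.add(term.utf8ToString())
    return sorted(seen)
```

With the reader opened above, this would be called as collect_terms(reader, "title"). Using a set means a term that appears in several segments (for example after re-running the indexing code in APPEND mode) is reported once.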

Happy to accept an answer that gives more explanation than this.