raise _FirstSetError in spaCy topic modeling

Question

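I want to build an LDA topic model and am following a tutorial that uses spaCy to do so. When I try to use spaCy I get an error that I cannot find anything about on Google, so I am hoping someone here knows what it means.

I am running this code on Anaconda: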

import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint
# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt

df = pd.DataFrame(data)

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; the tokenization itself drops punctuation
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
print(data_words[:1])

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])

I get the following error:

File "C:\Users\maart\AppData\Local\Continuum\anaconda3\lib\site-packages\_regex_core.py", line 1880, in get_firstset
raise _FirstSetError()

_FirstSetError

The error must be raised somewhere around the lemmatization step, since the rest of the code works fine.

Thanks a lot!

Answer 1

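I had the same problem and was able to fix it by uninstalling the regex package (I had the wrong version installed) and then running python -m spacy download en again. That reinstalled the correct version of regex.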
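
Before reinstalling anything, it can help to see which versions are actually present in the active environment. The following is only a minimal sketch: it relies on the standard library's importlib.metadata (Python 3.8 or newer), the helper name installed_versions is made up for illustration, and it does not know which regex version a given spaCy release expects, so the output still has to be compared against spaCy's own requirements.

```python
from importlib import metadata


def installed_versions(packages=("regex", "spacy")):
    """Return a dict mapping each distribution name to its installed
    version string, or None if the distribution is not installed.

    Hypothetical helper for illustration; not part of spaCy or regex.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```

If regex shows up with a version that spaCy's requirements do not list, that would be consistent with the mismatch described above.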

lda

spacy