Getting wrong output in topic modelling
I tried topic modelling in Python, but it gives the wrong output. The sample documents and code are below.
## Documents
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."
# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = "".join([i for i in doc.lower() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
doc_clean = [clean(doc) for doc in doc_complete]
#Preparing Document Term Matrix
import gensim
from gensim import corpora
dictionary = corpora.Dictionary([doc_clean])
corpus = [dictionary.doc2bow(doc) for doc in [doc_clean]]
#Running LDA Model
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(corpus, num_topics=3, id2word = dictionary, passes=50)
print(ldamodel.print_topics(num_topics=3, num_words=3))
The output I get looks like this:
[(0, u'0.200*cr ugge h rvng cue ncree re n bl preure + 0.200*fher pen l f e rvng er run nce prcce + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer'), (1, u'0.200*helh exper h ugr n g fr ur lfele + 0.200*cr ugge h rvng cue ncree re n bl preure + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer'), (2, u'0.200*fher pen l f e rvng er run nce prcce + 0.200*ugr b cnue er lke hve ugr bu n fher + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer')]
I would like to know what I am missing. Thanks.
The problem occurs because you do not tokenize your documents before removing stopwords. Instead, you iterate over every single character of the string and remove the characters that happen to be stopwords, e.g. "a" and "i":
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> stop
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']
>>> doc = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> "".join([i for i in doc.lower() if i not in stop])
'ugr b cnue. er lke hve ugr, bu n fher.'
You should do the stopword removal like this instead:
>>> from nltk import word_tokenize
>>> doc = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
'sugar bad consume . sister likes sugar , father .'
See Stopword removal with NLTK.
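Putting that together, here is a minimal rewrite of your clean() function (a sketch, not the only way to do it): tokenize first, then drop stopword and punctuation tokens, then lemmatize, and return token lists so they can be fed straight to corpora.Dictionary:
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> import string
>>> stop = set(stopwords.words('english'))
>>> exclude = set(string.punctuation)
>>> lemma = WordNetLemmatizer()
>>> def clean(doc):
...     # Tokenize first, then filter out whole stopword/punctuation tokens, then lemmatize.
...     tokens = [t for t in word_tokenize(doc.lower()) if t not in stop and t not in exclude]
...     return [lemma.lemmatize(t) for t in tokens]
...
>>> clean("Sugar is bad to consume. My sister likes to have sugar, but not my father.")
['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father']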
Also note that dictionary = corpora.Dictionary([doc_clean]) builds the vocabulary over your whole cleaned documents as single tokens (and doc2bow is applied the same way), which is why each "word" in your printed topics is an entire document. In fact, your whole preprocessing pipeline can be simplified:
>>> import gensim
>>> doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> doc2 = "My father spends a lot of time driving my sister around to dance practice."
>>> doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
>>> doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
>>> doc5 = "Health experts say that Sugar is not good for your lifestyle."
>>> documents = [doc1, doc2, doc3, doc4, doc5]
>>> texts = map(gensim.utils.lemmatize,documents)
>>> texts
[['sugar/NN', 'be/VB', 'bad/JJ', 'consume/VB', 'sister/NN', 'like/VB', 'have/VB', 'sugar/NN', 'not/RB', 'father/NN'], ['father/NN', 'spend/VB', 'lot/NN', 'time/NN', 'drive/VB', 'sister/NN', 'dance/VB', 'practice/NN'], ['doctor/NN', 'suggest/VB', 'drive/VB', 'cause/VB', 'increased/JJ', 'stress/NN', 'blood/NN', 'pressure/NN'], ['sometimes/RB', 'feel/JJ', 'pressure/NN', 'perform/VB', 'well/RB', 'school/NN', 'father/NN', 'never/RB', 'seem/VB', 'drive/VB', 'sister/NN', 'do/VB', 'better/JJ'], ['health/NN', 'expert/NN', 'say/VB', 'sugar/NN', 'be/VB', 'not/RB', 'good/JJ', 'lifestyle/NN']]
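One caveat about the transcript above: it assumes Python 2, where map returns a list. On Python 3, map is lazy, and gensim.utils.lemmatize depends on the optional pattern package (and, as far as I know, was removed in gensim 4.x), so on a newer setup you would materialize the list explicitly:
>>> texts = [gensim.utils.lemmatize(doc) for doc in documents]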
Then train the topic model:
>>> dictionary = gensim.corpora.Dictionary(texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in texts]
>>> Lda = gensim.models.ldamodel.LdaModel
>>> ldamodel = Lda(corpus, num_topics=3, id2word = dictionary, passes=50)
>>> ldamodel.print_topics()
[(0, u'0.067*drive/VB + 0.067*pressure/NN + 0.067*stress/NN + 0.067*blood/NN + 0.067*doctor/NN + 0.067*increased/JJ + 0.067*cause/VB + 0.067*suggest/VB + 0.017*sister/NN + 0.017*father/NN'), (1, u'0.078*sugar/NN + 0.054*not/RB + 0.054*be/VB + 0.054*father/NN + 0.054*sister/NN + 0.031*do/VB + 0.031*seem/VB + 0.031*school/NN + 0.031*well/RB + 0.031*better/JJ'), (2, u'0.067*drive/VB + 0.067*sister/NN + 0.067*father/NN + 0.067*lot/NN + 0.067*practice/NN + 0.067*dance/VB + 0.067*spend/VB + 0.067*time/NN + 0.017*pressure/NN + 0.017*expert/NN')]
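To sanity-check the result, you can also ask the trained model for the topic mixture of each document (a quick sketch; topic IDs and weights will vary between runs because LDA training is stochastic):
>>> for bow in corpus:
...     print(ldamodel.get_document_topics(bow))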