Classify a sentence into multiple categories
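I'm a beginner with NLTK and Scikit and still learning. I want to be able to classify a given sentence (or even a paragraph) into one of a set of categories. By categories I don't mean just two classes, such as spam vs. not spam or positive vs. negative sentiment; I mean there are multiple (more than two) categories to choose from. Please help me pick the simplest algorithm for this problem. Thanks in advance.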
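Judging by the tags on your post, you already know about machine learning, which is a good way to approach this project.
What you need is a decent amount of example data: a table of texts (example sentences, paragraphs, etc.) with a second column stating the category of each one.
You then train a program to look for patterns in the example texts. With enough example data, the program can analyze a new text and output which category it belongs to.
You can use TensorFlow as your machine learning framework.
I'd suggest starting with some simpler projects first, to get a feel for how machine learning works and where it works best.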
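Since the question mentions Scikit, the train-then-predict workflow described above could be sketched with scikit-learn instead of TensorFlow. This is only a minimal illustration under assumptions: the example sentences and the three category names are invented, and TF-IDF features with multinomial Naive Bayes are one simple choice among several, not something the answer above prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: one text and one category per row.
train_texts = [
    "The team won the football match",
    "The striker scored two goals in the game",
    "The coach praised the players after the match",
    "The president signed the new law",
    "Parliament voted on the election bill",
    "The minister announced a new government policy",
    "The new phone has a faster processor",
    "The software update fixes a security bug",
    "The laptop ships with more memory and a faster processor",
]
train_labels = [
    "sports", "sports", "sports",
    "politics", "politics", "politics",
    "tech", "tech", "tech",
]

# Turn each text into TF-IDF word weights, then fit a multinomial
# Naive Bayes classifier, which handles more than two classes natively.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Classify texts the model has not seen before.
new_texts = [
    "The players scored in the football match",
    "Parliament voted on a new election law",
    "The software update makes the phone faster",
]
predictions = list(model.predict(new_texts))
for text, label in zip(new_texts, predictions):
    print(f"{label}: {text}")
```

With real data you would need far more examples per category, and you would hold some of them back to measure accuracy before trusting the predictions.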
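If I understand correctly, you are trying to perform topic modeling on your dataset.
As far as I know, you can use LDA (Latent Dirichlet Allocation), but you have to specify the number of topics yourself. You can run a few experiments to find a suitable value.
Here is an LDA example in Python that shows how to inspect a model fitted to a subset of the Reuters news dataset. The input X below is a document-term matrix.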
>>> import numpy as np
>>> import lda
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X) # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_ # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: british churchill sale million major letters west
Topic 1: church government political country state people party
Topic 2: elvis king fans presley life concert young
Topic 3: yeltsin russian russia president kremlin moscow michael
Topic 4: pope vatican paul john surgery hospital pontiff
Topic 5: family funeral police miami versace cunanan city
Topic 6: simpson former years court president wife south
Topic 7: order mother successor election nuns church nirmala
Topic 8: charles prince diana royal king queen parker
Topic 9: film french france against bardot paris poster
Topic 10: germany german war nazi letter christian book
Topic 11: east peace prize award timor quebec belo
Topic 12: n't life show told very love television
Topic 13: years year time last church world people
Topic 14: mother teresa heart calcutta charity nun hospital
Topic 15: city salonika capital buddhist cultural vietnam byzantine
Topic 16: music tour opera singer israel people film
Topic 17: church catholic bernardin cardinal bishop wright death
Topic 18: harriman clinton u.s ambassador paris president churchill
Topic 19: city museum art exhibition century million churches
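The suggestion to run a few experiments to find a suitable number of topics could look roughly like the sketch below. It is an assumption-laden illustration: it uses scikit-learn's LatentDirichletAllocation rather than the lda package shown above, the toy corpus and the candidate topic counts are invented, and held-out perplexity is only one of several possible selection criteria.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
train_docs = [
    "the team scored a late goal to win the match",
    "the coach changed the team before the match",
    "the striker scored another goal this season",
    "the government passed a new tax law",
    "parliament debated the law before the election",
    "the government called an early election",
    "the museum opened a new art exhibition",
    "the exhibition shows art from the last century",
]
heldout_docs = [
    "the team scored a goal in the match",
    "the government passed a new law",
    "the museum shows art",
]

# Build the document-term matrix; held-out texts reuse the training vocabulary.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_heldout = vectorizer.transform(heldout_docs)

# Fit one model per candidate topic count and record held-out perplexity
# (lower is better).
candidate_topic_counts = [2, 3, 4]
perplexities = {}
for k in candidate_topic_counts:
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(X_train)
    perplexities[k] = model.perplexity(X_heldout)

best_k = min(perplexities, key=perplexities.get)
for k, score in perplexities.items():
    print(f"n_components={k}: held-out perplexity={score:.1f}")
print("lowest held-out perplexity at n_components =", best_k)
```

On a corpus this small the scores are not meaningful; the point is the loop structure. In practice it also helps to read the top words of each topic, as in the Reuters example, and check that they make sense.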