How to identify Wikipedia categories in Python
I am currently using pywikibot to get the categories of a given Wikipedia page (e.g., support-vector machine), as shown below.
import pywikibot as pw
print([i.title() for i in list(pw.Page(pw.Site('en'), 'support-vector machine').categories())])
The result I get is:
[
'Category:All articles with specifically marked weasel-worded phrases',
'Category:All articles with unsourced statements',
'Category:Articles with specifically marked weasel-worded phrases from May 2018',
'Category:Articles with unsourced statements from June 2013',
'Category:Articles with unsourced statements from March 2017',
'Category:Articles with unsourced statements from March 2018',
'Category:CS1 maint: Uses editors parameter',
'Category:Classification algorithms',
'Category:Statistical classification',
'Category:Support vector machines',
'Category:Wikipedia articles needing clarification from November 2017',
'Category:Wikipedia articles with BNF identifiers',
'Category:Wikipedia articles with GND identifiers',
'Category:Wikipedia articles with LCCN identifiers'
]
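As you can see, the result includes many of Wikipedia's tracking and maintenance categories, for example:
- Category:All articles with specifically marked weasel-worded phrases
- Category:All articles with unsourced statements
- Category:CS1 maint: Uses editors parameter
- etc.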
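However, the only categories I am interested in are:
- Category:Classification algorithms
- Category:Statistical classification
- Category:Support vector machines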
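I would like to know whether there is a way to get all tracking or maintenance Wikipedia categories, so that I can remove them from the result and keep only the informative categories.
Alternatively, if there is any other way to eliminate them from the result, please suggest it.
I am happy to provide more details if needed.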
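pywikibot does not currently expose some of the API features for filtering hidden categories. You can do this manually by checking for the hidden key in each category's categoryinfo: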
import pywikibot as pw
site = pw.Site('en', 'wikipedia')
print([
cat.title()
for cat in pw.Page(site, 'support-vector machine').categories()
if 'hidden' not in cat.categoryinfo
])
This gives:
['Category:Classification algorithms',
'Category:Statistical classification',
'Category:Support vector machines']
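If you need this filter in more than one place, it can be wrapped in a small helper. The following is only a sketch: `visible_category_titles` and `FakeCategory` are hypothetical names introduced here, and the stand-in class assumes nothing about pywikibot beyond the `title()` method and the `categoryinfo` mapping used above. The stand-in lets you try the logic without network access.

```python
from typing import Any, Iterable, List


def visible_category_titles(categories: Iterable[Any]) -> List[str]:
    """Return titles of categories whose categoryinfo has no 'hidden' key.

    Each item only needs a title() method and a categoryinfo mapping,
    which is what the answer's snippet relies on.
    """
    return [cat.title() for cat in categories
            if 'hidden' not in cat.categoryinfo]


class FakeCategory:
    """Hypothetical offline stand-in for a pywikibot category object."""

    def __init__(self, title: str, info: dict) -> None:
        self._title = title
        self.categoryinfo = info

    def title(self) -> str:
        return self._title


if __name__ == '__main__':
    # Assumed shape: hidden categories carry a 'hidden' key in categoryinfo.
    sample = [
        FakeCategory('Category:All articles with unsourced statements',
                     {'hidden': ''}),
        FakeCategory('Category:Classification algorithms', {}),
        FakeCategory('Category:Support vector machines', {}),
    ]
    print(visible_category_titles(sample))
```

With pywikibot itself, you would pass `pw.Page(site, 'support-vector machine').categories()` as the argument.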
For more information, see https://www.mediawiki.org/wiki/Help:Categories#Hidden_categories and https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories.
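As an alternative that bypasses pywikibot, the MediaWiki query API can filter hidden categories on the server through the `clshow` parameter of `prop=categories`. The sketch below uses only the standard library. The function names and the User-Agent string are hypothetical, it assumes the `formatversion=2` response layout, and it does not follow continuation, so treat it as a starting point rather than a complete client.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = 'https://en.wikipedia.org/w/api.php'


def build_params(title: str) -> dict:
    """Build query parameters asking only for non-hidden categories."""
    return {
        'action': 'query',
        'prop': 'categories',
        'titles': title,
        'clshow': '!hidden',   # exclude hidden categories on the server
        'cllimit': 'max',
        'redirects': '1',      # resolve redirects to the target page
        'format': 'json',
        'formatversion': '2',
    }


def extract_titles(response: dict) -> list:
    """Collect category titles from a formatversion=2 query response."""
    pages = response.get('query', {}).get('pages', [])
    return [cat['title']
            for page in pages
            for cat in page.get('categories', [])]


def fetch_visible_categories(title: str) -> list:
    """Call the API once (no continuation) and return category titles."""
    url = API_URL + '?' + urlencode(build_params(title))
    # Hypothetical User-Agent; replace it with your own contact details.
    request = Request(url, headers={'User-Agent': 'category-example/0.1'})
    with urlopen(request) as resp:
        return extract_titles(json.load(resp))


if __name__ == '__main__':
    print(fetch_visible_categories('support-vector machine'))
```

Filtering on the server avoids the extra categoryinfo lookup for every category, at the cost of handling the HTTP request and response format yourself.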