How to identify Wikipedia categories in Python

I am currently using pywikibot to get the categories of a given Wikipedia page (e.g., support-vector machine), as follows.

import pywikibot as pw

print([i.title() for i in list(pw.Page(pw.Site('en'), 'support-vector machine').categories())])

The results I get are:

[
  'Category:All articles with specifically marked weasel-worded phrases',
  'Category:All articles with unsourced statements',
  'Category:Articles with specifically marked weasel-worded phrases from May 2018',
  'Category:Articles with unsourced statements from June 2013',
  'Category:Articles with unsourced statements from March 2017',
  'Category:Articles with unsourced statements from March 2018',
  'Category:CS1 maint: Uses editors parameter',
  'Category:Classification algorithms',
  'Category:Statistical classification',
  'Category:Support vector machines',
  'Category:Wikipedia articles needing clarification from November 2017',
  'Category:Wikipedia articles with BNF identifiers',
  'Category:Wikipedia articles with GND identifiers',
  'Category:Wikipedia articles with LCCN identifiers'
]

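As you can see, the results include many of Wikipedia's tracking and maintenance categories, for example 'Category:All articles with unsourced statements'.

However, the only categories I am interested in are the informational ones, such as 'Category:Support vector machines'.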

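I would like to know whether there is a way to get all of Wikipedia's tracking or maintenance categories, so that I can remove them from the results and keep only the informational categories.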

Alternatively, if there is any other way to eliminate them from the results, please suggest it.

I am happy to provide more details if needed.

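pywikibot does not currently expose the API feature for filtering out hidden categories. You can do this manually by checking for the hidden key in each category's categoryinfo:
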
import pywikibot as pw

site = pw.Site('en', 'wikipedia')
print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])

This gives:

['Category:Classification algorithms', 
 'Category:Statistical classification', 
 'Category:Support vector machines']
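
If the filter is needed in more than one place, the comprehension above can be wrapped in a small helper. This is only a minimal sketch: `informational_titles` and `page_categories` are hypothetical names, not part of pywikibot, and the pywikibot import is deferred so the filtering logic can be exercised without a configured bot.

```python
def informational_titles(categories):
    """Return the titles of categories whose categoryinfo has no 'hidden' key.

    `categories` is any iterable of objects exposing `.categoryinfo` (a dict)
    and `.title()`, as pywikibot Category objects do.
    """
    return [cat.title() for cat in categories if 'hidden' not in cat.categoryinfo]


def page_categories(title, lang='en', family='wikipedia'):
    """Fetch the non-hidden categories of a page (needs network access)."""
    # Imported lazily; assumes pywikibot is installed and configured.
    import pywikibot as pw

    site = pw.Site(lang, family)
    return informational_titles(pw.Page(site, title).categories())
```

Calling `page_categories('support-vector machine')` should reproduce the list above, assuming the article's categories have not changed since.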

For details, see https://www.mediawiki.org/wiki/Help:Categories#Hidden_categories and https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories
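
As a possible alternative to checking each category client-side, the MediaWiki action API's `prop=categories` module accepts `clshow=!hidden`, which asks the server to leave hidden categories out of the response. The sketch below uses only the standard library and bypasses pywikibot entirely. The function names and the User-Agent string are illustrative placeholders, and continuation (pages with more categories than one request returns) is not handled.

```python
import json
import urllib.parse
import urllib.request

API_URL = 'https://en.wikipedia.org/w/api.php'


def build_params(title):
    """Query parameters requesting only the non-hidden categories of one page."""
    return {
        'action': 'query',
        'prop': 'categories',
        'titles': title,
        'clshow': '!hidden',  # omit hidden (tracking/maintenance) categories
        'cllimit': 'max',
        'redirects': 1,       # follow redirects to the target article
        'format': 'json',
        'formatversion': 2,   # 'pages' is returned as a list
    }


def extract_titles(response):
    """Collect category titles from a decoded formatversion=2 response."""
    return [
        cat['title']
        for page in response.get('query', {}).get('pages', [])
        for cat in page.get('categories', [])
    ]


def fetch_visible_categories(title):
    """Send the request and return category titles (needs network access)."""
    url = API_URL + '?' + urllib.parse.urlencode(build_params(title))
    # Placeholder User-Agent; replace it with one that identifies your tool.
    request = urllib.request.Request(url, headers={'User-Agent': 'category-example/0.1'})
    with urllib.request.urlopen(request) as reply:
        return extract_titles(json.load(reply))
```

With this approach the filtering happens in a single request, whereas the pywikibot version reads categoryinfo for every category on the page.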