Interested in searching the Wikipedia XML dump for only medically related terms

I want to define medical terms automatically. However, standard medical dictionaries and WordNet are not enough, so I downloaded the Wikipedia corpus to use instead. When I downloaded enwiki-latest-pages-articles.xml (which, incidentally, begins with the word "anarchism"; why not something like "AA"?), grep immediately failed on it because of the file's size, and I began looking online. I found what I thought were already-written libraries for this, like Perl's MediaWiki::DumpFile (I do know some Perl, but I would prefer Python because that's what my script is written in), but it looks like most of them create or require some kind of database. I just want to match a word (albeit fuzzily) and grab the first few sentences of its introductory paragraph; e.g., a search for 'salmonella' would return:

Salmonella /ˌsælməˈnɛlə/ is a genus of rod-shaped (bacillus) bacteria of the Enterobacteriaceae family. There are only two species of Salmonella, Salmonella bongori and Salmonella enterica, of which there are around six subspecies and innumerable serovars. The genus Escherichia, which includes the species E.coli belongs to the same family. Salmonellae are found worldwide in both cold-blooded and warm-blooded animals, and in the environment. They cause illnesses such as typhoid fever, paratyphoid fever, and food poisoning.

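For my purposes (just using it as a kind of glossary), are these scripts what I want? (I find the documentation hard to understand without examples.) For example, I would like to: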

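  1. Remove everything that is not medicine-related, just to reduce the amount of material to search. (I tried using category filters, since Wikipedia allows exporting specific categories, but they didn't work the way I wanted; for example, 'Medicine' returns only about 20 pages, so I would prefer to process the XML file somehow.)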

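  2. Allow my Python script to search the Wikipedia corpus quickly. For example, if I want to match CHOLERAE, I would like it to take me to Vibrio cholerae, just as the actual Wikipedia search function does (simply taking me to the top choice). I have written a kind of search engine that can do this, but it would be slow on such a large file (40 GB).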

Apologies in advance for what is probably a very naive question.

Here is a way to query the Wikipedia database without downloading the whole thing.

import requests
import argparse

parser = argparse.ArgumentParser(description='Fetch wikipedia extracts.')
parser.add_argument('word', help='word to define')
args = parser.parse_args()

proxies = {
    # See http://www.mediawiki.org/wiki/API:Main_page#API_etiquette
    # "http": "http://localhost:3128",
}

headers = {
    # http://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client
    "User-Agent": "Definitions/1.0 (Contact rob@example.com for info.)"
}

params = {
    'action':'query',
    'prop':'extracts',
    'format':'json',
    'exintro':1,
    'explaintext':1,
    'generator':'search',
    'gsrsearch':args.word,
    'gsrlimit':1,
    'continue':''
}

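r = requests.get('https://en.wikipedia.org/w/api.php',
                 params=params,
                 headers=headers,
                 proxies=proxies)
r.raise_for_status()  # stop early on an HTTP error
data = r.json()       # named "data" so it does not shadow the json module name
if "query" in data:
    # gsrlimit=1 means "pages" holds a single entry, keyed by page ID
    page = next(iter(data["query"]["pages"].values()))
    print(page.get("extract", "No definition."))
else:
    print("No definition.")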
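If the lookup needs to be called from another Python script rather than from the command line, the same request can be folded into a function. The following is only a sketch under a few assumptions: the names `build_params`, `first_extract`, `define`, and `SAMPLE_RESPONSE` are made up for illustration, the sample response (including its page ID) is invented to mirror the shape the script above reads, and `exsentences` is the TextExtracts parameter that limits the extract to a number of sentences.

```python
API_URL = 'https://en.wikipedia.org/w/api.php'
HEADERS = {'User-Agent': 'Definitions/1.0 (Contact rob@example.com for info.)'}


def build_params(word, sentences=None):
    """Build the query used by the script above, optionally capped to N sentences."""
    params = {
        'action': 'query',
        'prop': 'extracts',
        'format': 'json',
        'exintro': 1,
        'explaintext': 1,
        'generator': 'search',
        'gsrsearch': word,
        'gsrlimit': 1,
        'continue': '',
    }
    if sentences is not None:
        params['exsentences'] = sentences  # assumed useful for "first few sentences"
    return params


def first_extract(data):
    """Return the extract of the first page in an API response, or None."""
    pages = data.get('query', {}).get('pages', {})
    for page in pages.values():
        extract = page.get('extract')
        if extract:
            return extract
    return None


def define(word, sentences=None):
    """Fetch a definition over the network; returns None when nothing matches."""
    import requests  # third-party; imported here so the helpers above work without it
    r = requests.get(API_URL, params=build_params(word, sentences),
                     headers=HEADERS, timeout=10)
    r.raise_for_status()
    return first_extract(r.json())


# Invented response with the same shape the script reads (the page ID is made up).
SAMPLE_RESPONSE = {
    'query': {'pages': {'12345': {
        'title': 'Vibrio cholerae',
        'extract': 'Vibrio cholerae is a Gram-negative, comma-shaped bacterium.',
    }}}
}

if __name__ == '__main__':
    print(first_extract(SAMPLE_RESPONSE))
```

Keeping the parsing in `first_extract` separate from the network call makes it possible to check the parsing against a saved response without contacting the API each time.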

Here are some results from running the script as define.py. Note that it still returns a result even when the word is misspelled.

$ python define.py CHOLERAE
Vibrio cholerae is a Gram-negative, comma-shaped bacterium. Some strains of V. cholerae cause the disease cholera. V. cholerae is a facultative anaerobic organism and has a flagellum at one cell pole. V. cholerae was first isolated as the cause of cholera by Italian anatomist Filippo Pacini in 1854, but his discovery was not widely known until Robert Koch, working independently 30 years later, publicized the knowledge and the means of fighting the disease.
$ python define.py salmonella
Salmonella /ˌsælməˈnɛlə/ is a genus of rod-shaped (bacillus) bacteria of the Enterobacteriaceae family. There are only two species of Salmonella, Salmonella bongori and Salmonella enterica, of which there are around six subspecies and innumerable serovars. The genus Escherichia, which includes the species E.coli belongs to the same family.
Salmonellae are found worldwide in both cold-blooded and warm-blooded animals, and in the environment. They cause illnesses such as typhoid fever, paratyphoid fever, and food poisoning.
$ python define.py salmanela
Salmonella /ˌsælməˈnɛlə/ is a genus of rod-shaped (bacillus) bacteria of the Enterobacteriaceae family. There are only two species of Salmonella, Salmonella bongori and Salmonella enterica, of which there are around six subspecies and innumerable serovars. The genus Escherichia, which includes the species E.coli belongs to the same family.
Salmonellae are found worldwide in both cold-blooded and warm-blooded animals, and in the environment. They cause illnesses such as typhoid fever, paratyphoid fever, and food poisoning.
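If processing the dump offline is still preferred, as in the first point of the question, the XML can be streamed one page at a time instead of being loaded whole. The sketch below is a rough illustration, not a tested pipeline for the full 40 GB file: `iter_pages`, `mentions_category`, and `SAMPLE` are hypothetical names, the sample XML is a tiny stand-in for the real dump, and matching keywords against `[[Category:...]]` lines is only a crude heuristic for "medically related".

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that the export schema puts on tags."""
    return tag.rsplit('}', 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # first event is the start of the root element
    title, text = None, ''
    for event, elem in context:
        if event != 'end':
            continue
        name = local_name(elem.tag)
        if name == 'title':
            title = elem.text
        elif name == 'text':
            text = elem.text or ''
        elif name == 'page':
            yield title, text
            title, text = None, ''
            root.clear()  # drop finished pages so memory use stays flat


def mentions_category(wikitext, keywords):
    """True if any [[Category:...]] line contains one of the lowercase keywords."""
    for line in wikitext.splitlines():
        lower = line.strip().lower()
        if lower.startswith('[[category:') and any(k in lower for k in keywords):
            return True
    return False


# Tiny stand-in for the dump; the real file has many more fields per page.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Salmonella</title><revision><text>Salmonella is a genus.
[[Category:Bacteria]]</text></revision></page>
  <page><title>Anarchism</title><revision><text>Anarchism is a philosophy.
[[Category:Political philosophy]]</text></revision></page>
</mediawiki>"""

if __name__ == '__main__':
    hits = [title for title, text in iter_pages(io.BytesIO(SAMPLE))
            if mentions_category(text, ['bacteria'])]
    print(hits)  # ['Salmonella']
```

For the real dump, `iter_pages` would be given the opened file (for example `open('enwiki-latest-pages-articles.xml', 'rb')`), and the matching titles could be written to a much smaller file for later lookups.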