How to get Wikipedia data of WikiProjects?
I recently found out that Wikipedia has WikiProjects, which are categorized by discipline (https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline). As the link shows, there are 34 disciplines, and I would like to know whether it is possible to get all the Wikipedia articles associated with each of these disciplines.
For example, consider WikiProject Computer science. Is it possible to get all the computer-science-related Wikipedia articles using the WikiProject Computer science category? If so, is there a data dump for it, or is there some other way to obtain this data?
I am currently using Python (i.e. pywikibot and pymediawiki), but I would also be happy to receive answers in other languages.
I can provide more details if needed.
You can use API:Categorymembers to get the list of subcategories and pages. Set the "cmtype" parameter to "subcat" to get subcategories, and the "cmnamespace" parameter to "0" to get articles.
You can also get the lists from the database (the category hierarchy information is in the categorylinks table, and article information is in the page table).
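For instance, here is a minimal Python sketch of those two requests (an illustration assuming the requests library, with Category:Computer science as a stand-in category; the parameter names themselves come from API:Categorymembers):
import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

def category_members(cmtitle, **extra):
    # Base query listing members of a category, up to 500 at a time
    params = {
        'action': 'query',
        'format': 'json',
        'list': 'categorymembers',
        'cmtitle': cmtitle,
        'cmlimit': 500,
    }
    params.update(extra)
    data = requests.get(API_URL, params=params).json()
    return [m['title'] for m in data['query']['categorymembers']]

# "cmtype" set to "subcat" returns only subcategories
subcategories = category_members('Category:Computer science', cmtype='subcat')

# "cmnamespace" set to "0" returns only articles
articles = category_members('Category:Computer science', cmnamespace='0')

print(subcategories)
print(articles)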
As I suggested and added to @arash's answer, you can use the Wikipedia API to get the Wikipedia data. Here is the link with a description of how to do that: API:Categorymembers#GET_request
As you said you need to fetch the data programmatically, here is some sample code in JavaScript. It will fetch the first 500 names from Category:WikiProject_Computer_science_articles and display them as output. You can adapt this example to the language of your choice:
// Importing the module
const fetch = require('node-fetch');
// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";
// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Iterating over all the response data
    for (let i = 0; i < len; i++) {
        // Printing the names
        console.log(t.query.categorymembers[i].title);
    }
});
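If you would rather stay in Python, a rough equivalent of the same request might look like this (a sketch assuming the requests library, not part of the original code):
import requests

url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'format': 'json',
    'list': 'categorymembers',
    'cmtitle': 'Category:WikiProject_Computer_science_articles',
    'cmprop': 'ids|title',
    'cmlimit': 500,
}
data = requests.get(url, params=params).json()
# Printing the first 500 member titles
for member in data['query']['categorymembers']:
    print(member['title'])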
To write the data to a file, you can do it like this:
//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');
//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";
//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty array
    let titles = [];
    // Iterating over all the response data
    for (let i = 0; i < len; i++) {
        // Printing the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles[i] = title;
    }
    // Joining the array with commas; a forward slash avoids the '\t' escape in the path
    fs.writeFileSync('pathtotitles/titles.txt', titles.join(','));
});
The above stores the data in a single file with the titles separated by commas, because we join the JavaScript array there. If you want one title per line without the commas, you need to do this:
//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');
//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";
//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty string
    let titles = '';
    // Iterating over all the response data
    for (let i = 0; i < len; i++) {
        // Printing the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles += title + "\n";
    }
    fs.writeFileSync('pathtotitles/titles.txt', titles);
});
With cmlimit we cannot fetch more than 500 titles per request, so we need to use cmcontinue to check for and fetch the next pages...
Try the code below; it fetches all the titles of a particular category, prints them, and appends the data to a file:
//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');
//URL with resources to fetch
var url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";
// Method to fetch one page of results and append the data to a file
var fetchTheData = async (url) => {
    return await fetch(url).then(res => res.json()).then(data => {
        // Getting the length of the returned array
        let len = data.query.categorymembers.length;
        // Initializing an empty string
        let titles = '';
        // Iterating over all the response data
        for (let i = 0; i < len; i++) {
            // Printing the names
            let title = data.query.categorymembers[i].title;
            console.log(title);
            titles += title + "\n";
        }
        // Appending to the file
        fs.appendFileSync('pathtotitles/titles.txt', titles);
        // Returning the continuation token, or null when there are no more pages
        return data.continue ? data.continue.cmcontinue : null;
    });
}
// Method which constructs the next page URL and keeps fetching until the category is exhausted
var constructNextPageURL = async (url) => {
    // Getting the first continuation token
    let nextPage = await fetchTheData(url);
    while (nextPage) {
        // Constructing the next page URL with the continuation token (encoded, since it can contain '|')
        let nextPageURL = url + '&cmcontinue=' + encodeURIComponent(nextPage);
        console.log("=> The next page URL is : " + nextPageURL);
        nextPage = await fetchTheData(nextPageURL);
    }
    console.log("===>>> Finished Fetching...");
}
// Calling to begin extraction
constructNextPageURL(url);
Hope this helps...
Saw this page in my Google results, so I'm leaving some working code here for posterity. It talks to Wikipedia's API directly and doesn't use pywikibot or pymediawiki.
Getting the article names is a two-step process, because the members of the category are not the articles themselves but their talk pages. So first we get the talk pages, and then we have to get the parent pages, the actual articles.
(For more information about the parameters used in the API requests, check the pages on querying category members and querying page info.)
import time
import requests
from datetime import datetime,timezone
import json
utc_time_now = datetime.now(timezone.utc)
utc_time_now_string =\
utc_time_now.replace(microsecond=0).replace(tzinfo=None).isoformat() + 'Z'
api_url = 'https://en.wikipedia.org/w/api.php'
headers = {'User-Agent': '<Your purpose>, owner_name: <Your name>, email_id: <Your email id>'}
# or you can follow instructions at
# https://www.mediawiki.org/wiki/API:Etiquette#The_User-Agent_header
category = "Category:WikiProject_Computer_science_articles"
combined_category_members = []
params = {
    'action': 'query',
    'format': 'json',
    'list': 'categorymembers',
    'cmtitle': category,
    'cmprop': 'ids|title|timestamp',
    'cmlimit': 500,
    'cmstart': utc_time_now_string,
    # you can also put a 'cmend': '20210101000000'
    # (that YYYYMMDDHHMMSS string stands for 12 am UTC on Jan 1, 2021)
    # this then gathers category members added from now back to the 'cmend' value
    'cmdir': 'older',
    'cmnamespace': '0|1',
    'cmsort': 'timestamp'
}
response = requests.get(api_url, headers=headers, params=params)
data = response.json()
category_members = data['query']['categorymembers']
combined_category_members.extend(category_members)
while 'continue' in data:
    params.update(data['continue'])
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    category_members = data['query']['categorymembers']
    combined_category_members.extend(category_members)
#so far we've only gotten the talk page ids
#now we have to get the parent page ids from the talk page ids
final_dict = {}
talk_page_id_list = []
for member in combined_category_members:
    talk_page_id = member['pageid']
    talk_page_id_list.append(talk_page_id)
while talk_page_id_list: #while not an empty list
    fifty_pageid_batch = talk_page_id_list[0:50]
    fifty_pageid_batch_converted = [str(number) for number in fifty_pageid_batch]
    fifty_pageid_string = '|'.join(fifty_pageid_batch_converted)
    params = {
        'action': 'query',
        'format': 'json',
        'prop': 'info',
        'pageids': fifty_pageid_string,
        'inprop': 'subjectid|associatedpage'
    }
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    for talk_page_id, talk_page_id_dict in data['query']['pages'].items():
        # members fetched from namespace 0 are already articles and have no 'subjectid'
        if 'subjectid' not in talk_page_id_dict:
            continue
        page_id_raw = talk_page_id_dict['subjectid']
        page_id = str(page_id_raw)
        page_title = talk_page_id_dict['associatedpage']
        final_dict[page_id] = page_title
    del talk_page_id_list[0:50]
with open('comp_sci_category_members.json', 'w', encoding='utf-8') as filex:
    json.dump(final_dict, filex, ensure_ascii=False)
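Since the question mentions pywikibot, here is a minimal sketch of the same two-step idea using pywikibot instead (assuming a working user-config.py; this is an illustration, not part of the code above):
import pywikibot

# assumes pywikibot is already configured for English Wikipedia
site = pywikibot.Site('en', 'wikipedia')
cat = pywikibot.Category(site, 'Category:WikiProject_Computer_science_articles')

final_dict = {}
# members of this category are talk pages, so restrict to namespace 1
for member in cat.articles(namespaces=1):
    article = member.toggleTalkPage()  # switch from the talk page to the article itself
    final_dict[str(article.pageid)] = article.title()

print(len(final_dict), 'articles found')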