如何查询特定年份的 arXiv?
How to query arXiv for a specific year?
我正在使用下面显示的代码从 arXiv 检索论文。我想检索标题中包含“机器”和“学习”字样的论文。论文数量较多,想实现按年份分片(published
)。
如何在 search_query
中请求 2020 年和 2019 年的记录?请注意,我对 post-filtering.
不感兴趣
import urllib.request
import time
import feedparser
# Base api query url
base_url = 'http://export.arxiv.org/api/query?';
# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
start = 0
total_results = 5000
results_per_iteration = 1000
wait_time = 3
papers = []
print('Searching arXiv for %s' % search_query)
for i in range(start,total_results,results_per_iteration):
print("Results %i - %i" % (i,i+results_per_iteration))
query = 'search_query=%s&start=%i&max_results=%i' % (search_query,
i,
results_per_iteration)
# perform a GET request using the base_url and query
response = urllib.request.urlopen(base_url+query).read()
# parse the response using feedparser
feed = feedparser.parse(response)
# Run through each entry, and print out information
for entry in feed.entries:
#print('arxiv-id: %s' % entry.id.split('/abs/')[-1])
#print('Title: %s' % entry.title)
#feedparser v4.1 only grabs the first author
#print('First Author: %s' % entry.author)
paper = {}
paper["date"] = entry.published
paper["title"] = entry.title
paper["first_author"] = entry.author
paper["summary"] = entry.summary
papers.append(paper)
# Sleep a bit before calling the API again
print('Bulk: %i' % 1)
time.sleep(wait_time)
根据 arXiv documentation,没有 published
或 date
字段可用。
您可以做的是 sort the results 按日期(通过将 &sortBy=submittedDate&sortOrder=descending
添加到您的查询参数)并在您到达 2018 年时停止发出请求。
基本上你的代码应该这样修改:
import urllib.request
import time
import feedparser
# Base api query url
base_url = 'http://export.arxiv.org/api/query?';
# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
i = 0
results_per_iteration = 1000
wait_time = 3
papers = []
year = ""
print('Searching arXiv for %s' % search_query)
while (year != "2018"): #stop requesting when papers date reach 2018
print("Results %i - %i" % (i,i+results_per_iteration))
query = 'search_query=%s&start=%i&max_results=%i&sortBy=submittedDate&sortOrder=descending' % (search_query,
i,
results_per_iteration)
# perform a GET request using the base_url and query
response = urllib.request.urlopen(base_url+query).read()
# parse the response using feedparser
feed = feedparser.parse(response)
# Run through each entry, and print out information
for entry in feed.entries:
#print('arxiv-id: %s' % entry.id.split('/abs/')[-1])
#print('Title: %s' % entry.title)
#feedparser v4.1 only grabs the first author
#print('First Author: %s' % entry.author)
paper = {}
paper["date"] = entry.published
year = paper["date"][0:4]
paper["title"] = entry.title
paper["first_author"] = entry.author
paper["summary"] = entry.summary
papers.append(paper)
# Sleep a bit before calling the API again
print('Bulk: %i' % 1)
i += results_per_iteration
time.sleep(wait_time)
对于“post-filtering”方法,一旦收集到足够多的结果,我就会这样做:
papers2019 = [item for item in papers if item["date"][0:4] == "2019"]
我正在使用下面显示的代码从 arXiv 检索论文。我想检索标题中包含“机器”和“学习”字样的论文。论文数量较多,想实现按年份分片(published
)。
如何在 search_query
中请求 2020 年和 2019 年的记录?请注意,我对 post-filtering.
import urllib.request
import time
import feedparser
# Base api query url
base_url = 'http://export.arxiv.org/api/query?';
# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
start = 0
total_results = 5000
results_per_iteration = 1000
wait_time = 3
papers = []
print('Searching arXiv for %s' % search_query)
for i in range(start,total_results,results_per_iteration):
print("Results %i - %i" % (i,i+results_per_iteration))
query = 'search_query=%s&start=%i&max_results=%i' % (search_query,
i,
results_per_iteration)
# perform a GET request using the base_url and query
response = urllib.request.urlopen(base_url+query).read()
# parse the response using feedparser
feed = feedparser.parse(response)
# Run through each entry, and print out information
for entry in feed.entries:
#print('arxiv-id: %s' % entry.id.split('/abs/')[-1])
#print('Title: %s' % entry.title)
#feedparser v4.1 only grabs the first author
#print('First Author: %s' % entry.author)
paper = {}
paper["date"] = entry.published
paper["title"] = entry.title
paper["first_author"] = entry.author
paper["summary"] = entry.summary
papers.append(paper)
# Sleep a bit before calling the API again
print('Bulk: %i' % 1)
time.sleep(wait_time)
根据 arXiv documentation,没有 published
或 date
字段可用。
您可以做的是 sort the results 按日期(通过将 &sortBy=submittedDate&sortOrder=descending
添加到您的查询参数)并在您到达 2018 年时停止发出请求。
基本上你的代码应该这样修改:
import urllib.request
import time
import feedparser
# Base api query url
base_url = 'http://export.arxiv.org/api/query?';
# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
i = 0
results_per_iteration = 1000
wait_time = 3
papers = []
year = ""
print('Searching arXiv for %s' % search_query)
while (year != "2018"): #stop requesting when papers date reach 2018
print("Results %i - %i" % (i,i+results_per_iteration))
query = 'search_query=%s&start=%i&max_results=%i&sortBy=submittedDate&sortOrder=descending' % (search_query,
i,
results_per_iteration)
# perform a GET request using the base_url and query
response = urllib.request.urlopen(base_url+query).read()
# parse the response using feedparser
feed = feedparser.parse(response)
# Run through each entry, and print out information
for entry in feed.entries:
#print('arxiv-id: %s' % entry.id.split('/abs/')[-1])
#print('Title: %s' % entry.title)
#feedparser v4.1 only grabs the first author
#print('First Author: %s' % entry.author)
paper = {}
paper["date"] = entry.published
year = paper["date"][0:4]
paper["title"] = entry.title
paper["first_author"] = entry.author
paper["summary"] = entry.summary
papers.append(paper)
# Sleep a bit before calling the API again
print('Bulk: %i' % 1)
i += results_per_iteration
time.sleep(wait_time)
对于“post-filtering”方法,一旦收集到足够多的结果,我就会这样做:
papers2019 = [item for item in papers if item["date"][0:4] == "2019"]