FlickrAPI only returns an incomplete number of results
My goal is to extract geodata (latitude and longitude values), views, photo IDs, URLs, and upload dates from the Flickr database within the geographic boundaries of the city of Cologne, Germany, and then write the data to a csv file. Using only tags='Köln', the total number of results is roughly 110,000. From these I want to extract at least a five-digit number of data points. To that end I set three delimiters: a tag, a maximum upload date, and a minimum upload date.
What already works: the data is successfully written to the csv.
What doesn't work yet: when I dump the returned search results with xml.etree.ElementTree.dump(), I can see that roughly 3,700 results are found for the given search parameters. As far as I know, this number is within the limit of 4,000 results per query set by Flickr. However, only 700 to 1,000 data points are written to the csv file. This number is never the same and changes with every execution, which is odd since I explicitly defined the time frame. In addition, I still get kicked from the server now and then (error code 500), despite adding a timer between calls with time.sleep(1). After struggling with these barely documented limits, I genuinely don't know why my code still isn't working as intended.
The code I am using:
import flickrapi
import os
import datetime
import time
## Only needed to explore the xml tree
# import xml

## API key and secret provided by Flickr
api_key = 'api key'
api_secret = 'api secret'

## Approximate geographic coordinates of the administrative boundaries of the city of Cologne
boundaries = '6.8064182,50.8300729,7.1528453,51.0837915'

## Counter for the ID column required by GIS software
id_count = 1

## Creation of an editable csv file and its top row
csv = open('flickr_data.csv', mode='a')
if os.stat('flickr_data.csv').st_size == 0:
    csv.write('ID,Photo_ID,Lat,Lon,Views,Taken_Unix,Taken,URL \n')

## Authentication of the Flickr API
flickr = flickrapi.FlickrAPI(api_key, api_secret)

## Page counter
page_number = 1

## Only needed to explore the xml tree
# test_list = flickr.photos_search(max_upload_date = '2020-07-09 23:59:59',min_upload_date = '2020-01-15 0:00:00',tags = 'Köln',bbox = boundaries,has_geo = '1',page = 1,extras = 'views',per_page = '250')
# xml.etree.ElementTree.dump(test_list)

## While loop keeps running until page 16 is reached. The total number of pages for the wanted search query is 452.
## However, Flickr only returns a number of photos equivalent to 16 pages of 250 results.
## At this point, the code is reiterated until the maximum number of pages is reached.
while page_number < 17:
    ## Flickr search for the geographic boundaries of Cologne, Germany.
    photo_list = flickr.photos_search(tags = 'Köln',
                                      max_upload_date = '2020-07-09 23:59:59',
                                      min_upload_date = '2020-01-15 00:00:00',
                                      bbox = boundaries,
                                      has_geo = '1',
                                      page = page_number,
                                      extras = 'views',
                                      per_page = '250') ## maximum allowed photos per page for bbox-delimited requests
    ## For loop keeps running as long as there are photos on a page
    for photo in photo_list[0]:
        ## Extraction of latitude and longitude data from the search results
        geodata = flickr.photos_geo_getLocation(photo_id = photo.attrib['id'])
        lat = geodata[0][0].attrib['latitude']
        lon = geodata[0][0].attrib['longitude']
        ## Extraction of views from the search results
        views = photo.get('views')
        ## Extraction and conversion of upload dates
        photo_info = flickr.photos.getInfo(photo_id = photo.attrib['id'])
        date_unix = int(photo_info[0][4].attrib['posted'])
        date = datetime.datetime.utcfromtimestamp(date_unix).strftime('%Y-%m-%d %H:%M:%S')
        url = 'https://www.flickr.com/photos/' + photo.attrib['owner'] + '/' + photo.attrib['id']
        ## The csv is filled with the acquired information
        csv.write('%s,%s,%s,%s,%s,%s,%s,%s \n' % (id_count,
                                                  photo.attrib['id'],
                                                  lat,
                                                  lon,
                                                  views,
                                                  date_unix,
                                                  date,
                                                  url))
        id_count += 1
        ## 1 second wait time between calls to prevent error code 500
        time.sleep(1)
    ## Turns the page
    page_number += page_number
## Total number of photos searched
print(sum(1 for line in open('flickr_data.csv')) - 1)
csv.close()
Here is an excerpt of the xml returned by flickr.photos_search:
<rsp stat="ok">
<photos page="1" pages="16" perpage="250" total="3755">
<photo id="50094525552" owner="98355876@N00" secret="6d66d421af" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="250" />
<photo id="50093709173" owner="98355876@N00" secret="90c31cac1d" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="260" />
<photo id="50093706783" owner="98355876@N00" secret="9521b8ba7d" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="224" />
<photo id="50093641658" owner="82692690@N02" secret="e26afb1e79" server="65535" farm="66" title="Cabecera. Catedral gótica de Colonia. JX3." ispublic="1" isfriend="0" isfamily="0" views="201" />
<photo id="50090280721" owner="98355876@N00" secret="cc0e2d7b8b" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="295" />
<photo id="50090278631" owner="98355876@N00" secret="8113aaa628" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="280" />
<photo id="50090277186" owner="98355876@N00" secret="73753c811d" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="320" />
<photo id="50090150901" owner="136678496@N04" secret="6de14ca572" server="65535" farm="66" title="Good Morning" ispublic="1" isfriend="0" isfamily="0" views="104" />
<photo id="50089819277" owner="7283893@N05" secret="43e5290b07" server="65535" farm="66" title="Der Chef / The Boss" ispublic="1" isfriend="0" isfamily="0" views="421" />
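As a side note, the `pages` and `total` attributes in the response header above carry the paging information. A minimal sketch of reading them with the standard library (the `sample` string is a hypothetical, abbreviated response, not the full excerpt):

```python
import xml.etree.ElementTree as ET

## Hypothetical, abbreviated photos_search-style response for illustration
sample = '''<rsp stat="ok">
  <photos page="1" pages="16" perpage="250" total="3755">
    <photo id="50094525552" owner="98355876@N00" views="250" />
  </photos>
</rsp>'''

photos = ET.fromstring(sample).find('photos')
## Paging info like this could bound the while loop instead of a hardcoded 17
total_pages = int(photos.get('pages'))
total_results = int(photos.get('total'))
print(total_pages, total_results)  # 16 3755
```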
And here is the script's output, with the ID count printed at the end of each for loop and the page number printed for each while loop:
1
2
3
4
5
6
7
8
9
(...)
245
246
247
248
249
250
-------- PAGE 2 --------
251
252
253
254
255
256
257
(...)
493
494
495
496
497
498
499
500
-------- PAGE 4 --------
501
502
503
504
505
506
507
(...)
743
744
745
746
747
748
749
750
-------- PAGE 8 --------
751
752
753
754
755
756
757
758
759
(...)
990
991
992
993
994
995
996
997
998
999
1000
-------- PAGE 16 --------
1001
1002
1003
1004
1005
-------- PAGE 32 --------
As you discovered, page_number doubles at the end of the while loop, which is why the iterator skips so many API results:

page_number += page_number

To fix it, simply adjust it to increment by 1:

page_number += 1
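The skipped pages can be seen directly. A minimal sketch comparing the two increments (the helper `visited_pages` is hypothetical, purely for illustration):

```python
def visited_pages(doubling, last_page=16):
    """Return the list of pages a loop like `while page_number < 17` visits."""
    pages, page = [], 1
    while page <= last_page:
        pages.append(page)
        page = page + page if doubling else page + 1
    return pages

print(visited_pages(doubling=True))   # [1, 2, 4, 8, 16] -- only 5 of the 16 pages
print(visited_pages(doubling=False))  # pages 1 through 16 -- every page
```

With `+= 1` the loop walks pages 1 through 16 in order instead of jumping ahead to 32, which matches the page markers in your output.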