FlickrAPI only returns an incomplete number of results

My goal is to extract geodata (latitude and longitude values), views, photo IDs, URLs and upload dates from the Flickr database for everything within the geographic boundaries of the city of Cologne, Germany, and then write the data to a csv file. A search using only tags='Köln' yields roughly 110,000 results in total, from which I would like to extract at least a five-digit number of data points. To narrow the query, I set three delimiters: the tag, a maximum upload date and a minimum upload date.
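Since Flickr caps each query at roughly 4,000 results, one common workaround is to slice the overall time frame into smaller upload-date windows and issue one query per window. The helper below is a hypothetical sketch (the name `date_windows` and the 30-day default are my own choices, not part of the original code); each returned pair could be passed as `min_upload_date` / `max_upload_date`:

```python
import datetime

def date_windows(start, end, days=30):
    """Split [start, end] into consecutive windows of at most `days` days.

    Hypothetical helper: each (window_start, window_end) pair can serve as a
    (min_upload_date, max_upload_date) delimiter so that every individual
    query stays below Flickr's ~4,000-result cap.
    """
    windows = []
    cur = start
    step = datetime.timedelta(days=days)
    while cur < end:
        nxt = min(cur + step, end)  # clamp the last window to `end`
        windows.append((cur, nxt))
        cur = nxt
    return windows
```

The windows tile the full range without gaps, so summing the per-window results should approximate the full 110,000-result set, provided each window stays under the cap.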

What already works: the data is successfully written to the csv file.

What does not work yet: when I dump the search result with xml.etree.ElementTree.dump(), I can see that roughly 3,700 results were found for the given search parameters. As far as I can tell, this number is within the limit of 4,000 results per query set by Flickr. However, only 700 to 1,000 data points end up in the csv file. This number is never the same and changes with every execution, which is strange, since I explicitly defined the time frame. On top of that, I still get kicked off the server from time to time (error code 500), despite adding a timer with time.sleep(1) between the calls. After struggling with these barely documented limits, I honestly have no idea why my code still does not work as intended.
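For the intermittent 500 errors, a fixed one-second sleep often is not enough; a retry wrapper with exponential backoff is the usual pattern. The sketch below is a generic helper, not part of the original script (the name `call_with_retries` and the retry limits are assumptions); it could wrap calls such as `flickr.photos_geo_getLocation`:

```python
import time

def call_with_retries(func, *args, max_attempts=5, base_delay=1.0, **kwargs):
    """Retry a flaky API call with exponential backoff.

    Hypothetical helper: retries `func` on any exception (e.g. an HTTP 500
    from the Flickr server), doubling the wait before each new attempt,
    and re-raises after `max_attempts` failures.
    """
    for attempt in range(max_attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the original error
            time.sleep(base_delay * 2 ** attempt)
```

A call like `call_with_retries(flickr.photos_geo_getLocation, photo_id=some_id)` would then survive occasional server hiccups instead of aborting the whole run.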

The code I used:

import flickrapi
import os
import datetime
import time

## Only needed to explore the xml tree
# import xml 

## API key and secret provided by Flickr
api_key = 'api key'
api_secret = 'api secret'

## Approximate geographic coordinates of the administrational boundaries of the city of Cologne
boundaries = '6.8064182,50.8300729,7.1528453,51.0837915'

## Counter for the ID column required by GIS software
id_count = 1

## Creation of an editable csv file and its top row
csv = open('flickr_data.csv', mode='a')
if(os.stat('flickr_data.csv').st_size == 0):
    csv.write('ID,Photo_ID,Lat,Lon,Views,Taken_Unix,Taken,URL \n')

## Authentication of the Flickr API
flickr = flickrapi.FlickrAPI(api_key, api_secret)

## Page counter
page_number = 1

## Only needed to explore the xml tree
# test_list = flickr.photos_search(max_upload_date = '2020-07-09 23:59:59',min_upload_date = '2020-01-15 0:00:00',tags = 'Köln',bbox = boundaries,has_geo = '1',page = 1,extras = 'views',per_page = '250')
# xml.etree.ElementTree.dump(test_list)

## While loop keeps running until page 16 is reached. The total number of pages for the wanted search query is 452.
## However, Flickr only returns a number of photos equivalent to 16 pages of 250 results.
## At this point, the code is reiterated until the maximum number of pages is reached.
while page_number < 17:
    
    ## Flickr search for the geographic boundaries of Cologne, Germany.
    photo_list = flickr.photos_search(tags = 'Köln',
                                      max_upload_date = '2020-07-09 23:59:59',
                                      min_upload_date = '2020-01-15 00:00:00',
                                      bbox = boundaries,
                                      has_geo = '1',
                                      page = page_number,
                                      extras = 'views',
                                      per_page = '250') ## maximum allowed photos per page for bbox-delimited requests
    
    ## For loop keeps running as long as there are photos on a page
    for photo in photo_list[0]:
        ## extraction of latitude and longitude data from the search results
        geodata = flickr.photos_geo_getLocation(photo_id = photo.attrib['id'])
        lat = geodata[0][0].attrib['latitude']
        lon = geodata[0][0].attrib['longitude']
        
        ## extraction of views from the search results
        views = photo.get('views')

        ## extraction and conversion of upload dates
        photo_info = flickr.photos.getInfo(photo_id = photo.attrib['id'])
        date_unix = int(photo_info[0][4].attrib['posted'])
        date = datetime.datetime.utcfromtimestamp(date_unix).strftime('%Y-%m-%d %H:%M:%S')
        url = 'https://www.flickr.com/photos/' + photo.attrib['owner'] + '/' + photo.attrib['id']

        
        
        ## the csv is filled with the acquired information
        csv.write('%s,%s,%s,%s,%s,%s,%s,%s \n' % (id_count,
                                                  photo.attrib['id'],
                                                  lat,
                                                  lon,
                                                  views,
                                                  date_unix,
                                                  date,
                                                  url))
        id_count += 1
        ## 1 second wait time between calls to prevent error code 500
        time.sleep(1)
    ## Turns the page
    page_number += page_number

## Total number of photos searched
print(sum(1 for line in open('flickr_data.csv')) - 1)

csv.close()

Here is an excerpt of the xml returned by flickr.photos_search:
<rsp stat="ok">
<photos page="1" pages="16" perpage="250" total="3755">
    <photo id="50094525552" owner="98355876@N00" secret="6d66d421af" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="250" />
    <photo id="50093709173" owner="98355876@N00" secret="90c31cac1d" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="260" />
    <photo id="50093706783" owner="98355876@N00" secret="9521b8ba7d" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="224" />
    <photo id="50093641658" owner="82692690@N02" secret="e26afb1e79" server="65535" farm="66" title="Cabecera. Catedral gótica de Colonia. JX3." ispublic="1" isfriend="0" isfamily="0" views="201" />
    <photo id="50090280721" owner="98355876@N00" secret="cc0e2d7b8b" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="295" />
    <photo id="50090278631" owner="98355876@N00" secret="8113aaa628" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="280" />
    <photo id="50090277186" owner="98355876@N00" secret="73753c811d" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="320" />
    <photo id="50090150901" owner="136678496@N04" secret="6de14ca572" server="65535" farm="66" title="Good Morning" ispublic="1" isfriend="0" isfamily="0" views="104" />
    <photo id="50089819277" owner="7283893@N05" secret="43e5290b07" server="65535" farm="66" title="Der Chef / The Boss" ispublic="1" isfriend="0" isfamily="0" views="421" />

EDIT: Here is the output of the script, printing the ID count at the end of each for loop iteration and the page number at the end of each while loop iteration:

1
2
3
4
5
6
7
8
9

(...)

245
246
247
248
249
250
-------- PAGE 2 --------
251
252
253
254
255
256
257

(...)

493
494
495
496
497
498
499
500
-------- PAGE 4 --------
501
502
503
504
505
506
507

(...)

743
744
745
746
747
748
749
750
-------- PAGE 8 --------
751
752
753
754
755
756
757
758
759

(...)

990
991
992
993
994
995
996
997
998
999
1000
-------- PAGE 16 --------
1001
1002
1003
1004
1005
-------- PAGE 32 --------

As you discovered, the increment at the end of the while loop doubles page_number instead of advancing it by one, so most of the API result pages are skipped:

page_number += page_number

To fix it, simply increment by 1 instead:

page_number += 1
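The difference is easy to see in isolation. The sketch below (a hypothetical `iter_pages` helper, not from the original script) shows the corrected pattern: incrementing by one visits pages 1, 2, 3, ..., whereas `page_number += page_number` would jump 1, 2, 4, 8, ... and silently drop everything in between:

```python
def iter_pages(fetch_page, last_page):
    """Yield all items from pages 1..last_page, one page at a time.

    `fetch_page` stands in for any per-page API call (e.g. a wrapper
    around flickr.photos_search with page=page_number) that returns a
    list of results for the given page number.
    """
    page_number = 1
    while page_number <= last_page:
        yield from fetch_page(page_number)
        page_number += 1  # increment by one, NOT page_number += page_number
```

As an aside, the flickrapi library also offers `flickr.walk(...)`, which takes the same search parameters as `photos_search` and handles the pagination internally, so off-by-one (or doubling) mistakes in a hand-written page loop cannot occur in the first place.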