Python: getting npm package data from a CouchDB endpoint
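I want to fetch npm package metadata. I found this endpoint, which gives me all the metadata I need.
My plan is to select some specific keys and add that data to a database (I could also store it in a JSON file, but the data is huge). I wrote the following script to fetch the data: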
import requests
import json
import sys

db = 'https://replicate.npmjs.com';
r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs" : "true"})
for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(line)
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))
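Note that I did not even include all-docs, but it still gets stuck in an infinite loop. I think this is because the data is huge.
A look at the head of the output from https://replicate.npmjs.com/_all_docs gives me the following: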
{"total_rows":1017703,"offset":0,"rows":[
{"id":"0","key":"0","value":{"rev":"1-5fbff37e48e1dd03ce6e7ffd17b98998"}},
{"id":"0-","key":"0-","value":{"rev":"1-420c8f16ec6584c7387b19ef401765a4"}},
{"id":"0----","key":"0----","value":{"rev":"1-55f4221814913f0e8f861b1aa42b02e4"}},
{"id":"0-1-project","key":"0-1-project","value":{"rev":"1-3cc19950252463c69a5e717d9f8f0f39"}},
{"id":"0-100","key":"0-100","value":{"rev":"1-c4f41a37883e1289f469d5de2a7b505a"}},
{"id":"0-24","key":"0-24","value":{"rev":"1-e595ec3444bc1039f10c062dd86912a2"}},
{"id":"0-60","key":"0-60","value":{"rev":"2-32c17752acfe363fa1be7dbd38212b0a"}},
{"id":"0-9","key":"0-9","value":{"rev":"1-898c1d89f7064e58f052ff492e94c753"}},
{"id":"0-_-0","key":"0-_-0","value":{"rev":"1-d47c142e9460c815c19c4ed3355d648d"}},
{"id":"0.","key":"0.","value":{"rev":"1-11c33605f2e3fd88b5416106fcdbb435"}},
{"id":"0.0","key":"0.0","value":{"rev":"1-5e541d4358c255cbcdba501f45a66e82"}},
{"id":"0.0.1","key":"0.0.1","value":{"rev":"1-ce856c27d0e16438a5849a97f8e9671d"}},
{"id":"0.0.168","key":"0.0.168","value":{"rev":"1-96ab3047e57ca1573405d0c89dd7f3f2"}},
{"id":"0.0.250","key":"0.0.250","value":{"rev":"1-c07ad0ffb7e2dc51bfeae2838b8d8bd6"}},
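Note that all the documents start from the second line (i.e. all the documents are part of the value of the "rows" key). My question is how to get only the value of the "rows" key (i.e. all the documents). I found this repository, which serves a similar purpose, but I could not use or convert it because I am a beginner in JavaScript.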
Is there a reason you are not decoding the JSON before iterating over the lines?
Can you try this:
import requests

r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})
# r.json() decodes the whole body and parses it as JSON,
# so the full response is held in memory at once
data = r.json()
# "rows" is a list of dicts, so fields are read with subscripts
for row in data["rows"]:
    print(row["key"])
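With more than a million rows, parsing the whole response in one call may not be practical. A minimal sketch of fetching it in pages instead, assuming the endpoint accepts the standard CouchDB `_all_docs` query parameters `limit`, `startkey` and `skip` (an assumption about this particular server, not something confirmed in this thread). The helper names `page_params` and `iter_rows` are made up for illustration:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

URL = "https://replicate.npmjs.com/_all_docs"


def page_params(last_key=None, limit=1000):
    """Build query parameters for one page of _all_docs rows."""
    params = {"limit": limit}
    if last_key is not None:
        # CouchDB expects keys as JSON values, so the string is JSON-encoded.
        params["startkey"] = json.dumps(last_key)
        # startkey is inclusive; skip the row already seen on the previous page.
        params["skip"] = 1
    return params


def iter_rows(url=URL, limit=1000):
    """Yield rows page by page; each page is small enough to parse at once."""
    last_key = None
    while True:
        with urlopen(f"{url}?{urlencode(page_params(last_key, limit))}") as resp:
            rows = json.load(resp)["rows"]
        if not rows:
            return
        yield from rows
        last_key = rows[-1]["key"]
```

Each request then returns a complete, small JSON document, so the plain `json` module is enough and memory use stays bounded by the page size.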
If there is no stream=True among the arguments of get(), the whole response is downloaded into memory before the loop even starts.
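To see the effect of `stream=True`, here is a small sketch that reads only the first few lines of the response and then closes the connection. The helper names `head` and `peek` are hypothetical, and the 30-second timeout is an arbitrary choice:

```python
from itertools import islice

URL = "https://replicate.npmjs.com/_all_docs"


def head(lines, n=5):
    """Return the first n non-empty items of an iterable without reading further."""
    return list(islice((line for line in lines if line), n))


def peek(url=URL, n=5):
    """Fetch only the first n non-empty lines of the response body."""
    import requests  # third-party; imported here so head() works without it

    # stream=True defers downloading the body until it is iterated over.
    with requests.get(url, stream=True, timeout=30) as r:
        r.raise_for_status()
        return head(r.iter_lines(), n)
```

Calling `peek()` should return quickly even though the full listing is very large, because leaving the `with` block closes the connection before the rest of the body is read.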
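Then there is the problem that the lines themselves are not valid JSON. For this you need an incremental JSON parser such as ijson. ijson in turn wants a file-like object, which is not easy to get from a requests.Response, so I will use urllib from the Python standard library here: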
#!/usr/bin/env python3
from urllib.request import urlopen

import ijson


def main():
    with urlopen('https://replicate.npmjs.com/_all_docs') as json_file:
        for row in ijson.items(json_file, 'rows.item'):
            print(row)


if __name__ == '__main__':
    main()
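If installing ijson is not an option, the sample output in the question suggests a cruder alternative: each row sits on its own line and ends with a comma, so a line can be parsed once that comma is stripped. This is a different technique from incremental parsing, and it relies on the server's line layout, which is an assumption rather than a guarantee. The names `parse_row_line` and `select_keys` are hypothetical:

```python
import json


def parse_row_line(line):
    """Parse one line of the response; return the row dict, or None for
    lines that are not rows (the header line and the closing "]}")."""
    text = line.decode("utf-8").strip().rstrip(",")
    if not text.startswith('{"id"'):
        return None
    return json.loads(text)


def select_keys(row):
    """Keep only the fields of interest: the package name and its revision."""
    return {"name": row["id"], "rev": row["value"]["rev"]}
```

The byte lines can come from any streamed source, for example iterating over the `urlopen` response object, and the dictionaries returned by `select_keys` are small enough to insert into a database one at a time.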