How to import a large JSON file with GridFS to MongoDB directly with the schema ready

Question

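Normally we can use the mongoimport command from the shell to upload JSON files, including large ones, to MongoDB. The schema is ready in the collection and we do not have to worry about the maximum size (16 MB), because mongoimport takes care of batch sizes and so on (this has been tested and works). The data ends up as rows and everything is fine.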

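The main question here is how to do the same thing from Python using pymongo and GridFS. When I use GridFS, the upload goes to a different kind of collection (*.files), and the schema is not defined the way it is with the first approach. The file is stored as bytes, and the collection name is *.files.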

I would like to know how to do this in Python and get the same result as with the mongoimport command.

My code is:

import gridfs

# Stores the raw bytes of the file in test_collection.files / test_collection.chunks
fs = gridfs.GridFS(db, collection='test_collection')
with open(path_to_big_json_file, 'rb') as dictionary:
    fs.put(dictionary, filename='test_filename')

The result looks like this:

My goal is to have the schema ready right away in a normal collection, rather than in a GridFS collection like this:

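I tried bulk inserts in pymongo, but they did not work because the file is too large. I am sure there is a way to do this, and GridFS may not be necessary at all, but I would like to keep the solution in Python.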
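For illustration, here is a minimal sketch of what a batched insert could look like when the file is never loaded in one piece. It assumes the input is newline-delimited JSON (one document per line), which is not stated in the question, and the names `iter_json_lines`, `batched`, `import_json_lines` and `batch_size` are hypothetical. The only pymongo call it relies on is `Collection.insert_many`.

```python
import json
from itertools import islice


def iter_json_lines(path):
    """Yield one parsed document per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def batched(iterable, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def import_json_lines(collection, path, batch_size=1000):
    """Insert the file into `collection` batch by batch; return the document count."""
    inserted = 0
    for batch in batched(iter_json_lines(path), batch_size):
        collection.insert_many(batch)  # assumed to be a pymongo Collection
        inserted += len(batch)
    return inserted
```

Only one batch is held in memory at a time, so the size of the file matters less than the size of the largest single document, which still has to fit under the 16 MB limit.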

Thanks!

Answer 1

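OK, I have created a function that splits the data frame into smaller data frames and saves them temporarily so they can be inserted afterwards.

It sizes each data frame to stay below 16 MB with a 10% margin, and then we ingest them one by one.

P.S.: this is for GeoJSON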
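The sizing rule can be worked through on its own. The sketch below only reproduces the arithmetic; `plan_chunks` is a hypothetical helper, and the default of 16 MiB for `max_size` is an assumption standing in for the value the server reports.

```python
from math import ceil, floor


def plan_chunks(file_size, row_count, max_size=16 * 1024 * 1024, threshold=10):
    """Return (number_of_chunks, rows_per_chunk) for a file of file_size bytes."""
    # Pad the file size by `threshold` percent so each chunk stays under max_size
    padded_size = file_size * (1 + threshold / 100)
    number_of_chunks = ceil(padded_size / max_size)
    # Never plan fewer than one row per chunk
    rows_per_chunk = max(1, floor(row_count / number_of_chunks))
    return number_of_chunks, rows_per_chunk


chunks, rows = plan_chunks(file_size=100 * 1024 * 1024, row_count=50_000)
print(chunks, rows)
```

For a 100 MiB file with 50,000 rows this plans 7 chunks of 7,142 rows. Because the row count is rounded down, 6 rows are left over and end up in an extra, smaller final chunk.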

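import json
import os
import tempfile
from math import ceil, floor

import geopandas as gpd


def insert_geojson_in_batches_to_mongo(mongoclient, db, collection_name, origin_path, threshold=10):
  # Read the whole GeoJSON file into a GeoDataFrame
  df = gpd.read_file(origin_path)
  file_size = os.path.getsize(origin_path)
  # Note: PyMongo 4 removed this attribute; there the value is available as
  # mongoclient.admin.command('hello')['maxBsonObjectSize']
  max_size = mongoclient.max_bson_size
  # Number of chunks, with the file size padded by `threshold` percent as a margin
  number_of_dataframes = ceil(file_size*(1+threshold/100) / max_size)
  df_len = len(df)
  # At least one row per chunk, otherwise the loop below would never advance
  number_of_rows_per_df = max(1, floor(df_len/number_of_dataframes))
  total_bulks = ceil(df_len/number_of_rows_per_df)
  collection = db.get_collection(collection_name)

  with tempfile.TemporaryDirectory() as tmpdirname:
    count = 0
    k = 0
    while count < df_len:
      filename = os.path.join(tmpdirname, 'df' + str(count) + '.geojson')
      start = count
      count += number_of_rows_per_df
      k += 1
      # Write the slice to a temporary GeoJSON file, then read it back as plain dicts
      df.iloc[start : count].to_file(filename, driver="GeoJSON")
      with open(filename) as f:
        data = json.load(f)
      data = data['features']
      print('bulk {0}/{1} is being loaded'.format(k, total_bulks))
      collection.insert_many(data)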
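As a possible simplification, and not the approach used in the function above, the features can be sliced straight from the parsed GeoJSON without geopandas or temporary files. This is a sketch under the assumption that the file is a FeatureCollection that fits in memory; `insert_features_in_batches` and `batch_size` are hypothetical names.

```python
import json


def insert_features_in_batches(collection, geojson_path, batch_size=500):
    """Insert each GeoJSON feature as its own document, batch_size at a time."""
    with open(geojson_path, encoding="utf-8") as handle:
        features = json.load(handle)["features"]  # the whole file is held in memory

    # Each slice becomes one insert_many call on the (assumed) pymongo Collection
    for start in range(0, len(features), batch_size):
        collection.insert_many(features[start:start + batch_size])
    return len(features)
```

Each feature becomes one document in a normal collection, so the 16 MB limit applies to a single feature and not to the file as a whole.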

python

json

mongodb

pymongo

gridfs