Inserting into SQLite with Django takes too long
Hi, I have 26 files (~100 MB each) that I am trying to insert through this view:
def index(request):
    url = '../xaa'
    count = 0
    line_num = 1660792
    start = time.time()
    for lines in fileinput.input([url]):
        user = ast.literal_eval(lines)
        T.objects.create(a=user['a'], b=user['b'], c=user['c'])
        count += 1
        percent = (100 * count) / line_num
        print(f"{percent}%")
    end = time.time()
    print(f"Time : {end - start}%")
    response = HttpResponse('Done')
    return response
But it takes far too long (about 3.5 days for one file). How can I make this faster?
You are reading 1.66 million lines one by one and creating a model instance for each of them in your code. There are serious problems with what you are doing:
First, you create each object one at a time, which means one query per object. That is 1.66 million queries! If that doesn't take time, what would? Next, you print on every iteration; although this may not be noticeable in a small program, printing also takes a considerable amount of time, and that many print calls only slows the program down further.
If you want to create many objects at once, you can use the bulk_create
[Django docs] method, although given how many rows you have, you should probably create them in batches:
def index(request):
    url = '../xaa'
    count = 0
    batch_size = 100  # Flush to the database every 100 lines
    line_num = 1660792
    start = time.time()
    batch_list = []
    for lines in fileinput.input([url]):
        if count % batch_size == 0:
            # Insert the accumulated objects in a single query
            T.objects.bulk_create(batch_list)
            batch_list = []
        user = ast.literal_eval(lines)
        batch_list.append(T(a=user['a'], b=user['b'], c=user['c']))
        count += 1
        # Forego printing
        # percent = (100 * count) / line_num
        # print(f"{percent}%")
    if batch_list:  # Insert any remaining objects
        T.objects.bulk_create(batch_list)
        batch_list = []
    end = time.time()
    print(f"Time : {end - start}")
    response = HttpResponse('Done')
    return response
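As a side note, bulk_create also accepts a batch_size argument, which lets it split the INSERTs for you instead of tracking a counter manually. A minimal sketch, assuming the same model T and the same input file as above (the chunk_size and batch_size values are illustrative choices, not tuned numbers):

import ast
import fileinput
from itertools import islice


def load_file(url='../xaa', chunk_size=5000):
    lines = fileinput.input([url])
    while True:
        # Parse the next chunk of lines into unsaved model instances
        rows = (ast.literal_eval(line) for line in islice(lines, chunk_size))
        chunk = [T(a=row['a'], b=row['b'], c=row['c']) for row in rows]
        if not chunk:
            break
        # One INSERT per 500 rows instead of one per object
        T.objects.bulk_create(chunk, batch_size=500)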
Looking ahead, your files appear to be in an established format such as JSON Lines. You may want to look at Providing data with fixtures [Django docs]. These fixtures support the JSON Lines format from Django 3.2 onwards (see Serialization formats [Django docs]). You may have to modify your files a little to fit the structure the fixtures expect, but then you can leave the loading to the loaddata command [Django docs].
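For illustration, a JSON Lines fixture holds one serialized object per line, roughly like this (the app label myapp, the pk values, and the field contents are assumptions for the sketch, not taken from your snippet):

{"model": "myapp.t", "pk": 1, "fields": {"a": "...", "b": "...", "c": "..."}}
{"model": "myapp.t", "pk": 2, "fields": {"a": "...", "b": "...", "c": "..."}}

Saved with a .jsonl extension (loaddata infers the serialization format from the file extension), it could then be loaded with:

python manage.py loaddata xaa.jsonl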