Exactly 50% of documents are deleted with Cosmos DB and shard key support
I'm using Cosmos DB with the MongoDB 3.6 API and collections that have autoscale and a shard key enabled.
I'm deploying with this ARM template: https://github.com/Azure/azure-quickstart-templates/blob/master/101-cosmosdb-mongodb-autoscale/azuredeploy.json
I have some code that cleans up the collections before the application starts, using the C# driver. The reason I don't use BulkWriteAsync is that I don't want to exceed my throughput settings (currently 500 to 5000 RU):
foreach (var collectionName in Collections)
{
    var collection = database.GetCollection<BsonDocument>(collectionName);
    long count = await collection.CountDocumentsAsync(
        Builders<BsonDocument>.Filter.Empty, null, cancellationToken);
    long deleted = 0;
    while (deleted < count)
    {
        var nextBatchCount = (int)Math.Min(count - deleted, BatchSizeDelete);
        var batch = await collection
            .Aggregate()
            .Skip((int)deleted)
            .Limit(nextBatchCount)
            .Project(Builders<BsonDocument>.Projection.Include("_id"))
            .ToListAsync(cancellationToken);
        deleted += nextBatchCount;
        await collection.DeleteManyAsync(
            Builders<BsonDocument>.Filter.In("_id", batch.Select(x => x["_id"])), cancellationToken);
        Log.Information("Deleted {deleted} from {count} records", deleted, count);
        await Task.Delay(TimeSpan.FromSeconds(0.5), cancellationToken);
    }
}
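The interaction between Skip and deletion in a loop like this can be reproduced in isolation. The following is a minimal Python sketch (not the original C# code) where a plain list stands in for the collection; the slice plays the role of `Skip(deleted).Limit(n)`, which is evaluated against the collection *after* earlier batches have already been removed:

```python
# Simulate the delete loop above: 20 documents, batches of 5.
# 'collection' stands in for the live collection; 'batch' is what
# Skip(deleted).Limit(n) returns AFTER earlier deletions shrank it.
collection = list(range(20))
count = len(collection)
batch_size = 5
deleted = 0
while deleted < count:
    n = min(count - deleted, batch_size)
    batch = set(collection[deleted:deleted + n])      # Skip + Limit
    collection = [x for x in collection if x not in batch]
    deleted += n
print(len(collection))  # 10 of the 20 documents survive
```

After the first batch is deleted, `Skip(deleted)` jumps over documents that are still alive, so each subsequent pass deletes the wrong half of what remains and the loop terminates with exactly 50% of the documents left.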
The deployment template is here:
{
  "type": "Microsoft.DocumentDB/databaseAccounts/mongodbDatabases",
  "apiVersion": "2020-06-01-preview",
  "name": "[concat(parameters('account_name'), '/PatientRecords')]",
  "dependsOn": [],
  "properties": {
    "resource": {
      "id": "PatientRecords"
    },
    "options": {
      "autoscaleSettings": {
        "maxThroughput": "[parameters('autoscaleMaxThroughput')]"
      }
    }
  }
},
{
  "type": "Microsoft.DocumentDB/databaseAccounts/mongodbDatabases/collections",
  "apiVersion": "2020-06-01-preview",
  "name": "[concat(parameters('account_name'), '/PatientRecords/PaRecords')]",
  "dependsOn": [
    "[resourceId('Microsoft.DocumentDB/databaseAccounts/mongodbDatabases', parameters('account_name'), 'PatientRecords')]"
  ],
  "properties": {
    "resource": {
      "id": "PaRecords",
      "shardKey": {
        "ClinicId": "Hash"
      }
    },
    "options": {
      "throughput": 400
    }
  }
},
{
  "type": "Microsoft.DocumentDB/databaseAccounts/mongodbDatabases/collections",
  "apiVersion": "2020-06-01-preview",
  "name": "[concat(parameters('account_name'), '/PatientRecords/PatientRecords')]",
  "dependsOn": [
    "[resourceId('Microsoft.DocumentDB/databaseAccounts/mongodbDatabases', parameters('account_name'), 'PatientRecords')]"
  ],
  "properties": {
    "resource": {
      "id": "PatientRecords",
      "shardKey": {
        "ClinicId": "Hash"
      }
    },
    "options": {}
  }
}
Everything worked until I enabled the shardKey in the ARM template; now, for some reason, exactly 50% of the documents get deleted. For example, if a collection has 25,000 documents, only 12,500 are deleted, and so on.
I also tried BulkWriteAsync, but the result is the same.
What is the root cause of this strange behavior, or is there a flaw in my approach?
I was able to mitigate the issue myself. It is not entirely clear what causes the incomplete deletion, though one plausible culprit is that Skip((int)deleted) is evaluated against the collection after earlier batches have already been removed, so each pass skips over documents that were never deleted. In any case, the following code works:
// Fetch all data as '_id'
var data = await collection
    .Aggregate()
    .Project(Builders<BsonDocument>.Projection.Include("_id"))
    .ToListAsync();

if (data.Count > 0)
{
    // Use bulk write and DeleteOneModel
    await collection.BulkWriteAsync(
        data.Select(x => new DeleteOneModel<BsonDocument>(Builders<BsonDocument>.Filter.Eq("_id", x["_id"]))),
        new BulkWriteOptions() { BypassDocumentValidation = true },
        cancellationToken);

    Log.Information("Deleted {Count} documents in {collectionName}", data.Count, collectionName);
}
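If loading every `_id` at once is too much for large collections, an alternative to the mitigation above is to keep the batching but drop Skip entirely: always query the first batch of whatever remains, since deletion itself advances the "cursor". A minimal in-memory Python sketch of that pattern (my illustration, not the original driver code):

```python
# Same in-memory sketch as the collection, but without Skip:
# always take the first batch of whatever remains.
collection = list(range(20))
batch_size = 5
deleted = 0
while collection:
    batch = collection[:batch_size]      # Limit only, no Skip
    collection = collection[len(batch):]
    deleted += len(batch)
print(deleted)  # 20 -- nothing is left behind
```

In driver terms this would mean re-running the `Aggregate().Limit(nextBatchCount)` query each pass without the `Skip` stage, and stopping when it returns no documents.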