Azure Table Storage multi row query performance
We are having issues with a service that uses Azure Table Storage: queries sometimes take multiple seconds (3 to 30 seconds). This happens daily, but only for some of the queries. Our service and the table storage are not under heavy load (roughly a few hundred calls per hour), yet the table storage still does not perform.
The slow queries are all filter queries that should return 10 rows at most. I have structured the filters so that there is always a partition key and row key pair, then an or operator, then the next partition key and row key pair:
(partitionKey1 and RowKey1) or (partitionKey2 and rowKey2) or (partitionKey3 and rowKey3)
So my current premise is that I need to split the query into separate queries. I verified this with a Python script: when I repeatedly ran the same query either as a single combined query (using or and expecting multiple rows in the result) or split into multiple queries executed in separate threads, I found that the combined query was slow every now and then.
import time
import threading
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity
############################################################################
# Script for querying data from azure table storage or cosmos DB table API.
# SAS token needs to be generated for using this script and a table with data
# needs to exist.
#
# Warning: extensive use of this script may burden the table performance,
# so use with care.
#
# PIP requirements:
# - requires azure-cosmosdb-table to be installed
# * run: 'pip install azure-cosmosdb-table'
dateTimeSince = '2019-06-12T13:16:45.446Z'
sasToken = 'SAS_TOKEN_HERE'
tableName = 'TABLE_NAME_HERE'
table_service = TableService(account_name="ACCOUNT_NAME_HERE", sas_token=sasToken)
tableFilter = "(PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_ed6d31b0') and (RowKey eq 'ed6d31b0-d2a3-4f18-9d16-7f72cbc88cb3') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9be86f34') and (RowKey eq '9be86f34-865b-4c0f-8ab0-decf928dc4fc') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_97af3bdc') and (RowKey eq '97af3bdc-b827-4451-9cc4-a8e7c1190d17') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9d557b56') and (RowKey eq '9d557b56-279e-47fa-a104-c3ccbcc9b023') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_e251a31a') and (RowKey eq 'e251a31a-1aaa-40a8-8cde-45134550235c')"
resultDict = {}
# Do separate queries
filters = tableFilter.split(" or ")
threads = []
def runQueryPrintResult(filter):
    result = table_service.query_entities(table_name=tableName, filter=filter)
    item = result.items[0]
    resultDict[item.RowKey] = item

# Loop where:
# - Step 1: the test is run with the tableFilter query split across multiple threads
#   * returns a single row per query
# - Step 2: the tableFilter query is run as a single query
# - Press enter to repeat the two query tests
while 1:
    # Reset state from the previous round so the result comparison stays valid
    resultDict = {}
    threads = []

    # Do separate queries in parallel threads
    start2 = time.time()
    for filter in filters:
        x = threading.Thread(target=runQueryPrintResult, args=(filter,))
        x.start()
        threads.append(x)
    for x in threads:
        x.join()
    end2 = time.time()
    print("Time elapsed with multi threaded implementation: {}".format(end2 - start2))

    # Do single query
    start1 = time.time()
    listGenerator = table_service.query_entities(table_name=tableName, filter=tableFilter)
    end1 = time.time()
    print("Time elapsed with single query: {}".format(end1 - start1))

    # Verify that both approaches returned the same rows
    counter = 0
    allVerified = True
    for item in listGenerator:
        if item.RowKey in resultDict:
            counter += 1
        else:
            allVerified = False
    if len(listGenerator.items) != len(resultDict):
        allVerified = False

    print("table item count since x: " + str(counter))
    if allVerified:
        print("Both queries returned same amount of results")
    else:
        print("Result count does not match, single threaded count={}, multithreaded count={}".format(
            len(listGenerator.items), len(resultDict)))

    input('Press enter to retry test!')
Here is sample output from the Python script:
Time elapsed with multi threaded implementation: 0.10776209831237793
Time elapsed with single query: 0.2323908805847168
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.0897986888885498
Time elapsed with single query: 0.21547174453735352
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.08280491828918457
Time elapsed with single query: 3.2932426929473877
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.07794523239135742
Time elapsed with single query: 1.4898555278778076
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.07962584495544434
Time elapsed with single query: 0.20011520385742188
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Although the service where we are having the problems is implemented in C#, I have not yet reproduced the Python results on the C# side. There I seem to get worse performance when splitting the query into multiple separate queries than when using a single filter query that returns all the required rows.
So executing the following multiple times and awaiting them all to complete seems to be slower:
TableOperation getOperation =
    TableOperation.Retrieve<HqrScreenshotItemTableEntity>(partitionKey, id.ToString());
TableResult result = await table.ExecuteAsync(getOperation);
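Roughly, the pattern I am using looks like the sketch below (the wrapper method name is made up for illustration; it assumes table is the initialized CloudTable and fans the individual retrieves out with Task.WhenAll):

private async Task<IList<HqrScreenshotItemTableEntity>> GetItemsByPointQueriesAsync(
    CloudTable table, string partitionKey, IEnumerable<Guid> ids)
{
    // One point query (Retrieve) per id, all started up front and awaited together.
    var tasks = ids.Select(id =>
    {
        TableOperation getOperation =
            TableOperation.Retrieve<HqrScreenshotItemTableEntity>(partitionKey, id.ToString());
        return table.ExecuteAsync(getOperation);
    }).ToList();

    TableResult[] results = await Task.WhenAll(tasks);

    // TableResult.Result is null when the entity was not found.
    return results
        .Where(r => r.Result != null)
        .Select(r => (HqrScreenshotItemTableEntity)r.Result)
        .ToList();
}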
than doing it all in a single query:
private IEnumerable<MyTableEntity> GetBatchedItemsTableResult(Guid[] ids, string applicationLink)
{
    var table = InitializeTableStorage();

    TableQuery<MyTableEntity> itemsQuery =
        new TableQuery<MyTableEntity>().Where(TableQueryConstructor(ids, applicationLink));

    IEnumerable<MyTableEntity> result = table.ExecuteQuery(itemsQuery);

    return result;
}

public string TableQueryConstructor(Guid[] ids, string applicationLink)
{
    var fullQuery = new StringBuilder();

    foreach (var id in ids)
    {
        // Encode the link before using it as the partition key, as REST GET requests
        // do not accept non-encoded URL params by default.
        string partitionKey = HttpUtility.UrlEncode(applicationLink);

        // Create a query for a single row in the requested partition.
        string queryForRow = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, id.ToString()));

        if (fullQuery.Length == 0)
        {
            // Append the query for the first row.
            fullQuery.Append(queryForRow);
        }
        else
        {
            // Append queries for subsequent rows with the or operator so the
            // queries stay independent of each other.
            fullQuery.Append($" {TableOperators.Or} ");
            fullQuery.Append(queryForRow);
        }
    }

    return fullQuery.ToString();
}
The test case used for the C# code is quite different from the Python test, though. In C# I am querying 2000 rows out of roughly 100,000 rows of data. When the data is queried in batches of 50 rows, the latter filter query beats running the single-row retrieves in 50 tasks.
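For context, that 50-row batching looks roughly like the sketch below (it reuses TableQueryConstructor and InitializeTableStorage from the snippet above; the batching method itself is hypothetical):

private IEnumerable<MyTableEntity> GetItemsInBatches(Guid[] ids, string applicationLink, int batchSize = 50)
{
    var table = InitializeTableStorage();
    var results = new List<MyTableEntity>();

    // Run one combined or-filter query per batch of ids.
    for (int offset = 0; offset < ids.Length; offset += batchSize)
    {
        Guid[] batch = ids.Skip(offset).Take(batchSize).ToArray();
        var query = new TableQuery<MyTableEntity>()
            .Where(TableQueryConstructor(batch, applicationLink));
        results.AddRange(table.ExecuteQuery(query));
    }

    return results;
}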
Maybe I should repeat the test I did with Python as a C# console application to see whether the .NET client API behaves the same way performance-wise as Python.
Posting this as an answer because the comment was getting too long.
Can you try changing your query to something like the following:
(PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_ed6d31b0' and RowKey eq 'ed6d31b0-d2a3-4f18-9d16-7f72cbc88cb3') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9be86f34' and RowKey eq '9be86f34-865b-4c0f-8ab0-decf928dc4fc') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_97af3bdc' and RowKey eq '97af3bdc-b827-4451-9cc4-a8e7c1190d17') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9d557b56' and RowKey eq '9d557b56-279e-47fa-a104-c3ccbcc9b023') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_e251a31a' and RowKey eq 'e251a31a-1aaa-40a8-8cde-45134550235c')
I believe you should go with the multi-threaded implementation, as it consists of multiple point queries. Doing everything in a single query most likely results in a table scan. As the official doc mentions:
Using an "or" to specify a filter based on RowKey values results in a partition scan and is not treated as a range query. Therefore, you should avoid queries that use filters such as: $filter=PartitionKey eq 'Sales' and (RowKey eq '121' or RowKey eq '322')
You may think the example above consists of two point queries, but it actually results in a partition scan.
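If you want to build that grouped filter with the .NET SDK's helpers, one way would be the sketch below. It is based on the TableQueryConstructor method from the question; only the way the pairs are combined changes, since CombineFilters parenthesizes both operands:

public string GroupedTableQueryConstructor(Guid[] ids, string applicationLink)
{
    string partitionKey = HttpUtility.UrlEncode(applicationLink);
    string fullQuery = null;

    foreach (var id in ids)
    {
        // "(PartitionKey eq '...') and (RowKey eq '...')"
        string queryForRow = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, id.ToString()));

        // Combining with CombineFilters again keeps each pair grouped before the or.
        fullQuery = fullQuery == null
            ? queryForRow
            : TableQuery.CombineFilters(fullQuery, TableOperators.Or, queryForRow);
    }

    return fullQuery;
}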
To me the answer here seems to be that queries against Table Storage are not optimized to work with the OR operator the way you would expect. A query is not handled as a point query when point queries are combined with the OR operator.
This can be reproduced in Python, C# and Azure Storage Explorer: if you combine point queries with OR, the result can be 10x slower (or even more) than running the individual point queries, each of which returns only one row.
So the most efficient way to fetch a number of rows with known partition and row keys is to do it all with separate asynchronous point queries, in C# with TableOperation.Retrieve. Using TableQuery instead is highly inefficient and does not produce results anywhere near the Azure Table Storage scalability targets, which state for example: "Target throughput for a single table partition (1 KiB-entities): Up to 2,000 entities per second". Here I could not even get served 5 rows per second, even though all the rows were in different partitions.
This limitation in query performance is not stated explicitly in any documentation or performance optimization guide, but it can be understood from these lines in the Azure storage performance checklist:
Querying
This section describes proven practices for querying the table service.
Query scope
There are several ways to specify the range of entities to query. The following is a discussion of the uses of each.
In general, avoid scans (queries larger than a single entity), but if you must scan, try to organize your data so that your scans retrieve the data you need without scanning or returning significant amounts of entities you don't need.
Point queries
A point query retrieves exactly one entity. It does this by specifying both the partition key and row key of the entity to retrieve. These queries are efficient, and you should use them wherever possible.
Partition queries
A partition query is a query that retrieves a set of data that shares a common partition key. Typically, the query specifies a range of row key values or a range of values for some entity property in addition to a partition key. These are less efficient than point queries, and should be used sparingly.
Table queries
A table query is a query that retrieves a set of entities that does not share a common partition key. These queries are not efficient and you should avoid them if possible.
所以 "A point query retrieves exactly one entity" 和 "Use point queries when ever possible"。由于我已将数据拆分为多个分区,因此可能已将其处理为 table 查询:"A table query is a query that retrieves a set of entities that does not share a common partition key"。这虽然查询组合了一组点查询,因为它列出了所有预期实体的分区键和行键。但是由于组合查询不是只检索一个查询,因此不能期望它作为点查询(或一组点查询)执行。