BigTable: Is there a better approach to get unique values from partial row keys?

I created a Bigtable table whose row keys have the form <name>#<date>#<id_value>, and I want to get the unique IDs when filtering by a row key prefix, as shown below.

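from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

# project_id, instance_id and table_id are placeholders for your own values
client = bigtable.Client(project=project_id, admin=True)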
instance = client.instance(instance_id)
table = instance.table(table_id)
prefix = "phone#20190501"
end_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)

# example row keys = ['phone#20190501#<id_value>', 'phone#20190501#<id_value>'...]

row_set = RowSet()
row_set.add_row_range_from_keys(prefix.encode("utf-8"),
                                end_key.encode("utf-8"))

rows = table.read_rows(row_set=row_set)
id_values = set()
for row in rows:
    # row_key is bytes: decode it, then drop the prefix and the '#' after it
    # so that only <id_value> remains
    id_value = row.row_key.decode("utf-8")[len(prefix) + 1:]
    id_values.add(id_value)
print('COUNT: %s' % len(id_values))

However, if I read more than 100 million rows, I suspect this way of computing the unique id_value set will use a lot of memory and CPU.

Is there a better way to count unique IDs in Bigtable, or something like the "UNIQUE" function in standard SQL?

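Bigtable cannot sort or deduplicate the way SQL does; that has to be done in code on the client side. However, there are some performance considerations that may help you. You can find them below: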

  1. If query speed is a must, loading the data into BigQuery instead of setting up an external data source would be the most efficient way. Nevertheless, there are some things you can do to improve BigQuery or BigTable performance.

  2. The external data source connector is still in Beta and has some performance considerations. We should also take into account that BigTable is a NoSQL (non-relational) database and is not intended for SQL queries. If you are still exploring the data model you want to use in your application, I recommend you consider all these options and choose the one that best fits your needs.

  3. I would say BigTable is not a good choice if you want to query your data using SQL. Given the non-relational architecture of BigTable, the most effective way to read your data is to send read requests. You can find code samples for this, in different languages, in the official documentation.
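
As a rough illustration of the read-request approach, here is a minimal sketch that streams only row keys and deduplicates them on the client. It assumes the `google-cloud-bigtable` Python client, a `table` object like the one in the question, and that `<id_value>` never contains a `#`. The helper names (`prefix_end_key`, `iter_row_keys`, `unique_ids`) are made up for this example and are not part of the library.

```python
from typing import Iterable, Iterator, Set


def prefix_end_key(prefix: bytes) -> bytes:
    """Return the smallest key that sorts after every key starting with prefix."""
    # Assumption: the last byte of the prefix is not 0xFF.
    return prefix[:-1] + bytes([prefix[-1] + 1])


def iter_row_keys(table, prefix: bytes) -> Iterator[bytes]:
    """Stream the row keys under prefix, asking the server to drop cell values."""
    # Imported lazily so the pure-Python helpers work without the client library.
    from google.cloud.bigtable import row_filters
    from google.cloud.bigtable.row_set import RowSet

    row_set = RowSet()
    row_set.add_row_range_from_keys(prefix, prefix_end_key(prefix))

    # Keep a single cell per row and strip its value, so that little more
    # than the row key is sent back for each row.
    key_only = row_filters.RowFilterChain(filters=[
        row_filters.CellsRowLimitFilter(1),
        row_filters.StripValueTransformerFilter(True),
    ])
    for row in table.read_rows(row_set=row_set, filter_=key_only):
        yield row.row_key


def unique_ids(row_keys: Iterable[bytes]) -> Set[str]:
    """Collect the distinct <id_value> parts (the last '#'-separated segment)."""
    ids = set()
    for key in row_keys:
        ids.add(key.rsplit(b"#", 1)[-1].decode("utf-8"))
    return ids
```

With these helpers, `len(unique_ids(iter_row_keys(table, b"phone#20190501#")))` would give the count for one name and date. The rows are consumed one at a time, so memory grows with the number of distinct IDs rather than the number of rows read. Note also that row keys are unique in Bigtable, so when the prefix already fixes both `<name>` and `<date>`, every row under it has a different `<id_value>` and simply counting the rows is enough; the set is only needed for broader prefixes such as `phone#`.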