How to use index in filter in a "dask-sql" SQL query

I created a sample dask dataframe with a timestamp as the index.

import dask

df = dask.datasets.timeseries()

df.head()
                       id      name         x         y
timestamp                                              
2000-01-01 00:00:00   915   Norbert -0.989381  0.974546
2000-01-01 00:00:01  1026     Zelda  0.919731  0.656581
2000-01-01 00:00:02  1003  Patricia -0.128303 -0.354592
2000-01-01 00:00:03   986     Jerry  0.557732  0.160812

Now I want to filter on the index in a SQL query using dask-sql. But this does not work:

from dask_sql import Context

c = Context()
c.create_table("mytab", df)

result = c.sql("""
        SELECT
            count(*)
        FROM mytab
        WHERE "timestamp" > '2000-01-01 00:00:00'
    """)
print(result.compute())

The error message is:

Traceback (most recent call last):
  File "/opt/dask_sql/startup_script.py", line 15, in <module>
    result = c.sql("""
  File "/opt/dask_sql/dask_sql/context.py", line 458, in sql
    rel, select_names, _ = self._get_ral(sql)
  File "/opt/dask_sql/dask_sql/context.py", line 892, in _get_ral
    raise ParsingException(sql, str(e.message())) from None
dask_sql.utils.ParsingException: Can not parse the given SQL: From line 4, column 15 to line 4, column 25: Column 'timestamp' not found in any table

The problem is probably somewhere here:

    
            SELECT count(*)
            FROM timeseries
            WHERE "timestamp" > '2000-01-01'
                  ^^^^^^^^^^^

I am using this Docker image: nbraun/dask-sql:2022.1.0

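Is there an efficient way to get all rows based on a filter on the index? It is important that this can be done in dask-sql, because I need to execute the SQL through the presto endpoint provided by dask-sql-server.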

dask-sql does not seem to recognise "timestamp" as a column, because it is the name of the index, so one workaround is to use reset_index:

import dask
import dask.dataframe as dd

from dask_sql import Context


ddf = dask.datasets.timeseries()


c = Context()
c.create_table("mytab", ddf.reset_index())


result = c.sql("""
        SELECT
            count(*)
        FROM mytab
        WHERE "timestamp" > '2000-01-01 00:00:00'
    """)

print(result.compute())
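To see why reset_index helps, here is a minimal sketch in plain pandas rather than Dask (the three-row frame is made up for illustration). It only shows that an index name is not part of the column list until the index is reset. That dask-sql registers only the columns is an assumption based on the error message, not something confirmed here.

```python
import pandas as pd

# Small illustrative frame with a named DatetimeIndex, similar in shape to
# dask.datasets.timeseries(); the values are made up.
index = pd.DatetimeIndex(
    ["2000-01-01 00:00:00", "2000-01-01 00:00:01", "2000-01-01 00:00:02"],
    name="timestamp",
)
df = pd.DataFrame({"id": [915, 1026, 1003]}, index=index)

# The index name is not listed among the columns.
has_column_before = "timestamp" in df.columns

# reset_index() moves the index into an ordinary column.
flat = df.reset_index()
has_column_after = "timestamp" in flat.columns

print(has_column_before, has_column_after)
print(list(flat.columns))
```

After reset_index the frame no longer has a sorted datetime index, so index-based optimisations on the Dask side may not apply to the registered table.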

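In this specific example, we then get TypeError('Invalid comparison between dtype=datetime64[ns] and datetime') because pandas/Dask store the column as datetime64[ns]. You can convert the "timestamp" column to formatted date strings with something like the following: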

import datetime

c.create_table("mytab", ddf.reset_index().assign(timestamp = lambda df: df["timestamp"].apply(lambda x: x.strftime('%Y-%m-%d'), meta=('timestamp', 'object'))))

Or, equivalently:

ddf_new = ddf.reset_index()

ddf_new["timestamp"] = ddf_new["timestamp"].apply(lambda x: x.strftime('%Y-%m-%d'), meta=('timestamp', 'object'))

c.create_table("mytab", ddf_new)
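One caveat with the '%Y-%m-%d' format is that it drops the time of day. The sketch below uses only the Python standard library to show how date-only strings compare against the literal from the query. It assumes that, once the column holds strings, the filter behaves like an ordinary lexicographic string comparison; this has not been verified against dask-sql itself. The variable names are made up for illustration.

```python
from datetime import datetime, timedelta

# Four one-second timestamps, mirroring the head() output in the question.
start = datetime(2000, 1, 1)
stamps = [start + timedelta(seconds=i) for i in range(4)]

# The literal used in the WHERE clause.
cutoff = "2000-01-01 00:00:00"

# Date-only strings lose the time of day. "2000-01-01" is a strict prefix
# of the cutoff, so it sorts before it and no row on that day matches.
date_only = [ts.strftime("%Y-%m-%d") for ts in stamps]
matches_date_only = [s for s in date_only if s > cutoff]

# Strings in the same layout as the cutoff keep chronological order
# under lexicographic comparison.
full = [ts.strftime("%Y-%m-%d %H:%M:%S") for ts in stamps]
matches_full = [s for s in full if s > cutoff]

print(len(matches_date_only))
print(len(matches_full))
```

If second-level filtering matters, formatting with '%Y-%m-%d %H:%M:%S' in the strftime call is one way to keep the string order consistent with the time order.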

I'd also encourage you to open an issue on the dask-sql issue tracker to reach the team directly. :)