从 PySpark 查询日期之间的 Vertica

Question

我有 Spark 1.6 运行超过 Python 3.4，从我的 Vertica 数据库中检索数据以处理下面的查询，Spark DataFrames 支持使用 JDBC 源的谓词下推但术语 predicate 用于严格的 SQL 含义。这意味着它只涵盖 WHERE 子句。此外，它看起来仅限于逻辑连词（恐怕没有 IN 和 OR）和简单的谓词，它显示了这个错误：java.lang.RuntimeException: 未指定选项 'dbtable'

数据库中包含大约1000亿的海量数据，我无法检索数据 spark1.6 不允许我使用仅查询 dbtable 作为 schema.table，我得到以下错误：

java.lang.RuntimeException: Option 'dbtable' not specified

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = "*******"
properties = {"user": "*****", "password": "*******", "driver": "com.vertica.jdbc.Driver" }

df = sqlContext.read.format("JDBC").options(
    url = url,
    query = "SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK,(connections_out) AS CONNECTION FROM traffic.stats WHERE DATE(time_stamp) between '2019-01-25' AND '2019-01-29'",
    **properties
).load()

df.show()

我已经尝试了下面的查询但没有结果在没有使用限制函数的情况下花费了很长时间

query = "SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK,(connections_out) AS CONNECTION FROM traffic.stats WHERE date(time_stamp) between '2019-01-27' AND '2019-01-29'"
df = sqlContext.read.format("JDBC").options(
    url = url,
    dbtable="( " + query + " ) as temp",
    **properties
).load()

是否可以像上面那样读取数据或使用特定查询将其读取为数据帧？

我试图通过设置更多条件和限制来减少时间，但它在 $\conditions 上被拒绝，即使删除它给我的条件 "Subquery in FROM must have an alias"，这是查询：

SELECT min(date(time_stamp)) AS mindate,max(date(time_stamp)) AS maxdate,count (distinct date(time_stamp)) AS noofdays, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, sum(bytes_in) AS DL, sum(bytes_out) AS UL, sum(connections_out) AS conn from traffic.stats where SUBSCRIBER like '41601%' and date(time_stamp) between '2019-01-25' and '2019-01-29'and signature_service_category = 'Web Browsing' and (signature_service_name = 'SSL v3' or signature_service_name = 'HTTP2 over TLS') and server_hostname not like '%.googleapis.%' and server_hostname not like '%.google.%' and server_hostname <> 'doubleclick.net'  and server_hostname <> 'youtube.com'  and server_hostname <> 'googleadservices.com'  and server_hostname <> 'app-measurement.com' and server_hostname <> 'gstatic.com' and server_hostname <> 'googlesyndication.com' and server_hostname <> 'google-analytics.com'  and server_hostname <> 'googleusercontent.com'  and server_hostname <> 'ggpht.com'  and server_hostname <> 'googletagmanager.com' and server_hostname is not null group by subscriber, server_hostname

Answer 1

如果查询在日期范围之间过滤需要一个多小时，您应该考虑编写预测。

CREATE PROJECTION traffic.status_date_range
(
  time_stamp,
  subscriber,
  server_hostname,
  bytes_in,
  bytes_out,
  connections_out
)
AS
  SELECT
    time_stamp,
    subscriber,
    server_hostname,
    bytes_in,
    bytes_out,
    connections_out
  FROM traffic.stats
  ORDER BY time_stamp
SEGMENTED BY HASH(time_stamp) ALL NODES;

像这样创建特定于查询的投影可能会增加大量磁盘空间 space，但如果性能对您来说真的很重要，那么这可能是值得的。

我还建议您对 table 进行分区，如果您还没有这样做的话。根据 traffic.stats table 中有多少个不同的日期，您可能不想按天进行分区。每个分区至少创建 1 个 ROS 容器（有时更多）。因此，如果您有 1024 个或更多不同的日期，那么 Vertica 甚至不允许您按日期分区，在这种情况下您可以按月分区。如果您使用的是 Vertica 9，那么您可以利用分层分区（您可以阅读相关内容 here）。

我要提醒的是，在运行之后重组 table ALTER TABLE 语句以添加分区子句将需要大量磁盘 space，因为 Vertica 写入数据到新文件。完成后，table 将占用与现在几乎相同数量的 space，但在分区时，您的磁盘 space 可能会变得非常大。

从 PySpark 查询日期之间的 Vertica

Query Vertica between dates from PySpark

vertica

apache-spark

pyspark

pyspark-sql