How to write dataframe with struct column into Elasticsearch via PySpark

Question

I am trying to write a dataframe that contains a struct column into Elasticsearch:

from pyspark.sql.functions import col, struct, to_json

df1 = spark.createDataFrame([{"date": "2020.04.10", "approach": "test", "outlier_score": 1, "a": "1", "b": 2},
                             {"date": "2020.04.10", "approach": "test", "outlier_score": 0, "a": "2", "b": 1}])

df1 = df1.withColumn('details', to_json(struct(
   col('a'),
   col('b')
)))

df1.show(truncate=False)

df1.select('date','approach','outlier_score','details').write.format("org.elasticsearch.spark.sql").option('es.resource', 'outliers').save(mode="append")

The result is:

+---+--------+---+----------+-------------+---------------+
|a  |approach|b  |date      |outlier_score|details        |
+---+--------+---+----------+-------------+---------------+
|1  |test    |2  |2020.04.10|1            |{"a":"1","b":2}|
|2  |test    |1  |2020.04.10|0            |{"a":"2","b":1}|
+---+--------+---+----------+-------------+---------------+

This works, but the JSON is escaped, so the corresponding details field is not clickable in Kibana:

{
  "_index": "outliers",
  "_type": "_doc",
  "_id": "NuDSA3IBhHa_VjuWENYR",
  "_version": 1,
  "_score": 0,
  "_source": {
    "date": "2020.04.10",
    "approach": "test",
    "outlier_score": 1,
    "details": "{\"a\":\"1\",\"b\":2}"
  },
  "highlight": {
    "date": [
      "@kibana-highlighted-field@2020.04.10@/kibana-highlighted-field@"
    ]
  }
}

I tried passing .option("es.input.json","true"), but got an exception:

org.elasticsearch.hadoop.rest.EsHadoopRemoteException: mapper_parsing_exception: failed to parse;org.elasticsearch.hadoop.rest.EsHadoopRemoteException: not_x_content_exception: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes

If I try to write the data without converting it to JSON, i.e. removing to_json( from the original code, I get a different exception:

org.elasticsearch.hadoop.rest.EsHadoopRemoteException: mapper_parsing_exception: failed to parse field [details] of type [text] in document with id 'TuDWA3IBhHa_VjuWFNmX'. Preview of field's value: '{a=2, b=1}';org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_state_exception: Can't get text on a START_OBJECT at 1:68
{"index":{}}
{"date":"2020.04.10","approach":"test","outlier_score":0,"details":{"a":"2","b":1}}

So the question is: how do I write a PySpark dataframe with a nested JSON column into Elasticsearch so that the JSON is not escaped?

Answer 1

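Writing the data without converting it to JSON (without to_json) should not actually raise an exception. The problem is that a mapping had already been created automatically for the details field when the escaped JSON string was written, so the field is mapped as text.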
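
To make the difference concrete, here is a minimal sketch using only Python's standard json module. It does not talk to Elasticsearch or Spark; the dictionaries simply mirror the _source from the question, and the variable names are made up for illustration.

```python
import json

# What to_json(struct(...)) yields: the details column is already a JSON string.
details_as_string = json.dumps({"a": "1", "b": 2}, separators=(",", ":"))

# Document sent when details is a string column: serializing the document
# escapes the quotes inside the string, and the field ends up mapped as text.
escaped_doc = json.dumps({"outlier_score": 1, "details": details_as_string})

# Document sent when details is a struct column: a nested JSON object.
nested_doc = json.dumps({"outlier_score": 1, "details": {"a": "1", "b": 2}})

print(escaped_doc)
print(nested_doc)
```

The first document carries details as one string, the second as an object with its own a and b fields. Once the index has seen the first form, the second form no longer fits the existing mapping.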

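To fix the exception, delete or recreate the index. After that, the mapping for the details field will be created automatically as an object. Alternatively, you can delete all records that have a details field and then change the mapping of this field to the object type.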
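
A rough sketch of the first option, under some assumptions that are not in the question: the cluster is reachable without authentication at http://localhost:9200, and the helper names (delete_index, es_write_options, write_with_struct) are invented here for illustration. Only es.resource and the outliers index come from the original code.

```python
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
INDEX = "outliers"                # index name used in the question


def delete_index(es_url=ES_URL, index=INDEX):
    """Drop the index so the old text mapping for details goes away."""
    request = urllib.request.Request(f"{es_url}/{index}", method="DELETE")
    with urllib.request.urlopen(request) as response:
        return response.status


def es_write_options(index=INDEX):
    """Connector options; es.resource is the only one the question uses."""
    return {"es.resource": index}


def write_with_struct(df, index=INDEX):
    """Write df with details as a struct column instead of a JSON string."""
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql.functions import col, struct

    out = df.withColumn("details", struct(col("a"), col("b")))
    (out.select("date", "approach", "outlier_score", "details")
        .write.format("org.elasticsearch.spark.sql")
        .options(**es_write_options(index))
        .mode("append")
        .save())
```

Calling delete_index() once and then write_with_struct(df1) with the df1 from the question should let Elasticsearch create details as an object on the first write. Keep in mind that deleting the index also removes every document already stored in it.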

elasticsearch

pyspark

elasticsearch-hadoop

elasticsearch-spark

pyspark-dataframes