How to filter JSON data by multi-value column
With the help of Spark SQL, I am trying to filter out all business items that belong to a particular group of categories.
The data is loaded from a JSON file:
businessJSON = os.path.join(targetDir, 'yelp_academic_dataset_business.json')
businessDF = sqlContext.read.json(businessJSON)
The schema of the file is as follows:
businessDF.printSchema()
root
|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
..
|-- type: string (nullable = true)
I am trying to extract all businesses related to restaurants:
restaurants = businessDF[businessDF.categories.inSet("Restaurants")]
But it doesn't work because, as I understand it, the column is expected to be of type string, while in my case it is an array. It raises an exception:
Py4JJavaError: An error occurred while calling o1589.filter.
: org.apache.spark.sql.AnalysisException: invalid cast from string to array<string>;
Could you please suggest any other way to get what I want?
What about a UDF?
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import BooleanType
# Returns True when val is an element of the array column xs
contains = udf(lambda xs, val: val in xs, BooleanType())
df = sqlContext.createDataFrame([Row(categories=["foo", "bar"])])
df.select(contains(df.categories, lit("foo"))).show()
## +----------------------------------+
## |PythonUDF#<lambda>(categories,foo)|
## +----------------------------------+
## | true|
## +----------------------------------+
df.select(contains(df.categories, lit("foobar"))).show()
## +-------------------------------------+
## |PythonUDF#<lambda>(categories,foobar)|
## +-------------------------------------+
## | false|
## +-------------------------------------+
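Applied to your data, the same UDF can drive a filter (a sketch, assuming the businessDF schema shown in the question; note the lambda assumes categories is never null, so you may want to guard against null rows first):
restaurants = businessDF.where(contains(businessDF.categories, lit("Restaurants")))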
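If you are on Spark 1.5 or later, the built-in array_contains function expresses the same membership check without the overhead of a Python UDF (a minimal sketch under the same assumptions about businessDF):
from pyspark.sql.functions import array_contains
# Keep rows whose categories array contains the literal "Restaurants"
restaurants = businessDF.where(array_contains(businessDF.categories, "Restaurants"))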