How to create variable PySpark DataFrames by dropping null columns
I have 2 JSON files in a relative folder named 'source_data':
"source_data/data1.json"
{
"name": "John Doe",
"age": 32,
"address": "ZYZ - Heaven"
}
"source_data/data2.json"
{
"userName": "jdoe",
"password": "password",
"salary": "123456789"
}
I created a DataFrame with the following PySpark code:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.read.json("source_data")
print(df.head())
Output:
df.head(10)
[Row(name='John Doe', age=32, address='ZYZ - Heaven', userName=None, password=None, salary=None),
Row(name=None, age=None, address=None, userName='jdoe', password='password', salary='123456789')]
Now I want to create a variable number of DataFrames by dropping the 'None' column values, like this:
df1.head()
[Row(name='John Doe', age=32, address='ZYZ - Heaven')]
And,
df2.head()
[Row(userName='jdoe', password='password', salary='123456789')]
I have only found solutions that drop entire rows based on all or any columns being null.
Is there any way to achieve what I want?
TIA
You can select the columns you need into separate DataFrames and then filter each one on a condition (the snippet below is Scala, but the same approach works in PySpark):
//source data
val df = spark.read.json("path")
//select and filter
val df1 = df.select("address","age","name")
.filter($"address".isNotNull || $"age".isNotNull || $"name".isNotNull)
val df2 = df.select("password","salary","userName")
.filter($"password".isNotNull || $"salary".isNotNull || $"userName".isNotNull)
//see the output as dataframe or using head as you want
println(df1.head)
df2.head
Output of the head command:
df1:
df2: