Zero replacement for missing values in a JSON file with PySpark

The JSON looks like this:

{
"ThresholdTime": "48min", 
"FallTime": "Min", 
"description": "PowerAmplifier"
}
{
"ThresholdTime": "min", 
"FallTime": "200min", 
"description": "DolbyDigitall"
}

I am using regexp_extract to strip the alphabetic characters from the alphanumeric strings.

df.withColumn("NewThresholdTime", regexp_extract("ThresholdTime", r"(\d+)", 1))

How can I put a 0 where ThresholdTime or FallTime contains no time value?

The output should be:

+--------+-------------+-----------+----------------+
|FallTime|ThresholdTime|NewFallTime|NewThresholdTime|
+--------+-------------+-----------+----------------+
|     Min|        48min|          0|              48|
|  200min|          min|        200|               0|
+--------+-------------+-----------+----------------+

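Assuming we have a DataFrame with the values given in the JSON, you can check whether the column stays the same once its numerals are removed. If it does, the value contains no digits, so use 0; otherwise, strip the letters.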

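from pyspark.sql import functions as F

# Assumes an existing SQLContext named sqlContext, as in the PySpark shell
df = sqlContext.createDataFrame(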
    [{"ThresholdTime": "48min", 
      "FallTime": "15Min", 
      "description": "PowerAmplifier"
    },
    {"ThresholdTime": "min", 
     "FallTime": "200min", 
     "description": "DolbyDigitall"}])

# What would column look like without alphabets
col_without_alphabets = F.regexp_replace(df["ThresholdTime"], "[a-zA-Z]", "")

# What would column look like without numerals
col_without_numerals = F.regexp_replace(df["ThresholdTime"], "[0-9]", "")

# If removing the numerals leaves the column unchanged, it has no digits, so use 0; else remove alphabets
df.withColumn("NewThresholdTime",
              F.when(col_without_numerals == df["ThresholdTime"], 
                     F.lit(0))
              .otherwise(col_without_alphabets)).show()

Output:

+--------+-------------+--------------+----------------+
|FallTime|ThresholdTime|   description|NewThresholdTime|
+--------+-------------+--------------+----------------+
|   15Min|        48min|PowerAmplifier|              48|
|  200min|          min| DolbyDigitall|               0|
+--------+-------------+--------------+----------------+
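
To sanity-check the rule without a Spark session, the same logic can be sketched in plain Python with the standard re module. This is only an illustration of the per-value rule, not part of the Spark solution: the helper name digits_or_zero is hypothetical, and the sample records are copied from the JSON in the question (where FallTime is "Min" for the first record).

```python
import re

def digits_or_zero(value):
    """Mirror the when/otherwise rule above for a single string value."""
    without_numerals = re.sub(r"[0-9]", "", value)
    without_alphabets = re.sub(r"[a-zA-Z]", "", value)
    # Removing the digits changed nothing, so there were none: use "0".
    if without_numerals == value:
        return "0"
    # Otherwise keep what is left after stripping the letters.
    return without_alphabets

# Sample records copied from the JSON in the question.
records = [
    {"ThresholdTime": "48min", "FallTime": "Min", "description": "PowerAmplifier"},
    {"ThresholdTime": "min", "FallTime": "200min", "description": "DolbyDigitall"},
]

for record in records:
    for field in ["ThresholdTime", "FallTime"]:
        record["New" + field] = digits_or_zero(record[field])
    print(record["NewFallTime"], record["NewThresholdTime"])
```

With the question's data this prints "0 48" and then "200 0", which matches the NewFallTime and NewThresholdTime columns the question asks for.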

Extending the answer to scale to any number of columns.

Loop over whichever columns you want to apply the same operation to.

new_columns = []
for column in ["ThresholdTime", "FallTime"]:

    # What would column look like without alphabets
    col_without_alphabets = F.regexp_replace(df[column], "[a-zA-Z]", "")

    # What would column look like without numerals
    col_without_numerals = F.regexp_replace(df[column], "[0-9]", "")

    # If removing the numerals leaves the column unchanged, it has no digits, so use 0; else remove alphabets
    new_columns.append(F.when(col_without_numerals == df[column], 
                        F.lit(0)).otherwise(col_without_alphabets).alias("New{}".format(column)))

df.select(["*"] + new_columns).show()

Output:

+--------+-------------+--------------+----------------+-----------+
|FallTime|ThresholdTime|   description|NewThresholdTime|NewFallTime|
+--------+-------------+--------------+----------------+-----------+
|   15Min|        48min|PowerAmplifier|              48|         15|
|  200min|          min| DolbyDigitall|               0|        200|
+--------+-------------+--------------+----------------+-----------+