Reading an XML tag value with spark-xml: I want the value, but it gives me a list
<row id='185685445477437.020001' xml:space='preserve'>
<c2>KH0013001</c2>
<c3>-2271164.00</c3>
<c4>9</c4>
<c7>65395</c7>
<c9>1</c9>
<c12>KHR</c12>
<c16>TR</c16>
<c17>6-71-10-1-001-030</c17>
<c20>1</c20>
<c22>1</c22>
<c23>DC183050001030071</c23>
<c24>DC</c24>
<c25>20181101</c25>
<c26>185685445477437.02</c26>
<c26 m='3'>1</c26>
<c29>1</c29>
<c30>5011_DMUSER__OFS_DM.OFS.SRC.VAL</c30>
<c31>1811012130</c31>
<c32>6010_DMUSER</c32>
<c56>PL.65395.......1.....KH0013001</c56>
<c98></c98>
</row>
Spark code using spark-xml:
import org.apache.spark.sql.SparkSession

object sparkXml {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      //.config("spark.debug.maxToStringFields", "100")
      .appName("Insight Application Big Data")
      .getOrCreate()

    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "row")
      .load("src/main/resources/in/FBNK_CATEG_ENTRY.xml")

    df.createOrReplaceTempView("categ_entry")
    df.printSchema()
    spark.sql("SELECT c26['_VALUE'] AS value, c26['_m'] AS option FROM categ_entry").show()
  }
}
printSchema
root
|-- _id: double (nullable = true)
|-- _space: string (nullable = true)
|-- c12: string (nullable = true)
|-- c16: string (nullable = true)
|-- c17: string (nullable = true)
|-- c2: string (nullable = true)
|-- c20: long (nullable = true)
|-- c22: long (nullable = true)
|-- c23: string (nullable = true)
|-- c24: string (nullable = true)
|-- c25: long (nullable = true)
|-- c26: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _VALUE: double (nullable = true)
| | |-- _m: long (nullable = true)
|-- c29: long (nullable = true)
|-- c3: double (nullable = true)
|-- c30: string (nullable = true)
|-- c31: long (nullable = true)
|-- c32: string (nullable = true)
|-- c4: long (nullable = true)
|-- c56: string (nullable = true)
|-- c7: long (nullable = true)
|-- c9: long (nullable = true)
|-- c98: string (nullable = true)
Result after running:
+--------------------+------+
|               value|option|
+--------------------+------+
|[1.85685445477437...| [, 3]|
+--------------------+------+
I want the result to look like this:
+--------------------+------+
| 185685445477437.02 | 3 |
+--------------------+------+
Can anyone please guide me on how to correct the code to produce the expected result?
The way your data is structured makes it hard to produce exactly the output you want.
<c26>185685445477437.02</c26>
<c26 m='3'>1</c26>
You have two c26 tags, so Spark treats them as an array. You want 185685445477437.02 from the first c26's _VALUE and 3 from the second c26's m attribute, which may not be what you actually intend.
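If that pairing is truly what you want (the first c26's _VALUE together with the second c26's m attribute), you can index into the array directly instead of selecting it whole. A minimal sketch against the same temp view, assuming every row carries exactly those two c26 elements in document order:

// Hypothetical sketch: element 0 carries the value, element 1 carries the m attribute.
// Spark SQL arrays are 0-indexed; a missing element yields null rather than an error.
spark.sql("SELECT c26[0]['_VALUE'] AS value, c26[1]['_m'] AS option FROM categ_entry").show(false)

Because _VALUE was inferred as double, the value will still print in scientific notation; the schema note at the end shows one way around that.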
If instead you are looking for the output below, replace your last line with the following statements:
import org.apache.spark.sql.functions.{col, explode}

val df2 = df.withColumn("c26Struct", explode(col("c26")))
df2.select(col("c26Struct._VALUE").alias("value"), col("c26Struct._m").alias("option")).show(false)
+---------------------+------+
|value |option|
+---------------------+------+
|1.8568544547743703E14|null |
|1.0 |3 |
+---------------------+------+
Here I created a new column, c26Struct, by exploding the c26 array, and then selected the value and the attribute from that exploded column.
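A side note on the first output row: _VALUE was inferred as double, which is why 185685445477437.02 prints as 1.8568544547743703E14 (and loses precision on the way). If you need the original text, one option is to hand spark-xml an explicit schema so the field stays a string. A hedged sketch, declaring only the c26 column for brevity; whether the remaining cN columns must also be declared or are simply dropped depends on your spark-xml version, so verify before relying on it:

import org.apache.spark.sql.types._

// Assumed partial schema: _VALUE declared as string so the raw text survives intact.
val c26Type = ArrayType(StructType(Seq(
  StructField("_VALUE", StringType),
  StructField("_m", LongType))))

val dfTyped = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "row")
  .schema(StructType(Seq(StructField("c26", c26Type))))
  .load("src/main/resources/in/FBNK_CATEG_ENTRY.xml")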
Hope this helps!
Thanks,
Naveen