Processing an XML string inside a Spark UDF and returning a struct field
I have a DataFrame column named Body (String). The data in the Body column looks like this:
<p>I want to use a track-bar to change a form's opacity.</p>
<p>This is my code:</p>
<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>
<p>When I build the application, it gives the following error:</p>
<blockquote>
<p>Cannot implicitly convert type 'decimal' to 'double'.</p>
</blockquote>
<p>I tried using <code>trans</code> and <code>double</code> but then the
control doesn't work. This code worked fine in a past VB.NET project. </p>
,While applying opacity to a form should we use a decimal or double value?
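Using Body, I want to produce two separate columns, code and text. The code is whatever sits inside elements named code, and the text is everything else.
I created a UDF that looks like this: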
case class bodyresults(text:String,code:String)
val Body:String=>bodyresults=(body:String)=>{ val xmlbody=scala.xml.XML.loadString(body)
val code = (xmlbody \ "code").toString;
val text = "I want every thing else as text. what should I do"
(text,code)
}
val bodyudf=udf(Body)
val posts5=posts4.withColumn("codetext",bodyudf(col("Body")))
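This does not work. My questions are:
1. As you can see, the data has no root node. Can I still use Scala XML parsing?
2. How do I parse everything except the code as text?
Please let me know if there is anything wrong with my code.
Expected output: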
(code,text)
code = decimal trans = trackBar1.Value / 5000;this.Opacity = trans;trans double
text = everything else
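Instead of the string replacement used in the code below, you can also use a RewriteRule (scala.xml.transform) and override its transform method to empty the <pre> tags in your XML.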
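import org.apache.spark.sql.functions.udf
import scala.xml.XML

case class bodyresults(text: String, code: String)

val bodyudf = udf { (body: String) =>
  // Wrap the fragment in an explicit <body> root element so it parses as one XML document
  val xmlElems = XML.loadString(s"<body>$body</body>")
  // xmlElems is the <body> root itself, so search its descendants with \\ (a plain \ "body" finds nothing)
  val code = (xmlElems \\ "pre" \\ "code").text
  // Remove the code literally; replaceAll would interpret it as a regular expression
  val text = xmlElems.text.replace(code, "")
  bodyresults(text, code)
}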
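The RewriteRule alternative is mentioned in this answer but not shown. What follows is only a minimal sketch of how it might look, assuming scala-xml is on the classpath and that every Body value is well-formed XML once wrapped in a root element (real HTML with unclosed tags or entities such as `&nbsp;` would fail to parse). `SplitBody`, `dropCode`, and `split` are hypothetical names, not part of the original answer. Unlike the replacement approach, this sketch removes every `<code>` element, including inline ones.

```scala
import scala.xml.{Elem, Node, NodeSeq, XML}
import scala.xml.transform.{RewriteRule, RuleTransformer}

// Hypothetical helper, not part of the original answer.
object SplitBody {
  // Rule that drops every <code> element and leaves all other nodes untouched
  private val dropCode = new RewriteRule {
    override def transform(n: Node): Seq[Node] = n match {
      case e: Elem if e.label == "code" => NodeSeq.Empty
      case other                        => other
    }
  }

  /** Returns (text, code) for a fragment that has no single root element. */
  def split(body: String): (String, String) = {
    // Wrap the fragment so the parser sees exactly one root element
    val root = XML.loadString(s"<body>$body</body>")
    // Collect the text of every <code> element, block-level and inline alike
    val code = (root \\ "code").map(_.text.trim).mkString(" ")
    // Apply the rule to the whole tree, then read what is left as plain text
    val text = new RuleTransformer(dropCode).transform(root).map(_.text).mkString.trim
    (text, code)
  }
}
```

Inside `bodyudf`, `val (text, code) = SplitBody.split(body)` could stand in for the two extraction lines; the UDF would still return `bodyresults(text, code)`.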
This UDF returns a StructType, like:
org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,StructType(StructField(text,StringType,true), StructField(code,StringType,true)),List(StringType))
You can now call it on your DataFrame to create posts5, for example:
val posts5 = df.withColumn("codetext", bodyudf($"xml") )
posts5: org.apache.spark.sql.DataFrame = [xml: string, codetext: struct<text:string,code:string>]
To extract a specific column:
posts5.select($"codetext.code" ).show
+--------------------+
|                code|
+--------------------+
|decimal trans = t...|
+--------------------+
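The question asks for code and text as two separate columns, so both struct fields can be selected at once. The following is a rough end-to-end sketch, assuming a local SparkSession with Spark SQL and scala-xml on the classpath. `BodyResults`, `CodeTextDemo`, and `splitBody` are illustrative names, and the one-row input is made up for the demo.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import scala.xml.XML

// Top-level case class so Spark can derive the struct schema (text, code)
case class BodyResults(text: String, code: String)

object CodeTextDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder().master("local[1]").appName("codetext-demo").getOrCreate()
    import spark.implicits._

    // Same idea as the answer's UDF: wrap in a root element, pull out the code, drop it from the text
    val splitBody = udf { (body: String) =>
      val root = XML.loadString(s"<body>$body</body>")
      val code = (root \\ "pre" \\ "code").text
      BodyResults(root.text.replace(code, ""), code)
    }

    // Made-up sample row standing in for the real posts data
    val df = Seq("<p>This is my code:</p><pre><code>this.Opacity = trans;</code></pre>").toDF("Body")

    // Promote both struct fields to top-level columns
    df.withColumn("codetext", splitBody(col("Body")))
      .select(col("codetext.text").as("text"), col("codetext.code").as("code"))
      .show(truncate = false)

    spark.stop()
  }
}
```

Selecting `col("codetext.*")` instead would expand every field of the struct without naming each one.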