How can I load an XML string directly into a Dataset instead of loading it from an XML file?
I need a way to put an XML string directly into a Dataset, rather than loading it from a file.
SparkSession spark = SparkSession.builder().master("local").getOrCreate();
Dataset<Row> df = spark.read().format("com.databricks.spark.xml")
.option("rowTag", "book").load("books.xml");
df.show();
This only reads the XML from a file. Is there a way to feed an XML string directly into a Dataset? For example, using a string like the one below.
String xmlString = "<persons>\n" +
    "  <person id=\"1\">\n" +
    "    <firstname>James</firstname>\n" +
    "    <lastname>Smith</lastname>\n" +
    "    <middlename></middlename>\n" +
    "    <dob_year>1980</dob_year>\n" +
    "    <dob_month>1</dob_month>\n" +
    "    <gender>M</gender>\n" +
    "    <salary currency=\"Euro\">10000</salary>\n" +
    "  </person>\n" +
    "</persons>";
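The problem is that I don't want to use a file; I only want to use the string. I know I could save the string to an XML file and then load the newly created file. Is there any other way?
You can use the deprecated xmlRdd method
(this is the only solution I can see right now).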
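import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import com.databricks.spark.xml.XmlReader;

public static void readFromString() {
    SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
        .setMaster("local[2]").set("spark.executor.memory", "2g");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    String books = "<persons>\n" +
        "  <person id=\"1\">\n" +
        "    <firstname>James</firstname>\n" +
        "    <lastname>Smith</lastname>\n" +
        "    <middlename></middlename>\n" +
        "    <dob_year>1980</dob_year>\n" +
        "    <dob_month>1</dob_month>\n" +
        "    <gender>M</gender>\n" +
        "    <salary currency=\"Euro\">10000</salary>\n" +
        "  </person>\n" +
        "</persons>";
    // One RDD element holding the whole XML document, in a single partition.
    List<String> booksList = Arrays.asList(books);
    RDD<String> booksRDD = sc.parallelize(booksList, 1).rdd();
    // rowTag is "persons" so that each row keeps the nested person struct
    // that the schema and select("person.*") below rely on.
    Dataset<Row> rowDataset = new XmlReader().withRowTag("persons").xmlRdd(new SQLContext(sc), booksRDD);
    rowDataset.printSchema();
    rowDataset.select("person.*").show();
}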
Output of rowDataset.printSchema():
root
 |-- person: struct (nullable = true)
 |    |-- _id: long (nullable = true)
 |    |-- dob_month: long (nullable = true)
 |    |-- dob_year: long (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- salary: struct (nullable = true)
 |    |    |-- _VALUE: long (nullable = true)
 |    |    |-- _currency: string (nullable = true)
Output of rowDataset.select("person.*").show():
+---+---------+--------+---------+------+--------+----------+------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|      salary|
+---+---------+--------+---------+------+--------+----------+------------+
|  1|        1|    1980|    James|     M|   Smith|          |[10000,Euro]|
+---+---------+--------+---------+------+--------+----------+------------+
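As a side note, here is a rough sketch of reading XML from a String rather than a file, using only the JDK's DOM parser. It is not spark-xml and does not reproduce its type inference or nested structs. The class XmlStringRows, the method parseRows, and the flattened "salary._currency" key are hypothetical names made up for this illustration; only the leading underscore on attributes is borrowed from the schema above. It may help if you want to inspect or pre-process the string before handing it to Spark.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of spark-xml: parses XML held in a String
// (no file involved) and flattens each row element into a column -> value map.
class XmlStringRows {

    static List<Map<String, String>> parseRows(String xml, String rowTag) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // InputSource wraps a Reader, so the parser reads straight from memory.
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<Map<String, String>> rows = new ArrayList<>();
        NodeList rowNodes = doc.getElementsByTagName(rowTag);
        for (int i = 0; i < rowNodes.getLength(); i++) {
            Element rowElement = (Element) rowNodes.item(i);
            Map<String, String> row = new LinkedHashMap<>();
            putAttributes(rowElement, "", row);
            NodeList children = rowElement.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace-only text nodes between elements
                }
                Element field = (Element) child;
                row.put(field.getTagName(), field.getTextContent());
                putAttributes(field, field.getTagName() + ".", row);
            }
            rows.add(row);
        }
        return rows;
    }

    // Attributes get a leading underscore, echoing the _id / _currency names above.
    private static void putAttributes(Element element, String prefix, Map<String, String> row) {
        NamedNodeMap attrs = element.getAttributes();
        for (int k = 0; k < attrs.getLength(); k++) {
            Node attr = attrs.item(k);
            row.put(prefix + "_" + attr.getNodeName(), attr.getNodeValue());
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<persons>\n" +
            "  <person id=\"1\">\n" +
            "    <firstname>James</firstname>\n" +
            "    <salary currency=\"Euro\">10000</salary>\n" +
            "  </person>\n" +
            "</persons>";
        // Prints one map per <person> element.
        System.out.println(parseRows(xml, "person"));
    }
}
```

Every value stays a String here, so any numeric conversion (for example dob_year) would still have to be done separately.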
Hope this helps!