Splitting .ttl or .nt file - Spark Scala
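I'm new to Scala. I need to read a .ttl file line by line, split each line on specific delimiters, and extract the values into the corresponding columns of a DataFrame.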
< http://website/Jimmy_Carter> <http://web/name> "James Earl Carter, Jr."@ko .
< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> .
< http://website/Jimmy_Car> <http://web/birthPlace> <http://web/Georgia_(US)> .
I want this output:
+-------------------------+---------------------+-----------------------+
|S                        |P                    |O                      |
+-------------------------+---------------------+-----------------------+
|http://website/Jimmy_Car |http://web/name      |"James Earl Carter     |
|http:///website/Jimmy_Car|http://web/country   |http://web/country     |
|http://website/Jimmy_Car |http://web/birthPlace|http://web/Georgia_(US)|
+-------------------------+---------------------+-----------------------+
I tried this code:
case class T(S: Option[String], P: Option[String],O:Option[String])
val triples = sc.textFile("triples_test.ttl").map(_.split(" |\< |\> |\ . ")).map(p =>
T(Try(p(0).toString()).toOption,Try(p(1).toString()).toOption,Try(p(2).toString()).toOption)).toDF()
I got this result:
+--------------------------+----------------------+------------------------+
|S                         |P                     |O                       |
+--------------------------+----------------------+------------------------+
|<http://website/Jimmy_Car |<http://web/name      |"James                  |
|<http:///website/Jimmy_Car|<http://web/country   |<http://web/country     |
|<http://website/Jimmy_Car |<http://web/birthPlace|<http://web/Georgia_(US)|
+--------------------------+----------------------+------------------------+
To remove the "<" delimiter at the start of each triple, I added "|<" to the split:
val triples = sc.textFile("triples_test.ttl").map(_.split(" |\< |\> |\ . |<")).map(p =>
T(Try(p(0).toString()).toOption,Try(p(1).toString()).toOption,Try(p(2).toString()).toOption)).toDF()
I got this result:
+-+---------------------+-+
|S|P                    |O|
+-+---------------------+-+
| |http://web/name      | |
| |http://web/country   | |
| |http://web/birthPlace| |
+-+---------------------+-+
How can I solve this?
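One plausible cause of the empty columns, shown as a minimal plain-Scala sketch rather than the exact code above: when a delimiter such as `<` matches at the very start of the line, or directly after another delimiter, `String.split` returns an empty string for that position, so `p(0)` and `p(2)` may point at empty tokens. The object name `SplitDemo` and the simplified pattern `" |<"` are made up for illustration.

```scala
object SplitDemo {
  // Simplified version of the split pattern: a space or an opening angle bracket.
  def tokens(line: String): List[String] = line.split(" |<").toList

  // Dropping the empty tokens restores a predictable positional layout.
  def nonEmptyTokens(line: String): List[String] = tokens(line).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val line = "<http://s> <http://p> <http://o> ."
    println(tokens(line))         // positions 0, 2 and 4 are empty strings
    println(nonEmptyTokens(line)) // List(http://s>, http://p>, http://o>, .)
  }
}
```

Filtering out empty tokens helps with positions, but the trailing `>` and the spaces inside literals still need separate handling.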
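If it isn't clear how to replace your code with Spark's built-in regular-expression functions, the answer below shows one way. Make sure you understand how the regular expression works before relying on it, though.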
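import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// Sample rows, already split into subject, predicate and object
val df = Seq(
  ("< http://website/Jimmy_Carter>", "<http://web/name>", "\"James Earl Carter, Jr.\"@ko ."),
  ("< http://website/Jimmy_Car>", "<http://web/country>", "<http://website/United_States> ."),
  ("< http://website/Jimmy_Car>", "<http://web/birthPlace>", "<http://web/Georgia_(US)> .")
).toDF("S", "P", "O")

// Group 1 captures the value between the opening and closing delimiters
val url_regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$"""

val dfA = df.withColumn("S", regexp_extract($"S", url_regex, 1))
  .withColumn("P", regexp_extract($"P", url_regex, 1))
  .withColumn("O", regexp_extract($"O", url_regex, 1))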
This outputs:
+---------------------------+---------------------+----------------------------+
|S |P |O |
+---------------------------+---------------------+----------------------------+
|http://website/Jimmy_Carter|http://web/name |James Earl Carter |
|http://website/Jimmy_Car |http://web/country |http://website/United_States|
|http://website/Jimmy_Car |http://web/birthPlace|http://web/Georgia_(US) |
+---------------------------+---------------------+----------------------------+
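The example above starts from values that are already split into three columns. As a rough sketch of getting there from raw lines, assuming each line holds exactly one triple with an IRI subject and predicate, a single regular expression can capture all three parts at once. `TripleLine` and `parse` are hypothetical names, and the pattern is only an approximation of the N-Triples grammar, not a full parser.

```scala
object TripleLine {
  // <S> <P> followed by either <O> or a quoted literal with an optional
  // language tag or datatype, then the terminating " ."
  private val Line =
    """^\s*<\s*([^>]*)>\s+<\s*([^>]*)>\s+(?:<\s*([^>]*)>|"((?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s*\.\s*$""".r

  // Returns None for lines that do not look like a single triple.
  def parse(line: String): Option[(String, String, String)] = line match {
    // Exactly one of iri / lit is non-null, depending on the object form.
    case Line(s, p, iri, lit) => Some((s, p, if (iri != null) iri else lit))
    case _                    => None
  }

  def main(args: Array[String]): Unit = {
    println(parse("""< http://website/Jimmy_Carter> <http://web/name> "James Earl Carter, Jr."@ko ."""))
    println(parse("< http://website/Jimmy_Car> <http://web/birthPlace> <http://web/Georgia_(US)> ."))
  }
}
```

In Spark this might be applied with something like `sc.textFile("triples_test.ttl").flatMap(line => TripleLine.parse(line)).toDF("S", "P", "O")`, assuming `spark.implicits._` is imported. Unlike the desired output in the question, this keeps the full literal text and drops the quotes and language tag.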
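A brief explanation of how the regular expression works, even though that is not the topic of this post:
(?:"|<{1}\s?)
matches the start of a value: ", <, or < followed by one whitespace character
(.*)
captures the content into group 1
(?:>(?:\s\.)?|,\s.*)
matches the end of a value: >, > ., or ,\s.* (the last alternative handles the James Earl case)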
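To make the three parts concrete, here is a small plain-Scala sketch that applies the same pattern outside Spark. The `RegexDemo` object and `extract` helper are hypothetical names; `extract` returns an empty string when nothing matches, which mirrors what `regexp_extract` does in Spark.

```scala
object RegexDemo {
  // Same pattern as url_regex in the answer, compiled as a Scala Regex.
  private val UrlRegex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$""".r

  // Returns capture group 1, or "" when the value does not match.
  def extract(value: String): String = value match {
    case UrlRegex(inner) => inner
    case _               => ""
  }

  def main(args: Array[String]): Unit = {
    println(extract("< http://website/Jimmy_Carter>"))  // http://website/Jimmy_Carter
    println(extract("<http://web/Georgia_(US)> ."))     // http://web/Georgia_(US)
    println(extract("\"James Earl Carter, Jr.\"@ko .")) // James Earl Carter
    println(extract("\"Georgia\"@en ."))                // empty: no closing > and no ", "
  }
}
```

The last call suggests a limitation worth keeping in mind: a quoted literal that contains no comma followed by whitespace is not matched by this pattern at all.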
You can't read a Turtle file like this. Also, regular expressions are a very naive way to read N-Triples. Don't reinvent the wheel; use https://github.com/banana-rdf/banana-rdf