Splitting a .ttl or .nt file - Spark Scala

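I am new to Scala. I need to read a ttl file line by line, split each line on specific delimiters, and extract the values into the corresponding columns of a DataFrame.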

< http://website/Jimmy_Carter> <http://web/name> "James Earl Carter, Jr."@ko .
< http://website/Jimmy_Car> <http://web/country> <http://website/United_States> .
< http://website/Jimmy_Car> <http://web/birthPlace> <http://web/Georgia_(US)> .

I want this output:

+------------------------+---------------------+----------------------------+
|S                       |P                    |O                           |
+------------------------+---------------------+----------------------------+
|http://website/Jimmy_Car|http://web/name      |"James Earl Carter          |
|http://website/Jimmy_Car|http://web/country   |http://website/United_States|
|http://website/Jimmy_Car|http://web/birthPlace|http://web/Georgia_(US)     |
+------------------------+---------------------+----------------------------+

I tried this code:

case class T(S: Option[String], P: Option[String],O:Option[String])


 val triples = sc.textFile("triples_test.ttl").map(_.split(" |\< |\> |\ . ")).map(p => 
  T(Try(p(0).toString()).toOption,Try(p(1).toString()).toOption,Try(p(2).toString()).toOption)).toDF()

I got this result:

+-------------------------+----------------------+------------------------+
|S                        |P                     |O                       |
+-------------------------+----------------------+------------------------+
|<http://website/Jimmy_Car|<http://web/name      |"James                  |
|<http://website/Jimmy_Car|<http://web/country   |<http://web/country     |
|<http://website/Jimmy_Car|<http://web/birthPlace|<http://web/Georgia_(US)|
+-------------------------+----------------------+------------------------+

To remove the "<" delimiter at the start of each triple, I added "|<" to the split pattern:

 val triples = sc.textFile("triples_test.ttl").map(_.split(" |\< |\> |\ . |<")).map(p => 
  T(Try(p(0).toString()).toOption,Try(p(1).toString()).toOption,Try(p(2).toString()).toOption)).toDF()

I got this result:

+---+---------------------+---+
|S  |P                    |O  |
+---+---------------------+---+
|   |http://web/name      |   |
|   |http://web/country   |   |
|   |http://web/birthPlace|   |
+---+---------------------+---+

How can I solve this problem?
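For what it's worth, the empty columns from the second attempt can be reproduced without Spark. The sketch below is only an illustration: `SplitDemo` is a made-up name, and the pattern is my assumption of what the question's second split pattern means once the backslashes are dropped. Because a bare "<" is now a delimiter of its own, every "<" produces a split with nothing in front of it, so the array starts with an empty string and has another empty string between consecutive IRIs. That leaves p(0) and p(2) empty.

```scala
object SplitDemo {
  // One line of the sample input, written without a space after "<".
  val line = "<http://website/Jimmy_Car> <http://web/country> <http://website/United_States> ."

  // Assumed equivalent of the question's second pattern, without the backslashes.
  val tokens: Array[String] = line.split(" |< |> | . |<")

  def main(args: Array[String]): Unit =
    // Print each token in brackets so that the empty strings are visible.
    println(tokens.map(t => s"[$t]").mkString(" "))
}
```

Splitting on every delimiter character is fragile for this reason; matching the whole line against one pattern, or using an RDF parser, avoids the positional guesswork.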

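If you are unsure how to replace your code with Spark's built-in regular-expression functions, see the answer below. Make sure you understand how the regular expression works before using this approach, though.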

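import org.apache.spark.sql.functions.regexp_extract
// Needed for toDF and the $"col" syntax; assumes a SparkSession named `spark` (as in spark-shell).
import spark.implicits._

// Sample rows, already split into subject, predicate, and object.
val df = Seq(
        ("< http://website/Jimmy_Carter>", "<http://web/name>", "\"James Earl Carter, Jr.\"@ko ."),
        ("< http://website/Jimmy_Car>", "<http://web/country>", "<http://website/United_States> ."),
        ("< http://website/Jimmy_Car>", "<http://web/birthPlace>", "<http://web/Georgia_(US)> .")
    ).toDF("S", "P", "O")

// Group 1 is the value without the surrounding delimiters.
val url_regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$"""
val dfA = df.withColumn("S", regexp_extract($"S", url_regex, 1))
            .withColumn("P", regexp_extract($"P", url_regex, 1))
            .withColumn("O", regexp_extract($"O", url_regex, 1))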

This outputs:

+---------------------------+---------------------+----------------------------+
|S                          |P                    |O                           |
+---------------------------+---------------------+----------------------------+
|http://website/Jimmy_Carter|http://web/name      |James Earl Carter           |
|http://website/Jimmy_Car   |http://web/country   |http://website/United_States|
|http://website/Jimmy_Car   |http://web/birthPlace|http://web/Georgia_(US)     |
+---------------------------+---------------------+----------------------------+
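The answer above starts from a DataFrame that already has three columns, whereas the question starts from raw lines. A minimal sketch of that missing step, assuming every line holds exactly one triple with the subject and predicate in angle brackets, could match the whole line against a single pattern. `TripleLine`, `Triple`, and `parse` are hypothetical names, not part of the original answer, and unlike `url_regex` this sketch keeps the full literal text instead of cutting it at the first comma.

```scala
object TripleLine {
  // Hypothetical container for one parsed line.
  final case class Triple(s: String, p: String, o: String)

  // Subject and predicate are IRIs in angle brackets (an optional space after "<"
  // is tolerated); the object is whatever remains before the terminating " .".
  private val LinePattern = """^<\s?([^>]*)>\s+<\s?([^>]*)>\s+(.*?)\s*\.\s*$""".r
  private val Iri = """^<\s?(.*)>$""".r
  // Text between the first and the last double quote; a language tag after it is dropped.
  private val Literal = """^"(.*)".*$""".r

  // Removes the angle brackets from an IRI or the quotes from a literal.
  private def clean(o: String): String = o match {
    case Iri(value)     => value
    case Literal(value) => value
    case other          => other
  }

  // Returns None for lines that do not look like a single triple.
  def parse(line: String): Option[Triple] = line match {
    case LinePattern(s, p, o) => Some(Triple(s, p, clean(o)))
    case _                    => None
  }
}
```

With Spark this would presumably be used as `sc.textFile("triples_test.ttl").flatMap(line => TripleLine.parse(line)).toDF()`, so that malformed lines are skipped instead of producing empty columns.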

A brief explanation of how the regular expression works, even though it is not the topic of this post:

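  1. (?:"|<{1}\s?) matches a value that starts with " or with <, optionally followed by one whitespace character
  2. (.*) captures the content into group 1
  3. (?:>(?:\s\.)?|,\s.*) matches a value that ends with >, with > ., or with ,\s.* (a comma, whitespace, then anything); the last alternative covers the James Earl Carter case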
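To see these parts at work outside Spark, here is a small sketch that applies the same pattern to single values in plain Scala. `UrlRegexDemo` and `extract` are made-up names; the function is meant to mimic `regexp_extract(col, url_regex, 1)`, which yields an empty string when the pattern does not match.

```scala
object UrlRegexDemo {
  // Same pattern as url_regex in the answer.
  private val UrlRegex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$""".r

  // Returns capture group 1, or "" when the value does not match.
  def extract(value: String): String =
    UrlRegex.findFirstMatchIn(value).map(_.group(1)).getOrElse("")

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "< http://website/Jimmy_Carter>",
      "<http://web/Georgia_(US)> .",
      "\"James Earl Carter, Jr.\"@ko ."
    )
    samples.foreach(v => println(s"$v -> ${extract(v)}"))
  }
}
```

Note that the greedy (.*) combined with the ,\s.* alternative cuts the literal at its comma, which is why "Jr." is missing from the output table.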

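You cannot read Turtle files like this: Turtle allows prefixes, abbreviated predicate and object lists, and statements that span several lines, so a line-by-line split will not work in general. Also, regular expressions are a very naive way to read N-Triples. Don't reinvent the wheel; use https://github.com/banana-rdf/banana-rdf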