How to print out snippets of an RDD in the spark-shell / pyspark?
When working in the spark-shell I often want to inspect RDDs (similar to using head in unix).
For example:
scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> // how to inspect the readmeFile?
And...
scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> // how to inspect linesContainingSpark?
I found out how to do this (here) and figured it would be useful for other users, so I'm sharing it here. take(x) selects the first x items and foreach prints them:
scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> readmeFile.take(5).foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
And...
scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> linesContainingSpark.take(5).foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
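Relatedly, if you only need the very first element or the size of the RDD, the standard RDD actions first() and count() are lighter-weight than take(x); a quick sketch against the same RDDs:
scala> readmeFile.first()          // just the first line of the file
scala> linesContainingSpark.count() // how many lines matched the filter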
The examples below are equivalent, but use pyspark:
>>> readmeFile = sc.textFile("input/tmp/README.md")
>>> for line in readmeFile.take(5): print line
...
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
And...
>>> linesContainingSpark = readmeFile.filter(lambda line: "Spark" in line)
>>> for line in linesContainingSpark.take(5): print line
...
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
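Note: the pyspark examples above use the Python 2 print statement; on Python 3 (and pyspark shells running on it) write it as a function call instead, i.e. print(line).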