Databricks display() function: equivalent or alternative for Jupyter

Question

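I am migrating my current Databricks Spark notebooks to Jupyter notebooks. Databricks provides a convenient, good-looking display(data_frame) function that can visualize Spark DataFrames and RDDs, but there is no direct equivalent for Jupyter (I am not sure, but I think it is a Databricks-specific feature). I tried: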

dataframe.show()

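But that is only a plain-text rendering, and it breaks when you have many columns, so I am trying to find an alternative to display() that renders Spark DataFrames better than show() does. Is there any equivalent or alternative?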

Answer 1

Try Apache Zeppelin (https://zeppelin.apache.org/). It has some nice standard visualizations for DataFrames, especially if you use the sql interpreter. It also supports other useful interpreters.

Answer 2

In recent IPython, you can just use display(df) if df is a pandas DataFrame, and it will just work. On older versions you might need to run from IPython.display import display first. A DataFrame is also displayed automatically if it is the result of the last expression of a cell. For example, this notebook. Of course, the representation depends on the library you use to create your DataFrame. If you are using PySpark and it does not define a nice representation by default, then you will need to teach IPython how to display the Spark DataFrame. For example, here is a project that teaches IPython how to display Spark contexts and Spark sessions.
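
One way to "teach" IPython is to give it an object with a _repr_html_ method, which Jupyter calls when that object is the last expression of a cell. The following is only a minimal sketch, not the linked project's code: SparkTable is a hypothetical wrapper name, and it assumes nothing about the DataFrame beyond df.columns and df.limit(n).collect().

```python
from html import escape


class SparkTable:
    """Hypothetical wrapper (not part of PySpark or IPython).

    Jupyter calls _repr_html_ on the last expression of a cell, so
    wrapping a Spark DataFrame in this class renders it as an HTML table.
    """

    def __init__(self, df, n=10):
        self.df = df  # anything exposing .columns and .limit(n).collect()
        self.n = n    # number of rows to preview

    def _repr_html_(self):
        columns = list(self.df.columns)
        # limit() first, so only n rows are collected to the driver.
        rows = self.df.limit(self.n).collect()
        head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in rows
        )
        return (
            f"<table><thead><tr>{head}</tr></thead>"
            f"<tbody>{body}</tbody></table>"
        )
```

In a notebook cell you would then end the cell with SparkTable(df) or SparkTable(df, n=20). Cell values are HTML-escaped, so strings containing < or & are displayed literally instead of being interpreted as markup.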

Answer 3

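First recommendation: when you use Jupyter, don't use df.show(); use df.limit(10).toPandas().head() instead. The result is a nicely formatted table, arguably even better than Databricks display().

Second recommendation: Zeppelin Notebook. Just use z.show(df.limit(10))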

Additionally, in Zeppelin:

Register your DataFrame as a SQL table: df.createOrReplaceTempView('tableName')
Insert a new paragraph beginning with %sql, then query your table to get an amazing display.

Answer 4

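When you use Jupyter, use myDF.limit(10).toPandas().head() instead of df.show(). When a DataFrame has many columns, the pandas view truncates them. To avoid that, set the pandas display.max_columns option to None so that every column is shown.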

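# Alternative to the Databricks display() function.
import pandas as pd

# Show every column instead of truncating wide frames.
pd.set_option('display.max_columns', None)

# Pull only the first 10 rows to the driver, then render them with pandas.
myDF.limit(10).toPandas().head()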

Answer 5

Don't convert to a pandas DataFrame. Use this instead; it renders the DataFrame in a proper grid.

from IPython.display import HTML, display

# Stop <pre> output from wrapping so that wide show() tables stay aligned.
display(HTML("<style>pre { white-space: pre !important; }</style>"))

df.show()

apache-spark

jupyter-notebook

databricks