Pandas: How to easily share a sample dataframe using df.to_dict()?

Question

Despite the clear guidance in How do I ask a good question? and How to create a Minimal, Reproducible Example, many askers simply neglect to include a reproducible data sample in their question. So what is a practical and easy way to reproduce a data sample when a simple pd.DataFrame(np.random.random(size=(5, 5))) is not enough? How can you, for example, use df.to_dict() and include the output in a question?

Answer 1

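The answer:

In many situations, an approach based on df.to_dict() will do the job perfectly! Here are the two cases that come to mind:

Case 1: You've got a dataframe built or loaded in Python from a local source

Case 2: You've got a table in another application (like Excel)

The details:

Case 1: You've got a dataframe built or loaded from a local source

Given that you have a pandas dataframe named df, just

run df.to_dict() in your console or editor, and
copy the output that is formatted as a dictionary, and
paste the content into pd.DataFrame(<output>) and include that chunk in your now reproducible code snippet.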
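
The three steps above can be sketched as a round trip. This is only a minimal illustration: the small dataframe and its column names 'a' and 'b' are made up here and stand in for your real data.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe you already have
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 1.5, 2.5]})

# Step 1: print the dictionary form so it can be copied from the console
as_dict = df.to_dict()
print(as_dict)

# Steps 2 and 3: paste the copied output into pd.DataFrame(...)
df_rebuilt = pd.DataFrame({'a': {0: 1, 1: 2, 2: 3},
                           'b': {0: 0.5, 1: 1.5, 2: 2.5}})
```

Anyone who pastes the last statement into their own session should end up with the same values, index and column names as in df.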

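Case 2: You've got a table in another application (like Excel)

Depending on the source and separator, such as (',', ';', '\s+') where the latter means any whitespace, you can simply:

Ctrl+C the contents
run df = pd.read_clipboard(sep=r'\s+') in your console or editor, and
run df.to_dict(), and
include the output in df = pd.DataFrame(<output>)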
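
The clipboard steps are hard to demonstrate in a script, so the sketch below swaps the clipboard for an in-memory string and uses pd.read_csv, which is the parser pd.read_clipboard hands its text to. The table contents (name, score) are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical text, as it might look after copying a few cells from a spreadsheet
copied = """name score
alice 1.5
bob 2.5
"""

# Stand-in for pd.read_clipboard(sep=r'\s+'): same separator, but read from a string
df = pd.read_csv(io.StringIO(copied), sep=r'\s+')

# The dictionary form that would be pasted into the question
as_dict = df.to_dict()
print(as_dict)
```

The raw string r'\s+' avoids Python's invalid escape sequence warning that a plain '\s+' can trigger in recent versions.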

In this case, the start of your question should look something like this:

import pandas as pd
df = pd.DataFrame({0: {0: 0.25474768796402636, 1: 0.5792136563952824, 2: 0.5950396800676201},
                   1: {0: 0.9071073567355232, 1: 0.1657288354283053, 2: 0.4962367707789421},
                   2: {0: 0.7440601352930207, 1: 0.7755487356392468, 2: 0.5230707257648775}})

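Of course, this gets a little clumsy with larger dataframes. But very often, all that anyone trying to answer your question needs is a small sample of your real-world data, so that they can take the structure of your data into consideration.

There are two ways you can handle larger dataframes:

run df.head(20).to_dict() to include only the first 20 rows, and
change the format of your dict using, for example, df.to_dict('split') (there are other options besides 'split') to reshape your output into a dict that requires fewer lines.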
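
As a rough illustration of those other options, the sketch below uses the 'list' and 'records' orientations, both of which can be passed straight back to pd.DataFrame. The dataframe is made up for the example, and note the assumption that the index is the default 0, 1, 2, ... range, since neither format stores the index.

```python
import pandas as pd

# Hypothetical small dataframe with a default integer index
df = pd.DataFrame({'x': [1, 2, 3], 'label': ['a', 'b', 'c']})

# One list per column; the index is not stored
as_list = df.to_dict('list')

# One dict per row; the index is not stored either
as_records = df.to_dict('records')

# Both formats are accepted directly by the DataFrame constructor
from_list = pd.DataFrame(as_list)
from_records = pd.DataFrame(as_records)
```

If your index carries information (dates, labels), these two formats will lose it, which is one reason to prefer 'split' or the default format.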

Here's an example using the iris dataset, which is available from plotly express, among other places.

If you just run:

import plotly.express as px
import pandas as pd
df = px.data.iris()
df.to_dict()

This will produce an output of nearly 1000 lines, which isn't very practical as a reproducible sample. But if you include .head(10), you'll get:

{'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0, 5: 5.4, 6: 4.6, 7: 5.0, 8: 4.4, 9: 4.9},
 'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6, 5: 3.9, 6: 3.4, 7: 3.4, 8: 2.9, 9: 3.1},
 'petal_length': {0: 1.4, 1: 1.4, 2: 1.3, 3: 1.5, 4: 1.4, 5: 1.7, 6: 1.4, 7: 1.5, 8: 1.4, 9: 1.5},
 'petal_width': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.4, 6: 0.3, 7: 0.2, 8: 0.2, 9: 0.1},
 'species': {0: 'setosa', 1: 'setosa', 2: 'setosa', 3: 'setosa', 4: 'setosa', 5: 'setosa', 6: 'setosa', 7: 'setosa', 8: 'setosa', 9: 'setosa'},
 'species_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}

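Now we're getting somewhere. But depending on the structure and content of your data, this may not cover its complexity in a satisfactory manner. You can, however, fit more data on fewer lines by using to_dict('split') like this: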

import plotly.express as px
df = px.data.iris().head(10)
df.to_dict('split')

Now your output will look like this:

{'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'columns': ['sepal_length',
  'sepal_width',
  'petal_length',
  'petal_width',
  'species',
  'species_id'],
 'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
  [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
  [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
  [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
  [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
  [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
  [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
  [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.1, 1.5, 0.1, 'setosa', 1]]}

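Now you can easily increase the number in .head(10) without cluttering your question too much. There is a minor drawback, though: you can no longer pass the output directly to pd.DataFrame. But if you include a few specifications for index, columns, and data, you'll be just fine. So for this particular dataset, my preferred approach would be: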

import pandas as pd

sample = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
             'columns': ['sepal_length',
              'sepal_width',
              'petal_length',
              'petal_width',
              'species',
              'species_id'],
             'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
              [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
              [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
              [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
              [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
              [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
              [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
              [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.1, 1.5, 0.1, 'setosa', 1],
              [5.4, 3.7, 1.5, 0.2, 'setosa', 1],
              [4.8, 3.4, 1.6, 0.2, 'setosa', 1],
              [4.8, 3.0, 1.4, 0.1, 'setosa', 1],
              [4.3, 3.0, 1.1, 0.1, 'setosa', 1],
              [5.8, 4.0, 1.2, 0.2, 'setosa', 1]]}

df = pd.DataFrame(index=sample['index'], columns=sample['columns'], data=sample['data'])
df

And now you'll have this dataframe to work with:

    sepal_length  sepal_width  petal_length  petal_width species  species_id
0            5.1          3.5           1.4          0.2  setosa           1
1            4.9          3.0           1.4          0.2  setosa           1
2            4.7          3.2           1.3          0.2  setosa           1
3            4.6          3.1           1.5          0.2  setosa           1
4            5.0          3.6           1.4          0.2  setosa           1
5            5.4          3.9           1.7          0.4  setosa           1
6            4.6          3.4           1.4          0.3  setosa           1
7            5.0          3.4           1.5          0.2  setosa           1
8            4.4          2.9           1.4          0.2  setosa           1
9            4.9          3.1           1.5          0.1  setosa           1
10           5.4          3.7           1.5          0.2  setosa           1
11           4.8          3.4           1.6          0.2  setosa           1
12           4.8          3.0           1.4          0.1  setosa           1
13           4.3          3.0           1.1          0.1  setosa           1
14           5.8          4.0           1.2          0.2  setosa           1
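
Because the keys of the 'split' format ('index', 'columns', 'data') happen to match keyword arguments of pd.DataFrame, the dictionary can also be unpacked with **. The sketch below uses a shortened version of the sample above; the names df_explicit and df_unpacked are just illustrative.

```python
import pandas as pd

# Shortened version of the iris sample in 'split' format
sample = {'index': [0, 1, 2],
          'columns': ['sepal_length', 'species'],
          'data': [[5.1, 'setosa'], [4.9, 'setosa'], [4.7, 'setosa']]}

# Explicit keyword arguments, as in the approach above
df_explicit = pd.DataFrame(index=sample['index'],
                           columns=sample['columns'],
                           data=sample['data'])

# Same result by unpacking the dictionary into keyword arguments
df_unpacked = pd.DataFrame(**sample)
```

The explicit form is easier to read for newcomers, while the unpacked form is shorter; both should build the same dataframe.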

And this will significantly increase your chances of receiving useful answers!

Edit:

The output of df.to_dict() cannot be read back in when it contains timestamps like 1: Timestamp('2020-01-02 00:00:00') unless you also include from pandas import Timestamp.
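
A minimal sketch of the timestamp case, assuming a made-up dataframe with a 'date' and a 'value' column: the pasted dictionary refers to the bare name Timestamp, so that name has to be imported before the snippet will run.

```python
import pandas as pd
from pandas import Timestamp  # without this line the pasted dict raises NameError

# Hypothetical output of df.to_dict() for a dataframe with a datetime column
df = pd.DataFrame({'date': {0: Timestamp('2020-01-01 00:00:00'),
                            1: Timestamp('2020-01-02 00:00:00')},
                   'value': {0: 1.5, 1: 2.5}})
```

Including the import in the question saves every reader from having to work out why the snippet fails.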
