Dask warning: provide explicit output types

I am using Dask to do the following:

    import dask.dataframe as dd
    import pandas as pd
    
    salary_df = pd.DataFrame({"Salary":[10000, 50000, 25000, 30000, 7000]})
    salary_category = pd.DataFrame({"Hi":[5000, 20000, 25000, 30000, 90000],
                            "Low":[0,  5001, 20001, 25001, 30001],
                            "category":["Very Poor", "Poor", "Medium", "Rich", "Super Rich" ]
                            })
    sal_ddf = dd.from_pandas(salary_df, npartitions=10)
    salary_category.index = pd.IntervalIndex.from_arrays(salary_category['Low'],salary_category['Hi'],closed='both')
    sal_ddf['Category'] = sal_ddf['Salary'].apply(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'])

I do get the result, but the following line produces a warning:

      sal_ddf['Category'] = sal_ddf['Salary'].apply(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'])

    You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
    To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
      Before: .apply(func)
      After:  .apply(func, meta=('Salary', 'object'))

What am I missing here?

The keyword argument missing here is `meta`. Dask generates a suggestion automatically (in the warning message):

    After:  .apply(func, meta=('Salary', 'object'))

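Because this is only a warning, specifying `meta` is optional for many use cases. Without it, Dask runs your function on a small sample to infer the output type. Passing `meta` as a `(name, dtype)` tuple is useful if you want to state the dtype of the computed column explicitly.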

Running the snippet below should not generate the warning:

    # extracted your code into `func` for readability only
    func = lambda x: salary_category.iloc[salary_category.index.get_loc(x)]['category']

    sal_ddf['Category'] = sal_ddf['Salary'].apply(func, meta=('Salary', 'object'))

For more details, the Dask documentation on `meta` may be useful.