Convert a Dask DataFrame column to string

Question

import pandas as pd
import dask.dataframe as dd

data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}

df = pd.DataFrame(data)

df = dd.from_pandas(df, npartitions=1)

I want to convert the calories column to string.

This is what I tried:

df = df.astype({"calories": "string"})

df
Dask DataFrame Structure:
              calories duration
npartitions=1
0               string    int64
2                  ...      ...
Dask Name: astype, 3 tasks

df.set_index("calories")
TypeError: Cannot interpret 'string[python]' as a data type

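Is there a way to pass dtypes for all columns and have them converted to the desired types? For example, I may want to convert many columns to strings, some to dates, and a few to booleans.

I know the column names and their dtypes, and I want Dask to respect them.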


Answer 1

Try using a lambda function.

import pandas as pd                                                 
                                                                    
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}                                                                  
df = pd.DataFrame(data)                                             
df['calories'] = df['calories'].apply(lambda x: str(x))                                                               
print(df)
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)

Output

  calories  duration
0      420        50
1       hi        40
2      390        45
Column  calories is dtype: object
Column  duration is dtype: int64

Edit: if you want to convert all columns of the data frame to strings, you can use applymap.

import pandas as pd                                                 
                                                                    
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}                                                                  
df = pd.DataFrame(data)                                             
df = df.applymap(str)
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)

Output

Column  calories is dtype: object
Column  duration is dtype: object

Or use a lambda with applymap:

import pandas as pd                                                 
                                                                
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}                                                                  
df = pd.DataFrame(data)                                             
df = df.applymap(lambda x: str(x))  # convert every cell to its string form
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)

Output

Column  calories is dtype: object
Column  duration is dtype: object
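
As an alternative to the element-wise calls above, pandas can cast a whole frame or selected columns with astype. This is a minimal sketch that reuses the sample data from this answer; the names all_str and one_col are only illustrative. Note that applymap is deprecated since pandas 2.1 in favor of DataFrame.map, so astype also sidesteps that warning.

```python
import pandas as pd

data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# Cast every column to str in one call
all_str = df.astype(str)

# Cast only the listed columns; the others keep their dtype
one_col = df.astype({"calories": str})

print(all_str["duration"].tolist())
print(one_col.dtypes)
```

This only covers plain pandas; the Dask-specific set_index error from the question is a separate issue.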

Answer 2

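It seems the error occurs when you call set_index: while setting the new partition divisions, Dask cannot interpret "string" as a valid data type. Instead, you can use str, e.g. ddf = ddf.astype({"calories": str}). Here is a complete reproducible snippet: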

import pandas as pd
import dask.dataframe as dd
                                                                    
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45], "other_col": range(3)}                               
df = pd.DataFrame(data)                                          
ddf = dd.from_pandas(df, npartitions=2)

ddf = ddf.astype({"calories": str}).set_index('calories')
ddf.compute()
