Convert a Dask DataFrame column to string
import pandas as pd
import dask.dataframe as dd  # needed for dd.from_pandas below
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=1)
I want to convert the calories column to string.
Here is what I tried:
df = df.astype({"calories": "string"})
df
Dask DataFrame Structure:
              calories duration
npartitions=1
0               string    int64
2                  ...      ...
Dask Name: astype, 3 tasks
df.set_index("calories")
TypeError: Cannot interpret 'string[python]' as a data type
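Is there a way to pass a data type for every column and have each one converted to the desired type?
For example, I want to convert many columns to string, some to date, and a few to boolean.
I know the column names and their data types,
and I want Dask to respect them.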
Try using a lambda function.
import pandas as pd
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
df['calories'] = df['calories'].apply(lambda x: str(x))
print(df)
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)
Output
  calories  duration
0      420        50
1       hi        40
2      390        45
Column calories is dtype: object
Column duration is dtype: int64
Edit
If you want to convert all columns of the data frame to string, you can use applymap.
import pandas as pd
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
# Note: applymap is deprecated since pandas 2.1 in favour of DataFrame.map
df = df.applymap(str)
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)
Output
Column calories is dtype: object
Column duration is dtype: object
Or use a lambda with applymap:
import pandas as pd
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
# Convert every cell to its string representation
df = df.applymap(lambda x: str(x))
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)
Output
Column calories is dtype: object
Column duration is dtype: object
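As a possible alternative to the applymap calls above, pandas' astype can do the same conversion without a per-cell Python function. The sketch below reuses the sample data from this answer; the names df_str and df_one are only for illustration. The dtype name it prints depends on the pandas version, so no output is shown.

```python
import pandas as pd

data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# astype(str) converts every value in every column to its string form.
df_str = df.astype(str)

# Passing a dict converts only the listed columns and leaves the rest alone.
df_one = df.astype({"calories": str})

for column in df_str.columns:
    print("Column", column, "is dtype:", df_str[column].dtype.name)
```

The dict form is the same one the question passes to Dask's astype, so it carries over when only some columns need converting.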
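It seems the error occurs when you call set_index:
while setting the new partition divisions, Dask fails to recognize "string"
as a valid data type. Instead, you can use str,
for example ddf = ddf.astype({"calories": str}).
Here is a complete reproducible snippet: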
import pandas as pd
import dask.dataframe as dd
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45], "other_col": range(3)}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=2)
ddf = ddf.astype({"calories": str}).set_index('calories')
ddf.compute()