Dask from_array converts types to object

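I have the following code, which creates a dask DataFrame from an array. The problem is that every column's type is converted to object. I tried to specify metadata but could not find a way to do it. How do I specify metadata in from_array?
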
b = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])
ddf = dd.from_array(b, columns=['col1', 'col2', 'col3', 'date1'], meta=['float', 'float', 'float', 'datetime'])

This throws AttributeError: 'list' object has no attribute '_constructor'

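Take a look at your b array. Its rows mix floats, ints and datetime objects, so numpy stores everything as dtype=object and there are no per-column types left for dask to pick up: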

In [61]: from datetime import datetime
In [62]: b = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001
    ...: , 2, 2))])
In [63]: b
Out[63]: 
array([[1.5, 2, 3, datetime.datetime(2000, 1, 1, 0, 0)],
       [4, 5, 6, datetime.datetime(2001, 2, 2, 0, 0)]], dtype=object)

In [93]: pd.DataFrame(b.tolist())
Out[93]: 
     0  1  2          3
0  1.5  2  3 2000-01-01
1  4.0  5  6 2001-02-02
In [94]: _.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   0       2 non-null      float64       
 1   1       2 non-null      int64         
 2   2       2 non-null      int64         
 3   3       2 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 192.0 bytes

In [95]: b1 = np.array([(1.5, 2, 3, np.datetime64(datetime(2000,1,1))), (4, 5,
    ...: 6, np.datetime64(datetime(2001, 2, 2)))], dtype=([('col1','float32'),(
    ...: 'col2','float32'), ('col3','float32'), ('date1','<M8[us]') ]))
In [96]: pd.DataFrame(b1)
Out[96]: 
   col1  col2  col3      date1
0   1.5   2.0   3.0 2000-01-01
1   4.0   5.0   6.0 2001-02-02
In [97]: _.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   col1    2 non-null      float32       
 1   col2    2 non-null      float32       
 2   col3    2 non-null      float32       
 3   date1   2 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float32(3)
memory usage: 168.0 bytes
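
To summarize what the sessions above show, here is a minimal numpy-only sketch that builds both arrays and checks their dtypes. The variable names (`rows`, `plain`, `structured`) are mine, not from the question; the field names and dtypes are copied from `b1`.

```python
from datetime import datetime

import numpy as np

rows = [
    (1.5, 2, 3, datetime(2000, 1, 1)),
    (4, 5, 6, datetime(2001, 2, 2)),
]

# No dtype given: numpy has to fit floats, ints and datetimes into one
# element type, so it falls back to object (as in Out[63]).
plain = np.array(rows)
print(plain.dtype, plain.shape)

# Structured dtype: one named field per column, each with its own type.
dtype = np.dtype([
    ("col1", "float32"),
    ("col2", "float32"),
    ("col3", "float32"),
    ("date1", "datetime64[us]"),
])
structured = np.array(
    [(a, b, c, np.datetime64(d)) for a, b, c, d in rows],
    dtype=dtype,
)
print(structured.dtype)
print(structured["date1"].dtype)
```

Note that the structured array is one-dimensional (one record per row), while the object array is a 2x4 grid of Python objects.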

You can define the numpy array as a structured array:

import numpy as np
import pandas as pd
import dask.dataframe as dd
from datetime import datetime

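# One (name, dtype) pair per column; dask reads the column dtypes from these fields.
dtype = [('col1', 'float32'), ('col2', 'float32'), ('col3', 'float32'), ('date1', '<M8[us]')]
b = np.array(
    [(1.5, 2, 3, np.datetime64(datetime(2000, 1, 1))),
     (4, 5, 6, np.datetime64(datetime(2001, 2, 2)))],
    dtype=dtype,
)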
ddf = dd.from_array(b, columns=['col1', 'col2', 'col3', 'date1'])

ddf.head()