Dask from_array converts types to object
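I have the following code that creates a Dask DataFrame from an array. The problem is that all the types are converted to object. I tried to specify the metadata but could not find a way to do it. How do I specify metadata in from_array?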
b = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])
ddf = dd.from_array(b, columns=['col1', 'col2', 'col3', 'date1'], meta=['float', 'float', 'float', 'datetime'])
This throws AttributeError: 'list' object has no attribute '_constructor'
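The error message hints that `meta` is expected to be a DataFrame-like object rather than a list of dtype names: a pandas DataFrame exposes a `_constructor` attribute, while a list does not. The sketch below only illustrates that difference with pandas. The names `bad_meta` and `empty_meta` are hypothetical, and whether `from_array` takes the column dtypes from such a `meta` is not confirmed anywhere in this thread.

```python
import pandas as pd

# The value passed as meta in the question: a plain list of dtype names.
bad_meta = ['float', 'float', 'float', 'datetime']

# An empty pandas DataFrame that carries column names and dtypes.
empty_meta = pd.DataFrame({
    'col1': pd.Series(dtype='float64'),
    'col2': pd.Series(dtype='float64'),
    'col3': pd.Series(dtype='float64'),
    'date1': pd.Series(dtype='datetime64[ns]'),
})

print(hasattr(bad_meta, '_constructor'))    # a list has no such attribute
print(hasattr(empty_meta, '_constructor'))  # a DataFrame does
print(empty_meta.dtypes)
```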
Look at your b array:
In [61]: from datetime import datetime
In [62]: b = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])
In [63]: b
Out[63]:
array([[1.5, 2, 3, datetime.datetime(2000, 1, 1, 0, 0)],
       [4, 5, 6, datetime.datetime(2001, 2, 2, 0, 0)]], dtype=object)
In [93]: pd.DataFrame(b.tolist())
Out[93]:
     0  1  2          3
0  1.5  2  3 2000-01-01
1  4.0  5  6 2001-02-02
In [94]: _.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       2 non-null      float64
 1   1       2 non-null      int64
 2   2       2 non-null      int64
 3   3       2 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 192.0 bytes
In [95]: b1 = np.array([(1.5, 2, 3, np.datetime64(datetime(2000,1,1))),
    ...:                (4, 5, 6, np.datetime64(datetime(2001, 2, 2)))],
    ...:               dtype=[('col1','float32'), ('col2','float32'),
    ...:                      ('col3','float32'), ('date1','<M8[us]')])
In [96]: pd.DataFrame(b1)
Out[96]:
   col1  col2  col3      date1
0   1.5   2.0   3.0 2000-01-01
1   4.0   5.0   6.0 2001-02-02
In [97]: _.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    2 non-null      float32
 1   col2    2 non-null      float32
 2   col3    2 non-null      float32
 3   date1   2 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float32(3)
memory usage: 168.0 bytes
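The session above comes down to the array's dtype: mixing floats, ints, and datetime objects in one plain array yields a single object dtype for everything, while a structured dtype keeps one type per named field. A minimal sketch using only NumPy follows; the variable names `plain` and `structured` are illustrative and not from the original answer.

```python
from datetime import datetime
import numpy as np

# Plain array built from mixed Python values, as in the question.
plain = np.array([(1.5, 2, 3, datetime(2000, 1, 1)),
                  (4, 5, 6, datetime(2001, 2, 2))])
print(plain.dtype, plain.shape)  # one object dtype shared by all columns

# Structured array: each named field carries its own dtype.
fields = [('col1', 'float32'), ('col2', 'float32'),
          ('col3', 'float32'), ('date1', '<M8[us]')]
structured = np.array([(1.5, 2, 3, np.datetime64(datetime(2000, 1, 1))),
                       (4, 5, 6, np.datetime64(datetime(2001, 2, 2)))],
                      dtype=fields)
print(structured.dtype.names)
print(structured['col1'].dtype, structured['date1'].dtype)
```

Note that the structured array is one-dimensional (one record per row), with columns addressed by field name instead of by a second axis.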
You can specify the numpy array as a structured array:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from datetime import datetime
# Structured dtype: one (name, type) pair per column
dtype = [('col1', 'float32'), ('col2', 'float32'), ('col3', 'float32'), ('date1', '<M8[us]')]
b = np.array([(1.5, 2, 3, np.datetime64(datetime(2000, 1, 1))),
              (4, 5, 6, np.datetime64(datetime(2001, 2, 2)))], dtype=dtype)
ddf = dd.from_array(b, columns=['col1', 'col2', 'col3', 'date1'])
ddf.head()