Create Structured Numpy Array with Different Types
I have the following unstructured data (read from a csv).
data = [[b'id' b'datetime' b'anomaly_length' b'affected_sensors' b'reason']
[b'1' b'2019-12-20 08:09' b'26' b'all' b'Open Windows']
[b'1' b'2019-12-20 08:10' b'26' b'all' b'Open Windows']
[b'1' b'2019-12-20 08:11' b'26' b'all' b'Open Windows']
[b'1' b'2019-12-20 08:12' b'26' b'all' b'Open Windows']
[b'1' b'2019-12-20 08:13' b'26' b'all' b'Open Windows']
[b'1' b'2019-12-20 08:14' b'26' b'all' b'Open Windows']
[b'1' b'2019-12-20 08:15' b'26' b'all' b'Open Windows']
[b'1' b'2019-12-20 08:16' b'26' b'all' b'Open Windows']
[b'1' b'2019-12-20 08:17' b'26' b'all' b'Open Windows']]
...
I currently use the following code to create the structured array:
labels_id = np.array(data[1:,0], dtype=int)
labels = [dt.datetime.strptime(date.decode("utf-8"), '%Y-%m-%d %H:%M') for date in np.array(data[1:,1])]
labels_length = np.array(data[1:,2], dtype=int)
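This code is necessary because I need the data with the correct data types. I pass all of the arrays to a function and access them by index. I don't like this solution, but because the function is called many times, I don't want to convert the data inside the function on every call.
Original function code:
def special_find(labels_id, labels, labels_length):
    for i, id in enumerate(labels_id):
        print(id)
        print(labels[i])
        print(labels_length[i])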
...
Expected: I would like a single structured array that contains only the required columns:
structured_data = [[1 datetime.datetime(2019, 12, 20, 8, 9) 26],
[1 datetime.datetime(2019, 12, 20, 8, 10) 26],
[1 datetime.datetime(2019, 12, 20, 8, 11) 26],
[1 datetime.datetime(2019, 12, 20, 8, 12) 26],
[1 datetime.datetime(2019, 12, 20, 8, 13) 26],
[1 datetime.datetime(2019, 12, 20, 8, 14) 26],
...
I know I could concatenate all of the created arrays, but I don't think that is a good solution. Instead, I am looking for something like this:
structured_data = np.array(data[1:, 0:3], dtype=...)
Update: here are some values from the csv file
id,datetime,anomaly_length,affected_sensors,reason
1,2019-12-20 08:09,26,all,Open Windows
1,2019-12-20 08:10,26,all,Open Windows
1,2019-12-20 08:11,26,all,Open Windows
1,2019-12-20 08:12,26,all,Open Windows
1,2019-12-20 08:13,26,all,Open Windows
1,2019-12-20 08:14,26,all,Open Windows
1,2019-12-20 08:15,26,all,Open Windows
1,2019-12-20 08:16,26,all,Open Windows
1,2019-12-20 08:17,26,all,Open Windows
Since you have already converted the columns to NumPy arrays of the correct data types, it is easy to create a Pandas DataFrame from them, for example:
import pandas as pd
df = pd.DataFrame({
    'id': labels_id,
    'datetime': labels,
    'anomaly_length': labels_length
})
>>> df
id datetime anomaly_length
0 1 2019-12-20 08:09:00 26
1 1 2019-12-20 08:10:00 26
2 1 2019-12-20 08:11:00 26
3 1 2019-12-20 08:12:00 26
4 1 2019-12-20 08:13:00 26
5 1 2019-12-20 08:14:00 26
6 1 2019-12-20 08:15:00 26
7 1 2019-12-20 08:16:00 26
8 1 2019-12-20 08:17:00 26
The Pandas docs give a good introduction to working with these objects.
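With a DataFrame, the function no longer needs three parallel arrays. Here is a minimal sketch, assuming a reworked `special_find` that takes the frame and returns the rows instead of printing them (that signature is my assumption, not the original function):

```python
import datetime as dt

import pandas as pd

# Two sample rows, already converted to the right Python types.
df = pd.DataFrame({
    'id': [1, 1],
    'datetime': [dt.datetime(2019, 12, 20, 8, 9),
                 dt.datetime(2019, 12, 20, 8, 10)],
    'anomaly_length': [26, 26],
})


def special_find(frame):
    """Hypothetical variant: walk the rows once, reading fields by name."""
    rows = []
    for row in frame.itertuples(index=False):
        rows.append((row.id, row.datetime, row.anomaly_length))
    return rows


rows = special_find(df)
```

The conversion happens once when the frame is built, so calling the function repeatedly does not repeat it.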
I tried to recreate your csv file:
In [23]: cat stack59665655.txt
id, datetime, anomaly_length, affected_sensors, reason
1, 2019-12-20 08:09, 26, all, Open Windows
1, 2019-12-20 08:10, 26, all, Open Windows
1, 2019-12-20 08:11, 26, all, Open Windows
With pandas I can read it:
In [24]: data = pd.read_csv('stack59665655.txt')
In [25]: data
Out[25]:
id datetime anomaly_length affected_sensors reason
0 1 2019-12-20 08:09 26 all Open Windows
1 1 2019-12-20 08:10 26 all Open Windows
2 1 2019-12-20 08:11 26 all Open Windows
In [26]: data.dtypes
Out[26]:
id int64
datetime object
anomaly_length int64
affected_sensors object
reason object
dtype: object
The object columns contain strings. I suspect pandas has a way to convert the datetime string column to datetime objects or np.datetime64.
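pandas does offer that conversion. A small sketch, assuming the csv text from the question's update is held in a string (`csv_text` is just an illustrative name); it shows `pd.to_datetime` on an existing column and the `parse_dates` argument of `read_csv`:

```python
import io

import pandas as pd

# Sample rows copied from the question's csv excerpt.
csv_text = """id,datetime,anomaly_length,affected_sensors,reason
1,2019-12-20 08:09,26,all,Open Windows
1,2019-12-20 08:10,26,all,Open Windows
"""

# Option 1: read the column as strings, then convert it.
df = pd.read_csv(io.StringIO(csv_text))
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M')

# Option 2: ask read_csv to parse the column while reading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['datetime'])
```

Either way the column ends up with a datetime64 dtype instead of object.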
A simple conversion to an array produces an object dtype array:
In [27]: data.to_numpy()
Out[27]:
array([[1, ' 2019-12-20 08:09', 26, ' all', ' Open Windows'],
[1, ' 2019-12-20 08:10', 26, ' all', ' Open Windows'],
[1, ' 2019-12-20 08:11', 26, ' all', ' Open Windows']],
dtype=object)
to_records produces a record array, which is a variant of a structured array. Note the compound dtype:
In [28]: data.to_records()
Out[28]:
rec.array([(0, 1, ' 2019-12-20 08:09', 26, ' all', ' Open Windows'),
(1, 1, ' 2019-12-20 08:10', 26, ' all', ' Open Windows'),
(2, 1, ' 2019-12-20 08:11', 26, ' all', ' Open Windows')],
dtype=[('index', '<i8'), ('id', '<i8'), (' datetime', 'O'), (' anomaly_length', '<i8'), (' affected_sensors', 'O'), (' reason', 'O')])
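If the extra `index` field is not wanted, `to_records` accepts `index=False`. A short sketch on a hand-built frame (the sample values are taken from the question; the frame itself is my own construction):

```python
import pandas as pd

# Small frame with the three columns the question cares about.
df = pd.DataFrame({
    'id': [1, 1],
    'datetime': pd.to_datetime(['2019-12-20 08:09', '2019-12-20 08:10']),
    'anomaly_length': [26, 26],
})

# index=False leaves the DataFrame index out of the record array.
rec = df.to_records(index=False)
```

Fields are then available by name, e.g. `rec['datetime']`.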
Alternatively, use genfromtxt with its automatic dtype mode:
In [29]: data1 = np.genfromtxt('stack59665655.txt', dtype=None, names=True, delimiter=',', encoding=None, autostrip=True)
In [30]: data1
Out[30]:
array([(1, '2019-12-20 08:09', 26, 'all', 'Open Windows'),
(1, '2019-12-20 08:10', 26, 'all', 'Open Windows'),
(1, '2019-12-20 08:11', 26, 'all', 'Open Windows')],
dtype=[('id', '<i8'), ('datetime', '<U16'), ('anomaly_length', '<i8'), ('affected_sensors', '<U3'), ('reason', '<U12')])
I can convert the datetime field with:
In [31]: data1['datetime']
Out[31]:
array(['2019-12-20 08:09', '2019-12-20 08:10', '2019-12-20 08:11'],
dtype='<U16')
In [32]: data1['datetime'].astype('datetime64[m]')
Out[32]:
array(['2019-12-20T08:09', '2019-12-20T08:10', '2019-12-20T08:11'],
dtype='datetime64[m]')
Changing this in place would actually require defining a new dtype.
Or I can construct a custom dtype, for example by modifying the one deduced for data1:
In [45]: dt = data1.dtype.descr
In [46]: dt[1]=('datetime', 'datetime64[m]')
In [47]: dt= np.dtype(dt)
In [48]: dt
Out[48]: dtype([('id', '<i8'), ('datetime', '<M8[m]'), ('anomaly_length', '<i8'), ('affected_sensors', '<U3'), ('reason', '<U12')])
In [49]: data2 = np.genfromtxt('stack59665655.txt', dtype=dt, names=True, delimiter=',', encoding=None, autostrip=True)
In [50]: data2
Out[50]:
array([(1, '2019-12-20T08:09', 26, 'all', 'Open Windows'),
(1, '2019-12-20T08:10', 26, 'all', 'Open Windows'),
(1, '2019-12-20T08:11', 26, 'all', 'Open Windows')],
dtype=[('id', '<i8'), ('datetime', '<M8[m]'), ('anomaly_length', '<i8'), ('affected_sensors', '<U3'), ('reason', '<U12')])
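The same idea works if you already hold the 2-D bytes array from the question and only want the first three columns. This is a sketch under my own assumptions: the names `raw`, `wanted`, and `structured` are illustrative, and I use `datetime64[m]` rather than `datetime.datetime` objects:

```python
import numpy as np

# A cut-down copy of the question's bytes array (header row plus data).
raw = np.array([
    [b'id', b'datetime', b'anomaly_length', b'affected_sensors', b'reason'],
    [b'1', b'2019-12-20 08:09', b'26', b'all', b'Open Windows'],
    [b'1', b'2019-12-20 08:10', b'26', b'all', b'Open Windows'],
])

# Compound dtype holding only the required columns.
wanted = np.dtype([('id', 'i8'),
                   ('datetime', 'datetime64[m]'),
                   ('anomaly_length', 'i8')])

body = raw[1:]  # skip the header row
structured = np.empty(len(body), dtype=wanted)

# Fill one field at a time, converting each column to its target type.
structured['id'] = body[:, 0].astype('i8')
structured['datetime'] = body[:, 1].astype(str).astype('datetime64[m]')
structured['anomaly_length'] = body[:, 2].astype('i8')
```

The function can then take this single array and read `structured['id'][i]`, `structured['datetime'][i]`, and so on.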
To get datetime objects, I would have to use a converter in genfromtxt.
I combined read_csv from pandas with converters:
import pandas as pd
import datetime as dt
filename = './data.csv'
to_date = lambda value: (dt.datetime.strptime(value, '%Y-%m-%d %H:%M'))
values = pd.read_csv(filename, converters={'datetime': to_date})
print(values.dtypes)
>>> OUTPUT:
>>> id int64
>>> datetime datetime64[ns]
>>> anomaly_length int64
>>> affected_sensors object
>>> reason object
>>> dtype: object