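Custom transformer that splits dates into new columns
I am following the sklearn_pandas walkthrough in the sklearn_pandas README on GitHub and am trying to modify the DateEncoder() custom transformer example to do two additional things:
- Convert string-type columns to datetime, taking the date format as a parameter
- Append the original column name when producing the new columns. For example, if the input column is Date1, the outputs are Date1_year, Date1_month, and Date1_day.
Here is my attempt (I have a fairly basic understanding of sklearn pipelines):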
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn_pandas import DataFrameMapper

class DateEncoder(TransformerMixin):
    '''
    Specify date format using python strftime formats
    '''
    def __init__(self, date_format='%Y-%m-%d'):
        self.date_format = date_format

    def fit(self, X, y=None):
        self.dt = pd.to_datetime(X, format=self.date_format)
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)

data = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23',
                                '2004-07-20','2007-12-21','2006-12-22','2003-04-23'],
                     'dates2': ['2012-12-20','2009-10-21','2016-08-22','2017-08-23',
                                '2014-07-20','2011-12-21','2014-12-22','2015-04-23']})
DATE_COLS = ['dates1', 'dates2']
Mapper = DataFrameMapper([(i, DateEncoder(date_format='%Y-%m-%d')) for i in DATE_COLS], input_df=True, df_out=True)
test = Mapper.fit_transform(data)
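But when I run it, I get the following error:
AttributeError: Can only use .dt accessor with datetimelike values
Why am I getting this error, and how do I fix it?
Any help with renaming the new columns after the original column (Date1_year, Date1_month, Date1_day) would also be much appreciated!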
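A likely reading of the traceback, judging from the code above: fit() stores the parsed dates on self.dt, but transform() calls .dt on X itself, which still holds unparsed strings. The pandas-only sketch below reproduces that situation on a small, hypothetical sample (the variable names are illustrative, not part of the original code) and shows that parsing before using .dt avoids the error.

```python
import pandas as pd

# Hypothetical sample mirroring one column of the question's data.
raw = pd.Series(['2001-12-20', '2002-10-21', '2003-08-22'], name='dates1')

# The values are still strings, so pandas rejects the .dt accessor.
try:
    raw.dt.year
    accessor_failed = False
except AttributeError:
    accessor_failed = True

# Parsing first yields datetime values, after which .dt is available.
parsed = pd.to_datetime(raw, format='%Y-%m-%d')
parts = pd.concat([parsed.dt.year, parsed.dt.month, parsed.dt.day], axis=1)
```

In a transformer, this suggests doing the pd.to_datetime call inside transform() (or in an earlier step) rather than only in fit().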
I was able to split the date format conversion and the date splitter into two separate transformers, and that worked.
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper

data2 = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23',
                                 '2004-07-20','2007-12-21','2006-12-22','2003-04-23'],
                      'dates2': ['2012-12-20','2009-10-21','2016-08-22','2017-08-23',
                                 '2014-07-20','2011-12-21','2014-12-22','2015-04-23']})

class DateFormatter(TransformerMixin):
    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xdate = X.apply(pd.to_datetime)
        return Xdate

class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)

DATE_COLS = ['dates1', 'dates2']
datemult = DataFrameMapper(
    [(i, [DateFormatter(), DateEncoder()]) for i in DATE_COLS],
    input_df=True, df_out=True)
df = datemult.fit_transform(data2)
This code outputs:
Out[4]:
dates1_0 dates1_1 dates1_2 dates2_0 dates2_1 dates2_2
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
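However, I am still looking for a way to rename the new columns while applying the DateEncoder() transformer, e.g. dates_1_0 --> dates_1_year and dates_2_2 --> dates_2_month. I would be happy to select that as the solution.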
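As a stopgap, the positional names can also be renamed after the mapper has run. The sketch below is pandas-only and assumes that the suffixes 0, 1 and 2 correspond to year, month and day, following the concat order in DateEncoder.transform; rename_date_parts and PART_NAMES are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical frame shaped like the mapper output shown above.
df = pd.DataFrame({'dates1_0': [2001, 2002],
                   'dates1_1': [12, 10],
                   'dates1_2': [20, 21]})

# Assumed meaning of each positional suffix (year, month, day order).
PART_NAMES = {'0': 'year', '1': 'month', '2': 'day'}

def rename_date_parts(columns):
    """Build a mapping such as {'dates1_0': 'dates1_year'}."""
    renamed = {}
    for name in columns:
        base, _, suffix = name.rpartition('_')
        if base and suffix in PART_NAMES:
            renamed[name] = base + '_' + PART_NAMES[suffix]
    return renamed

df = df.rename(columns=rename_date_parts(df.columns))
```

Columns whose suffix is not in PART_NAMES are left untouched, so the helper can be applied to a frame that also contains non-date columns.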
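I know this is late, but in case you are still interested in a way to do this while renaming the columns with a custom transformer...
I used the approach of adding a get_feature_names method to the custom transformer, placed inside a pipeline with ColumnTransformer (overview). You can then access the pipeline's step to reach get_feature_names and, through it, column_names, which ultimately holds the custom column names to be used. This way you can retrieve the column names, similar to the approach in this SO post.
I had to run this inside a pipeline, because when I tried to do it with a standalone custom transformer it went badly wrong (so I won't post that incomplete attempt here), though you may have better luck.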
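For readers who want to try the standalone route anyway, here is a minimal sketch, not the attempt mentioned above. It assumes the input is a DataFrame of date strings in a single known format; NamedDateEncoder is a hypothetical class name, and it returns a DataFrame that already carries the desired column names.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class NamedDateEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical standalone variant: parse strings, then name the parts."""

    def __init__(self, date_format='%Y-%m-%d'):
        self.date_format = date_format

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        parts = {}
        for column in X.columns:
            # Parse inside transform so .dt is applied to datetime values.
            dt = pd.to_datetime(X[column], format=self.date_format).dt
            parts[column + '_year'] = dt.year
            parts[column + '_month'] = dt.month
            parts[column + '_day'] = dt.day
        return pd.DataFrame(parts, index=X.index)

sample = pd.DataFrame({'dates1': ['2001-12-20', '2002-10-21'],
                       'dates2': ['2012-12-20', '2009-10-21']})
encoded = NamedDateEncoder().fit_transform(sample)
```

Because the names live on the returned DataFrame, they are lost as soon as a downstream step converts the result to a NumPy array, which is one reason the pipeline approach reads them back through get_feature_names instead.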
Here is the raw code showing the pipeline:
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data2 = pd.DataFrame(
    {"dates1": ["2001-12-20", "2002-10-21", "2003-08-22", "2004-08-23",
                "2004-07-20", "2007-12-21", "2006-12-22", "2003-04-23"],
     "dates2": ["2012-12-20", "2009-10-21", "2016-08-22", "2017-08-23",
                "2014-07-20", "2011-12-21", "2014-12-22", "2015-04-23"]})
DATE_COLS = ['dates1', 'dates2']

pipeline = Pipeline([
    ('transform', ColumnTransformer([
        ('datetimes', Pipeline([
            ('formatter', DateFormatter()), ('encoder', DateEncoder()),
        ]), DATE_COLS),
    ])),
])

data3 = pd.DataFrame(pipeline.fit_transform(data2))
data3_names = (
    pipeline.named_steps['transform']
    .named_transformers_['datetimes']
    .named_steps['encoder']
    .get_feature_names()
)
data3.columns = data3_names

print(data2)
print(data3)
The output is:
dates1 dates2
0 2001-12-20 2012-12-20
1 2002-10-21 2009-10-21
2 2003-08-22 2016-08-22
3 2004-08-23 2017-08-23
4 2004-07-20 2014-07-20
5 2007-12-21 2011-12-21
6 2006-12-22 2014-12-22
7 2003-04-23 2015-04-23
dates1_year dates1_month dates1_day dates2_year dates2_month dates2_day
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
The custom transformer is here (skipping DateFormatter, since it is identical to yours):
class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dfs = []
        self.column_names = []
        for column in X:
            dt = X[column].dt
            # Assign custom column names
            newcolumnnames = [column+'_'+col for col in ['year', 'month', 'day']]
            df_dt = pd.concat([dt.year, dt.month, dt.day], axis=1)
            # Append DF to list to assemble list of DFs
            dfs.append(df_dt)
            # Append single DF's column names to blank list
            self.column_names.append(newcolumnnames)
        # Horizontally concatenate list of DFs
        dfs_dt = pd.concat(dfs, axis=1)
        return dfs_dt

    def get_feature_names(self):
        # Flatten list of column names without overwriting self.column_names,
        # so that calling this method more than once gives the same result
        return [c for sublist in self.column_names for c in sublist]
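Reasoning behind DateEncoder
The loop over the pandas columns allows the datetime attributes to be extracted from each datetime column. Within the same loop, the custom column names are constructed. These are then appended to the initially blank list self.column_names, which is returned by the get_feature_names method (though it must be flattened before being assigned to the dataframe).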
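The flattening step exists only because each column's names are appended as a sub-list. The plain-Python sketch below, using hypothetical variable names, contrasts that with extend, which keeps the list flat from the start; either way the final list of names is the same.

```python
columns = ['dates1', 'dates2']
parts = ['year', 'month', 'day']

nested = []  # one sub-list per input column, as in the transformer above
flat = []    # alternative: kept flat while looping

for column in columns:
    names = [column + '_' + part for part in parts]
    nested.append(names)  # adds the whole list as a single element
    flat.extend(names)    # adds each name individually

# Flattening the nested list gives the same names as the flat list.
flattened = [name for sublist in nested for name in sublist]
```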
For this particular case, you could potentially skip sklearn_pandas.
Details
sklearn = 0.20.0
pandas = 0.23.4
numpy = 1.15.2
python = 2.7.15rc1