SkFlow: Inputting numerical and text data into the model
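I am in the early stages of learning SkFlow/TensorFlow, so I will lay out my understanding of what I am trying to do, even though it may be incorrect.
Suppose I am trying to build a model that predicts whether a car will pass an emissions test.
My training and test CSVs might look something like this: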
make, fuel, year, mileage, days since service, passed test
vw, diesel, 2015, 10000, 20, 0
honda, petrol, 2008, 1000000, 234, 1
So the pass/fail column is y, and the other columns are x.
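The x/y split described above can be sketched with nothing but Python's standard library. This is only an illustration: it assumes the header row shown in the sample is present in the file, and the names `SAMPLE_CSV` and `split_features_and_label` are made up for this example.

```python
import csv
import io

# The sample rows from the question, inlined so the sketch runs without a file.
SAMPLE_CSV = """make, fuel, year, mileage, days since service, passed test
vw, diesel, 2015, 10000, 20, 0
honda, petrol, 2008, 1000000, 234, 1
"""


def split_features_and_label(text, label_column="passed test"):
    """Return (x, y): x is a list of feature dicts, y is a list of int labels."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    x, y = [], []
    for row in reader:
        y.append(int(row.pop(label_column)))  # the pass/fail column becomes y
        x.append(row)                         # every other column stays in x
    return x, y


x, y = split_features_and_label(SAMPLE_CSV)
print(x[0]["make"], x[0]["fuel"], y)
```

Every feature value is still a string at this point; deciding which ones to convert to numbers and which to treat as categories is the part the rest of the question is about.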
So far, with help from Baltimore on my previous question, I have been able to process the Iris dataset from a CSV file. However, that dataset is entirely numeric.
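This example on the TensorFlow website shows a model built on census data that uses both categorical and continuous data. I am trying to use SkFlow because I understand it simplifies the process.
Anyway, on to my code: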
x_train = genfromtxt('/Users/ben/Desktop/data.csv', dtype=None, delimiter=',' , usecols=(0, 1, 2, 3,4))
y_train = genfromtxt('/Users/ben/Desktop/data.csv', dtype='int', delimiter=',', usecols = (5))
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=1)]
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=2,
                                            model_dir="./tmp/model1")
# Fit model. Add your train data here
classifier.fit(x=x_train,y=y_train,steps=2000)
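So I have read my CSV data accurately into my x_train and y_train objects. The CSV has no headers, but I can add them if needed.
I believe I need to define which columns hold which kind of data, for example: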
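It may help to look at what genfromtxt actually returns for mixed text and numeric columns. The sketch below assumes the two sample rows stand in for the real file and adds `autostrip` and `encoding` arguments that are not in the original call. With `dtype=None` and columns of different types, NumPy builds a one-dimensional structured array with one record per row, not a rows-by-columns numeric matrix.

```python
import io

import numpy as np

# Two header-less rows in the layout described in the question.
RAW = "vw, diesel, 2015, 10000, 20, 0\nhonda, petrol, 2008, 1000000, 234, 1\n"

# dtype=None asks NumPy to infer a type per column; autostrip removes the
# blank after each comma; encoding="utf-8" yields str values rather than bytes.
x_train = np.genfromtxt(io.StringIO(RAW), dtype=None, delimiter=",",
                        usecols=(0, 1, 2, 3, 4), autostrip=True,
                        encoding="utf-8")
y_train = np.genfromtxt(io.StringIO(RAW), dtype=int, delimiter=",",
                        usecols=(5,), autostrip=True)

print(x_train.shape)        # one record per row
print(x_train.dtype.names)  # auto-generated field names, one per column
print(x_train["f2"])        # a single column, selected by field name
print(y_train)
```

Because each column has to be pulled out by field name, a single unnamed real_valued_column probably cannot describe this array, which is one reason per-column feature definitions are needed.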
make = tf.contrib.layers.sparse_column_with_hash_bucket("make", hash_bucket_size=1000)
fuel = tf.contrib.layers.sparse_column_with_keys(column_name="fuel", keys=["diesel", "petrol"])
How do I build the feature_columns object that is passed to the classifier?
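As background on the two column types named in the question, here is a conceptual sketch in plain Python of the difference between a fixed key list and a hash bucket. It does not call TensorFlow. The helper names are hypothetical, returning -1 for an unseen value is this sketch's choice, and MD5 is used only because it gives the same result on every run, not because TensorFlow hashes this way.

```python
import hashlib


def index_with_keys(value, keys):
    """Map a string to its position in a fixed vocabulary, or -1 if unseen."""
    return keys.index(value) if value in keys else -1


def index_with_hash_bucket(value, hash_bucket_size):
    """Map any string to a bucket id in the range [0, hash_bucket_size)."""
    # Illustrative hash only; an assumption, not TensorFlow's implementation.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


fuel_keys = ["diesel", "petrol"]
print(index_with_keys("petrol", fuel_keys))    # known value: position in the list
print(index_with_keys("electric", fuel_keys))  # unknown value
print(index_with_hash_bucket("honda", 1000))   # no vocabulary needed up front
```

A key list suits a small, known vocabulary such as fuel type. A hash bucket suits an open-ended one such as make, at the cost that two different strings can land in the same bucket.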
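Here is my attempt. The input_fn function creates a dictionary of tensors, which is passed through wrappers to the fit and evaluate methods. The dictionary's keys name the columns and so define which data will be used; the constant-valued tensors stored under those keys are the data itself. The model is then told which of those columns to use through the feature_columns parameter: feature_columns=[gear,mpg,cyl...].
I left out all the crossed-column stuff, but it could be put in.
I turned off WARNINGs, but the switch is there if you need them. This also produces a surprising amount of log data, so be sure to look at the graphs with TensorBoard.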
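To make the return value of input_fn concrete, here is a sketch that mimics its shape with plain Python containers and no tensors. The names `to_sparse_parts` and `plain_input_fn` and the two-row `toy` table are made up for illustration. Each categorical column is described by the three parts a sparse tensor needs: indices, values and a dense shape.

```python
def to_sparse_parts(values):
    """Describe a column of strings as (indices, values, dense_shape).

    Row i holds exactly one string, stored at position [i, 0] of an
    n-by-1 sparse matrix.
    """
    indices = [[i, 0] for i in range(len(values))]
    dense_shape = [len(values), 1]
    return indices, list(values), dense_shape


def plain_input_fn(columns, label_name, categorical_names):
    """Return (features, label) in the same layout input_fn uses."""
    features = {}
    for name, column in columns.items():
        if name == label_name:
            continue  # the label is returned separately, not as a feature
        if name in categorical_names:
            features[name] = to_sparse_parts(column)
        else:
            features[name] = list(column)
    return features, list(columns[label_name])


# Toy two-row table; illustrative values only.
toy = {"gear": ["FOUR", "THREE"], "mpg": [21.0, 18.7], "label": [1, 0]}
features, label = plain_input_fn(toy, "label", {"gear"})
print(features["gear"])
print(features["mpg"])
print(label)
```

The dictionary keys ("gear", "mpg") are what the column definitions later refer to by name, so the two must match exactly.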
# an experiment with regression in Tensorflow using one categorical feature
# MTCARS - auto data. Is the car an Automatic or a Manual Shift?
# Data set location: https://vincentarelbundock.github.io/Rdatasets/datasets.html
# Below is a HIGHLY cut down version of the tensorflow wide tutorial at:
# https://www.tensorflow.org/tutorials/wide/
import tensorflow as tf
import numpy as np
import urllib.request
import tempfile
import pandas as pd
from sklearn.model_selection import train_test_split
LABEL_COLUMN = "label"
COLUMNS = ["mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"]
CONTINUOUS_COLUMNS = ["mpg","cyl","disp","hp","drat","wt","qsec","vs","carb"]
CATEGORICAL_COLUMNS = ["gear"]
# had to update the urllib stuff for 3.5.
# pull down csv file
# I am running on ubuntu 14.04, so I don't know how well the tempfile stuff will work on Windows.
# NamedTemporaryFile might have problems
data_file = tempfile.NamedTemporaryFile()
urllib.request.urlretrieve("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", data_file.name)
cars = pd.read_csv(data_file, names=COLUMNS, skipinitialspace=True,skiprows=1)
# I want the "am" column as my label, so rename it - not really necessary,
# just trying to stay in sync with the wide tutorial
# am: 0 = Automatic 1 = Manual
cars.rename(columns={'am':LABEL_COLUMN}, inplace=True)
# turn gears into a categorical variable, again not really useful, but I want some categorical data
# turn the numbers into strings. I'm sure there is a one-liner somewhere that can do this...
cars['gear'] = cars['gear'].astype(str)
cars['gear'] = cars['gear'].replace({'3': 'THREE'}, regex=True)
cars['gear'] = cars['gear'].replace({'4': 'FOUR'}, regex=True)
cars['gear'] = cars['gear'].replace({'5': 'FIVE'}, regex=True)
# split into train and tests set - there is a woefully small number of rows here. Need a bigger data set.
train, test = train_test_split(cars, test_size = 0.2)
# These methods are a copy of the input functions from the tensorflow wide tutorial updated for python 3.5
def input_fn(df):
    # Creates a dictionary mapping from each continuous feature column name (k) to
    # the values of that column stored in a constant Tensor.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    # Note: TensorFlow 1.0 and later call the `shape` argument `dense_shape`.
    categorical_cols = {k: tf.SparseTensor(
        indices=[[i, 0] for i in range(df[k].size)],
        values=df[k].values,
        shape=[df[k].size, 1])
                        for k in CATEGORICAL_COLUMNS}
    # Merges the two dictionaries into one.
    # Old CODE
    #feature_cols = dict(continuous_cols.items() + categorical_cols.items())
    # NEW CODE - python 3.5
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    # Converts the label column into a constant Tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    # Returns the feature columns and the label.
    return feature_cols, label

def train_input_fn():
    return input_fn(train)

def eval_input_fn():
    return input_fn(test)
# shut down WARNINGs
# You can adjust by using DEBUG, INFO, WARN, ERROR, or FATAL
tf.logging.set_verbosity(tf.logging.ERROR)
# set up the TF column for the categorical variable
gear = tf.contrib.layers.sparse_column_with_keys(column_name="gear", keys=["THREE", "FOUR", "FIVE"])
# if my categorical data had more than 10 keys, I would use:
#gear = tf.contrib.layers.sparse_column_with_hash_bucket("gear", hash_bucket_size=1000)
# set up the TF columns for the continuous variables
mpg = tf.contrib.layers.real_valued_column("mpg")
cyl = tf.contrib.layers.real_valued_column("cyl")
disp = tf.contrib.layers.real_valued_column("disp")
hp = tf.contrib.layers.real_valued_column("hp")
drat = tf.contrib.layers.real_valued_column("drat")
wt = tf.contrib.layers.real_valued_column("wt")
qsec = tf.contrib.layers.real_valued_column("qsec")
vs = tf.contrib.layers.real_valued_column("vs")
carb = tf.contrib.layers.real_valued_column("carb")
# Build the model. Make sure the logs dir already exists.
model_dir = "./logs"
m = tf.contrib.learn.LinearClassifier(
    feature_columns=[gear, mpg, cyl, disp, hp, drat, wt, qsec, vs, carb],
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.01,
        l1_regularization_strength=1.0,
        l2_regularization_strength=1.0),
    model_dir=model_dir)
m.fit(input_fn=train_input_fn,steps=200)
# Results were not bad for a very small data set, but the recall is suspect
# In reality, these numbers don't mean a thing with such small data
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))
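On the one-liner that the gear conversion comment wonders about: one possible approach, sketched here on a small stand-in Series and not on the real mtcars frame, is to map the integer values through a dictionary. The name `GEAR_NAMES` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the cars["gear"] column; the values are illustrative only.
gear = pd.Series([4, 3, 5, 4], name="gear")

GEAR_NAMES = {3: "THREE", 4: "FOUR", 5: "FIVE"}

# Series.map looks each value up in the dict. Values missing from the dict
# become NaN, which makes an unexpected gear count easy to spot.
gear_as_text = gear.map(GEAR_NAMES)
print(gear_as_text.tolist())
```

Unlike a regex replace on strings, this matches whole values, so a digit that appears inside a longer number would not be rewritten.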