Python Machine Learning Trained Classifier Error: index is out of bounds
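I have a trained classifier that works fine.
I tried to modify it to process multiple .csv files in a loop, but this has broken it, to the point that the original code (which worked fine) now returns the same error for .csv files that it previously processed without any problems.
I am very confused and cannot see why this error would suddenly appear when everything worked before. The original (working) code is: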
# -*- coding: utf-8 -*-
import csv
import pandas
import numpy as np
import sklearn.ensemble as ske
import re
import os
import collections
import pickle
from sklearn.externals import joblib
from sklearn import model_selection, tree, linear_model, svm
# Load dataset
url = 'test_6_During_100.csv'
dataset = pandas.read_csv(url)
dataset.set_index('Name', inplace = True)
##dataset = dataset[['ProcessorAffinity','ProductVersion','Handle','Company',
## 'UserProcessorTime','Path','Product','Description',]]
# Open file to output everything to
new_url = re.sub('\.csv$', '', url)
f = open(new_url + " output report", 'w')
f.write(new_url + " output report\n")
f.write("\n")
# shape
print(dataset.shape)
print("\n")
f.write("Dataset shape " + str(dataset.shape) + "\n")
f.write("\n")
clf = joblib.load(os.path.join(
    os.path.dirname(os.path.realpath(__file__)),
    'classifier/classifier.pkl'))
Class_0 = []
Class_1 = []
prob = []
for index, row in dataset.iterrows():
    res = clf.predict([row])
    if res == 0:
        # NOTE: 'malware' is not defined anywhere in this snippet as posted;
        # the looped version below checks Class_0 at this point instead.
        if index in malware:
            Class_0.append(index)
        elif index in Class_1:
            Class_1.append(index)
        else:
            print "Is ", index, " recognised?"
            designation = raw_input()
            if designation == "No":
                Class_0.append(index)
            else:
                Class_1.append(index)
dataset['Type'] = 1
dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0
print "\n"
results = []
results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
print (results)
X = dataset.drop(['Type'], axis=1).values
Y = dataset['Type'].values
clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
clf.fit(X, Y)
joblib.dump(clf, 'classifier/classifier.pkl')
output = collections.Counter(Class_0)
print "Class_0; \n"
f.write ("Class_0; \n")
for key, value in output.items():
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))
print "\n"
f.write ("\n")
output_1 = collections.Counter(Class_1)
print "Class_1; \n"
f.write ("Class_1; \n")
for key, value in output_1.items():
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))
print "\n"
f.close()
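My new code is the same, but wrapped in a couple of nested loops to keep the script running while there are files in the folder to process. The new code (the code that causes the error) is below: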
# -*- coding: utf-8 -*-
import csv
import pandas
import numpy as np
import sklearn.ensemble as ske
import re
import os
import time
import collections
import pickle
from sklearn.externals import joblib
from sklearn import model_selection, tree, linear_model, svm
# Our arrays which we'll store our process details in and then later print out data for
Class_0 = []
Class_1 = []
prob = []
results = []
# Open file to output our report to
timestr = time.strftime("%Y%m%d%H%M%S")
f = open(timestr + " output report.txt", 'w')
f.write(timestr + " output report\n")
f.write("\n")
count = len(os.listdir('.'))
# NOTE: count is never updated in the code as posted, so this loop
# does not terminate on its own.
while (count > 0):
    # Load dataset
    for filename in os.listdir('.'):
        if filename.endswith('.csv') and filename.startswith("processes_"):
            url = filename
            dataset = pandas.read_csv(url)
            dataset.set_index('Name', inplace = True)
            clf = joblib.load(os.path.join(
                os.path.dirname(os.path.realpath(__file__)),
                'classifier/classifier.pkl'))
            for index, row in dataset.iterrows():
                res = clf.predict([row])
                if res == 0:
                    if index in Class_0:
                        Class_0.append(index)
                    elif index in Class_1:
                        Class_1.append(index)
                    else:
                        print "Is ", index, " recognised?"
                        designation = raw_input()
                        if designation == "No":
                            Class_0.append(index)
                        else:
                            Class_1.append(index)
            dataset['Type'] = 1
            dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0
            print "\n"
            results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
            print (results)
            X = dataset.drop(['Type'], axis=1).values
            Y = dataset['Type'].values
            clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
            clf.fit(X, Y)
            joblib.dump(clf, 'classifier/classifier.pkl')
            os.remove(filename)
output = collections.Counter(Class_0)
print "Class_0; \n"
f.write ("Class_0; \n")
for key, value in output.items():
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))
print "\n"
f.write ("\n")
output_1 = collections.Counter(Class_1)
print "Class_1; \n"
f.write ("Class_1; \n")
for key, value in output_1.items():
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))
print "\n"
f.close()
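The error (IndexError: index 1 is out of bounds for size 1) references the prediction line res = clf.predict([row]). As far as I can tell, the problem is that there are not enough "classes" or label types in the data (I am going for a binary classifier)? But I have been using this exact method before (outside of the nested loop) without any issues.
https://codeshare.io/Gkpb44 - a codeshare link to my .csv data for the .csv file mentioned above.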
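The problem is that [row] is an array of length 1. Your program is trying to access index 1, which does not exist (indices start at 0). It looks like you might want to do res = clf.predict(row), or take another look at the row variable. Hope this helps.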
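So I have realised what the problem was.
I had created a setup where the classifier is loaded and then, using warm_start, I refit on the data to update the classifier, in an attempt to emulate incremental / online learning. This worked well when I processed data that contained both classes. However, if the data contains only positives, then refitting the classifier breaks it.
For now I have commented out the following: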
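To make the single-class case easier to spot, a batch's labels can be compared against the classes the classifier is supposed to know about. This is only a minimal sketch, not code from the script above: `label_counts` and `missing_classes` are hypothetical helper names, and the default classes 0 and 1 are an assumption based on the binary setup described in the question.

```python
from collections import Counter


def label_counts(labels):
    """Count how many rows of the batch carry each label."""
    return Counter(labels)


def missing_classes(labels, known_classes=(0, 1)):
    """Return the known classes that do not occur in this batch.

    An empty list means every known class is represented; a non-empty
    list means a refit on this batch would only see part of the classes.
    """
    present = set(labels)
    return [cls for cls in known_classes if cls not in present]
```

In the script this could be called as `missing_classes(Y)` just before the refit. For a fitted scikit-learn classifier, the known classes could also be taken from its `classes_` attribute rather than hard-coded.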
clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
clf.fit(X, Y)
joblib.dump(clf, 'classifier/classifier.pkl')
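which has fixed the problem. Going forward, I will probably add (yet another!) conditional statement to check whether I should refit on the data.
I was tempted to delete this question, but since I did not find anything covering this during my searches, I thought I would leave it up with the answer in case anyone finds they have the same issue.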
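The conditional mentioned above could look something like the following. This is a rough sketch under stated assumptions rather than a tested fix: `has_both_classes` and `refit_if_safe` are hypothetical names, the labels are assumed to be 0 and 1, and the classifier is assumed to be a warm-startable ensemble exposing `estimators_`, `set_params` and `fit`, as in the script above.

```python
def has_both_classes(labels, expected=(0, 1)):
    """Return True only if every expected class label occurs in labels."""
    present = set(labels)
    return all(cls in present for cls in expected)


def refit_if_safe(clf, X, Y, extra_trees=40):
    """Grow the warm-started ensemble only when the batch has both classes.

    Returns True if the classifier was refitted, False if the batch
    was skipped because it contained a single class.
    """
    if not has_both_classes(Y):
        # Single-class batch: leave the classifier untouched.
        return False
    clf.set_params(n_estimators=len(clf.estimators_) + extra_trees,
                   warm_start=True)
    clf.fit(X, Y)
    return True
```

With this in place, the three commented-out lines would become something like `if refit_if_safe(clf, X, Y): joblib.dump(clf, 'classifier/classifier.pkl')`, so the pickle is only overwritten after a refit on data containing both classes.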