Why can't this program convert a string to float in Python?
What is wrong with this?
from sklearn.preprocessing import Normalizer
from pandas import read_csv
from numpy import set_printoptions
namaFile = 'dataset.csv'
nama = ['rt', 'niagak', 'niagab', 'sosum', 'soskhus', 'p', 'tni', 'ik', 'ib', 'TARGET']
dataFrame = read_csv(namaFile, names=nama)
array = dataFrame.values
#membagi array
X = array[:,0:9]
Y = array[:,9]
skala = Normalizer().fit(X)
normalisasiX = skala.transform(X)
#data hasil
set_printoptions(precision = 3)
print(normalisasiX[0:10,:])
When I run this program, I get:
File "C:\Users\Dini\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'ib'
Please help me.
I was looking at the docs (the same page @OliverRadini mentioned), and that page states the following:
header : int, list of int, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
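You are defining the names in your code, so the file should not also contain a header row. Either do one (write the headers in the CSV data) or the other (write the column names in the code). Don't do both.
Edit: my answer stands, but here is a way you could have found this out yourself:
Using the following CSV data (what you showed in your picture):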
BULAN,rt,nigak,niagab,sosum,soskhus,p,tni,ik,ib,TARGET
13-Jan,84876,902,1192,2098,3623,169,39,133,1063,94095
13-Feb,79194,902,1050,2109,3606,153,39,133,806,87992
13-Mar,75836,902,1060,1905,3166,161,39,133,785,83987
13-Apr,75571,902,112,1878,3190,158,39,133,635,82618
13-May,83797,1156,134,1900,3518,218,39,133,709,91604
13-Jun,91648,1291,127,2220,3596,249,39,133,659,99967
13-Jul,79063,1346,107,1844,3428,247,39,133,951,86798
and running this code...
from pandas import read_csv
from numpy import set_printoptions
namaFile = 'dataset.csv'
nama = ['rt', 'niagak', 'niagab', 'sosum', 'soskhus', 'p', 'tni', 'ik', 'ib', 'TARGET']
dataFrame = read_csv(namaFile, names=nama)
array = dataFrame.values
print("with names=nama...")
print(array)
dataFrame = read_csv(namaFile)
array = dataFrame.values
print("with no names...")
print(array)
dataFrame = read_csv(namaFile, names=nama, header=0)
array = dataFrame.values
print("with names=nama and header=0...")
print(array)
you get this output:
with names=nama...
[['rt' 'nigak' 'niagab' 'sosum' 'soskhus' 'p' 'tni' 'ik' 'ib' 'TARGET']
['84876' '902' '1192' '2098' '3623' '169' '39' '133' '1063' '94095']
['79194' '902' '1050' '2109' '3606' '153' '39' '133' '806' '87992']
['75836' '902' '1060' '1905' '3166' '161' '39' '133' '785' '83987']
['75571' '902' '112' '1878' '3190' '158' '39' '133' '635' '82618']
['83797' '1156' '134' '1900' '3518' '218' '39' '133' '709' '91604']
['91648' '1291' '127' '2220' '3596' '249' '39' '133' '659' '99967']
['79063' '1346' '107' '1844' '3428' '247' '39' '133' '951' '86798']]
with no names...
[['13-Jan' 84876 902 1192 2098 3623 169 39 133 1063 94095]
['13-Feb' 79194 902 1050 2109 3606 153 39 133 806 87992]
['13-Mar' 75836 902 1060 1905 3166 161 39 133 785 83987]
['13-Apr' 75571 902 112 1878 3190 158 39 133 635 82618]
['13-May' 83797 1156 134 1900 3518 218 39 133 709 91604]
['13-Jun' 91648 1291 127 2220 3596 249 39 133 659 99967]
['13-Jul' 79063 1346 107 1844 3428 247 39 133 951 86798]]
with names=nama and header=0...
[[84876 902 1192 2098 3623 169 39 133 1063 94095]
[79194 902 1050 2109 3606 153 39 133 806 87992]
[75836 902 1060 1905 3166 161 39 133 785 83987]
[75571 902 112 1878 3190 158 39 133 635 82618]
[83797 1156 134 1900 3518 218 39 133 709 91604]
[91648 1291 127 2220 3596 249 39 133 659 99967]
[79063 1346 107 1844 3428 247 39 133 951 86798]]
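Here we can see clearly that when you pass only names=nama, the header row from the file comes through as the first row of data and every value is read as a string, which is not what we want and is why the conversion to float fails. When you remove names=nama, you get all of the data from the file. When you explicitly overwrite the names with names=nama, header=0, you also get the desired result. I would also point out that the headers in your code are missing the BULAN column, so be careful.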
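Putting those pieces together, the original program could be repaired along the following lines. This is only a sketch under a few assumptions that are not in the question: the CSV text is inlined through StringIO instead of being read from dataset.csv, 'BULAN' is added to the front of the name list, and index_col='BULAN' is used so the month labels stay out of the numeric array.

```python
from io import StringIO

import numpy as np
from pandas import read_csv
from sklearn.preprocessing import Normalizer

# Sample rows copied from the CSV shown above; StringIO stands in for dataset.csv.
csv_text = """BULAN,rt,nigak,niagab,sosum,soskhus,p,tni,ik,ib,TARGET
13-Jan,84876,902,1192,2098,3623,169,39,133,1063,94095
13-Feb,79194,902,1050,2109,3606,153,39,133,806,87992
13-Mar,75836,902,1060,1905,3166,161,39,133,785,83987
13-Apr,75571,902,112,1878,3190,158,39,133,635,82618
13-May,83797,1156,134,1900,3518,218,39,133,709,91604
13-Jun,91648,1291,127,2220,3596,249,39,133,659,99967
13-Jul,79063,1346,107,1844,3428,247,39,133,951,86798
"""

# Assumption: 'BULAN' is added so that every column in the file has a name.
nama = ['BULAN', 'rt', 'niagak', 'niagab', 'sosum', 'soskhus',
        'p', 'tni', 'ik', 'ib', 'TARGET']

# header=0 consumes the header row in the file and replaces it with nama;
# index_col='BULAN' moves the month labels into the index.
dataFrame = read_csv(StringIO(csv_text), names=nama, header=0, index_col='BULAN')
array = dataFrame.values

X = array[:, 0:9]  # the nine feature columns
Y = array[:, 9]    # the TARGET column

# Normalizer rescales each row to unit L2 norm.
normalisasiX = Normalizer().fit_transform(X)

np.set_printoptions(precision=3)
print(normalisasiX)
```

With the header row out of the data, every value in X is numeric, so Normalizer no longer has a string to convert.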
print() is your friend. Use it. It will tell you what your problem is.
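In the same spirit, printing the first rows and the column dtypes of the DataFrame tends to expose this kind of problem before scikit-learn ever sees the data. The sketch below is illustrative only: the two-row CSV text, the 'BULAN' entry in the name list, and the variable names bad and good are assumptions made for the example.

```python
from io import StringIO

from pandas import read_csv
from pandas.api.types import is_numeric_dtype

# Two sample rows from the CSV above, inlined so the example is self-contained.
csv_text = """BULAN,rt,nigak,niagab,sosum,soskhus,p,tni,ik,ib,TARGET
13-Jan,84876,902,1192,2098,3623,169,39,133,1063,94095
13-Feb,79194,902,1050,2109,3606,153,39,133,806,87992
"""

nama = ['BULAN', 'rt', 'niagak', 'niagab', 'sosum', 'soskhus',
        'p', 'tni', 'ik', 'ib', 'TARGET']

# names without header=0: the header row in the file is read as a data row.
bad = read_csv(StringIO(csv_text), names=nama)
# names with header=0: the header row in the file is replaced by nama.
good = read_csv(StringIO(csv_text), names=nama, header=0)

print(bad.head(2))   # the first row holds the header text
print(bad.dtypes)    # no column is reported as a numeric dtype

# List which of the value columns pandas managed to parse as numbers.
bad_numeric = [col for col in nama[1:] if is_numeric_dtype(bad[col])]
good_numeric = [col for col in nama[1:] if is_numeric_dtype(good[col])]
print("numeric columns, names only:", bad_numeric)
print("numeric columns, names and header=0:", good_numeric)
```

A column that should hold numbers but is not reported as numeric is a strong hint that some non-numeric text, such as a stray header row, ended up in the data.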