Python - Cannot convert string to float due to headers being captured in the conversion
I am trying to convert specific columns from string to float in Python, but I always end up with this error:
cannot convert string to float: 'Tour Delay Minutes'
Tour Delay Minutes is the name of one of the columns. It contains values such as 6.31, or whole numbers such as 9 or 10 when the result is an integer. My code is:
import pandas as pd
import numpy as np
data = pd.read_csv(r'H:\testing.csv', thousands=',') #Raw string, so that \t in the path is not read as a tab
data.drop([0], axis=1) #Removes the header? based on another post
cols = ['Tour Delay Minutes', 'Passenger Delay Minutes', 'Driver Delay Minutes', 'Engine Failures', 'Vehicle Failures'] #Columns containing ints and floats
for col in cols: #Loop to transform all column strings to floats by default
    data[col] = data[col].astype(dtype=np.float64)
data.info()
The dtypes assigned on load are:
Unnamed: 0 int64
Time Period object #contains day,midday,early afternoon
Tour Number object #contains integers
Tour Delay Minutes object #contains float numbers
Passenger Delay Minutes object #contains float numbers
Driver Delay Minutes object #contains float numbers
Engine Failures object #contains integer numbers
Vehicle Failures object #contains integer numbers
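I assume the error also applies to every other column marked as object (as shown above), because Python is trying to convert the header (row 1) as well. Is there a way around this? I also tried the following code, but it did not work: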
data['Tour Delay Minutes'].astype(str).astype(float)
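One way to see which values are blocking the conversion is to coerce the column with pd.to_numeric and list the entries that turn into NaN. The following is only a minimal sketch: the two-column frame is a made-up miniature of the file, and it assumes the real column has no genuinely missing values (otherwise those would show up in the list as well).

```python
import io
import pandas as pd

# Hypothetical miniature of the file: the header row is repeated inside the data.
raw = io.StringIO(
    "Time Period,Tour Delay Minutes\n"
    "2018/19-P08,11\n"
    "Time Period,Tour Delay Minutes\n"
    "2018/19-P09,6.31\n"
)
sample = pd.read_csv(raw)

# errors="coerce" turns anything that is not a number into NaN instead of raising.
converted = pd.to_numeric(sample["Tour Delay Minutes"], errors="coerce")

# The entries that failed to convert are the strings that break astype(float).
offending = sample.loc[converted.isna(), "Tour Delay Minutes"]
print(offending.tolist())  # ['Tour Delay Minutes']
```

If the list contains column names, the header is present as ordinary data rows somewhere in the file, not only on the first line.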
Edit: added a sample dataset to help find a solution; see the link:
https://i.stack.imgur.com/o4zcX.png
Unnamed: 0 (index),Time Period,Tour Number,Tour Delay Minutes,Passenger Delay Minutes,Driver Delay Minutes,Engine Failures,Vehicle Failures
0,2018/19-P08,261803,11,6,5,2,0
1,2018/19-P08,325429,16,12,4,0,0
2,2018/19-P08,359343,14,5,9,0,0
3,2018/19-P08,366609,18,10,8,0,0
4,2018/19-P08,370697,63,37,26,2,0
5,2018/19-P08,392535,1474,140,1334,37.1194012,0.022591857
6,2018/19-P09,394752,0,0,0,0,0
7,2018/19-P09,408713,31,13,18,1.25,0
8,2018/19-P09,433763,62,49,13,4.766666667,1
9,2018/19-P09,440100,0,0,0,1,1
10,2018/19-P09,440258,17,14,3,1,0
11,2018/19-P10,440280,46,46,0,2.933333333,2
12,2018/19-P10,440929,22,7,15,1,0
13,2018/19-P10,441110,26,13,13,0,0
14,2018/19-P10,441585,4,0,4,0,0
15,2018/19-P10,442092,39,12,27,1.923076923,0
16,2018/19-P11,442105,0,0,0,0,0
17,2018/19-P11,442173,3,0,3,0,0
18,2018/19-P11,443580,4,2,2,0.428571429,0
19,2018/19-P11,443594,3,2,1,0.285714286,0
20,2018/19-P12,443599,2,1,1,0.285714286,0
21,2018/19-P12,443709,5,0,5,0,0
22,2018/19-P12,443885,3,0,3,0,0
23,2018/19-P12,444040,15,9,6,0.857142857,0
24,2018/19-P12,445021,3,0,3,0,0
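Edit 2: added an actual sample dataset; the image link is still available.
After a few days of testing files and running different scripts, I think I have worked out my problem.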
The issue was that duplicate headers were being added from another CSV
combine script and that caused problems when trying to convert the
columns of my master file.
My question now is: how do I remove the duplicate headers from the master CSV file (around 17 million rows)?
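For a file of that size, one option is to stream it in chunks and skip the stray header rows while writing a cleaned copy, so the whole file never has to sit in memory. This is a sketch under assumptions, not a tested recipe for the real file: drop_header_rows, master.csv, cleaned.csv and the chunk size are made-up names and values, and it assumes every repeated header row carries the literal text 'Time Period' in the Time Period column.

```python
import os
import tempfile
import pandas as pd

def drop_header_rows(src, dst, column="Time Period", chunksize=1_000_000):
    """Copy src to dst chunk by chunk, skipping rows whose `column` equals its own name."""
    removed = 0
    first = True
    # dtype=str keeps every value as text, so no chunk fails on mixed types.
    for chunk in pd.read_csv(src, dtype=str, chunksize=chunksize):
        is_header = chunk[column] == column
        removed += int(is_header.sum())
        # Overwrite on the first chunk, append afterwards; write the header only once.
        chunk[~is_header].to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
    return removed

# Tiny demonstration on a temporary file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "master.csv")
    dst = os.path.join(tmp, "cleaned.csv")
    with open(src, "w") as f:
        f.write(
            "Time Period,Tour Delay Minutes\n"
            "2018/19-P08,11\n"
            "Time Period,Tour Delay Minutes\n"
            "2018/19-P09,6.31\n"
        )
    removed = drop_header_rows(src, dst, chunksize=2)
    cleaned = pd.read_csv(dst)

print(removed, cleaned["Tour Delay Minutes"].dtype)  # 1 float64
```

Reading the cleaned copy back lets pandas infer numeric dtypes again, because no text rows remain in the numeric columns.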
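Edit: duplicate headers issue solved
I followed the suggestion provided by jezrael and was able to remove all the duplicate rows that contained headers. The link is attached for those interested. Many thanks to the users who contributed to solving this issue.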
y = x[~x['Time Period'].str.contains('Time Period')]
#The above helped me remove all applicable rows that contained the string "Time Period"
y.to_csv(r"H:\modded.csv") #Raw strings, so the backslash in the path is taken literally
data1 = pd.read_csv(r"H:\modded.csv")
data1.dtypes
#I then save "y" as a new CSV file, load the new dataset and voila, the columns containing numbers have changed dtypes to float64.
How to drop rows from pandas data frame that contains a particular string in a particular column?
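The round trip through a new CSV file is not strictly required to get numeric dtypes. Once the header rows are filtered out, the columns can probably be converted in memory with astype. A minimal sketch, assuming a small made-up frame in place of x and a shortened cols list:

```python
import pandas as pd

# Hypothetical stand-in for `x`: numbers stored as text, plus one stray header row.
x = pd.DataFrame({
    "Time Period": ["2018/19-P08", "Time Period", "2018/19-P09"],
    "Tour Delay Minutes": ["11", "Tour Delay Minutes", "6.31"],
    "Engine Failures": ["2", "Engine Failures", "1.25"],
})
cols = ["Tour Delay Minutes", "Engine Failures"]

# Same filter as above: drop rows where the column holds its own name.
y = x[~x["Time Period"].str.contains("Time Period")]

# With the header rows gone, the remaining strings all parse as numbers.
y = y.astype({col: "float64" for col in cols})
print(y.dtypes)
```

This avoids writing and re-reading the file when the data fits in memory.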