Python - Cannot convert string to float due to headers being captured in the conversion

I am trying to convert specific columns from string to float in Python, but I always end up with this error:

cannot convert string to float: 'Tour Delay Minutes'

Tour Delay Minutes is the name of one of the columns. It holds values such as 6.31, or whole numbers such as 9 or 10 when the result is an integer. My code is:

import pandas as pd
import numpy as np

data = pd.read_csv(r'H:\testing.csv', thousands=',')  # raw string so that \t is not read as a tab
data.drop([0], axis=1)  # Removes the header? Based on another post
cols = ['Tour Delay Minutes', 'Passenger Delay Minutes', 'Driver Delay Minutes', 'Engine Failures', 'Vehicle Failures']  # Columns containing ints and floats
for col in cols:  # Convert each column's strings to floats
    data[col] = data[col].astype(dtype=np.float64)
data.info()

The dtypes assigned on load are:

Unnamed: 0                         int64
Time Period                       object #contains day,midday,early afternoon
Tour Number                       object #contains integers
Tour Delay Minutes                object #contains float numbers
Passenger Delay Minutes           object #contains float numbers
Driver Delay Minutes              object #contains float numbers
Engine Failures                   object #contains integer numbers
Vehicle Failures                  object #contains integer numbers

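I suspect the same error applies to all the other columns marked as object (shown above), because Python is also trying to convert the header (row 1). Is there a way around this? I also tried the code below, but it did not work: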

data['Tour Delay Minutes'].astype(str).astype(float)
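
One way to see which values are blocking the conversion is to coerce the column with pd.to_numeric and then look at the rows that failed to parse. The snippet below is only a sketch: the inline CSV text is a made-up miniature of the file, and the stray header row in the middle of it is an assumption used for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature of the file, with a header row repeated mid-file.
csv_text = (
    "Time Period,Tour Number,Tour Delay Minutes\n"
    "2018/19-P08,261803,11\n"
    "Time Period,Tour Number,Tour Delay Minutes\n"
    "2018/19-P09,408713,6.31\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# errors="coerce" turns anything that is not a number into NaN instead of raising.
converted = pd.to_numeric(data["Tour Delay Minutes"], errors="coerce")

# Rows that held a value but could not be parsed are the offending ones.
bad_rows = data[converted.isna() & data["Tour Delay Minutes"].notna()]
print(bad_rows)
```

If bad_rows comes back empty, the column holds only numbers and the problem lies elsewhere.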

Edit: added a sample dataset to help find a solution; see the link:

https://i.stack.imgur.com/o4zcX.png

Unnamed: 0 (index)  Time Period  Tour Number  Tour Delay Minutes  Passenger Delay Minutes  Driver Delay Minutes  Engine Failures  Vehicle Failures
0                   2018/19-P08  261803       11                  6                        5                     2                0
1                   2018/19-P08  325429       16                  12                       4                     0                0
2                   2018/19-P08  359343       14                  5                        9                     0                0
3                   2018/19-P08  366609       18                  10                       8                     0                0
4                   2018/19-P08  370697       63                  37                       26                    2                0
5                   2018/19-P08  392535       1474                140                      1334                  37.1194012       0.022591857
6                   2018/19-P09  394752       0                   0                        0                     0                0
7                   2018/19-P09  408713       31                  13                       18                    1.25             0
8                   2018/19-P09  433763       62                  49                       13                    4.766666667      1
9                   2018/19-P09  440100       0                   0                        0                     1                1
10                  2018/19-P09  440258       17                  14                       3                     1                0
11                  2018/19-P10  440280       46                  46                       0                     2.933333333      2
12                  2018/19-P10  440929       22                  7                        15                    1                0
13                  2018/19-P10  441110       26                  13                       13                    0                0
14                  2018/19-P10  441585       4                   0                        4                     0                0
15                  2018/19-P10  442092       39                  12                       27                    1.923076923      0
16                  2018/19-P11  442105       0                   0                        0                     0                0
17                  2018/19-P11  442173       3                   0                        3                     0                0
18                  2018/19-P11  443580       4                   2                        2                     0.428571429      0
19                  2018/19-P11  443594       3                   2                        1                     0.285714286      0
20                  2018/19-P12  443599       2                   1                        1                     0.285714286      0
21                  2018/19-P12  443709       5                   0                        5                     0                0
22                  2018/19-P12  443885       3                   0                        3                     0                0
23                  2018/19-P12  444040       15                  9                        6                     0.857142857      0
24                  2018/19-P12  445021       3                   0                        3                     0                0

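Edit 2: added the actual sample dataset; the image link is still available.

After a few days of testing the file and running different scripts, I think my problem is solved.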

The issue was that duplicate headers were being added from another CSV combine script and that caused problems when trying to convert the columns of my master file.
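
The combine script itself is not shown here, so the following is only a guess at how the root cause could be avoided: when concatenating several CSV files, write the header from the first file and skip it in every later one. The function name combine_csvs and its arguments are hypothetical, and the sketch assumes all source files share the same column layout.

```python
import csv


def combine_csvs(src_paths, dst_path):
    """Concatenate CSV files into dst_path, writing the header row only once."""
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        header_written = False
        for path in src_paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                header = next(reader, None)  # first row of each file is its header
                if header is None:
                    continue  # empty file, nothing to copy
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                # Copy the remaining (data) rows of this file.
                writer.writerows(reader)
```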

My question now is: how do I remove the duplicate headers from the master CSV file (about 17 million rows)?
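
For a file of that size, one option that avoids loading everything into memory is to stream it row by row with the standard csv module and drop every row that is identical to the first header row. This is a sketch, not a tested solution for the real file: drop_repeated_headers is a hypothetical helper, and it assumes the repeated headers are exact copies of the first row and that the file is not empty.

```python
import csv


def drop_repeated_headers(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first header row.

    Returns the number of repeated header rows that were dropped.
    """
    removed = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)  # raises StopIteration on an empty file
        writer.writerow(header)
        for row in reader:
            if row == header:
                removed += 1  # a repeated header, skip it
                continue
            writer.writerow(row)
    return removed
```

Because only one row is held in memory at a time, the row count of the master file should not matter much beyond run time.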

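Edit: duplicate headers issue solved

I followed the advice provided by jezrael and was able to remove all the duplicate rows containing headers. The link is attached for anyone interested. Many thanks to the users who contributed to solving this issue.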

y = x[~x['Time Period'].str.contains('Time Period')]
# The line above removed every row whose 'Time Period' value contained the string "Time Period"

y.to_csv(r"H:\modded.csv")
data1 = pd.read_csv(r"H:\modded.csv")
data1.dtypes
# I then saved "y" as a new CSV file, loaded the new dataset, and voila: the columns containing numbers changed dtype to float64.
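
The round trip through a new CSV file works, but the same result can probably be had in memory by converting the columns right after the filter. The sketch below uses a made-up three-row frame in place of the real master file; the column subset in cols is only an example.

```python
import io

import pandas as pd

# Hypothetical stand-in for the master file, with one repeated header row.
csv_text = (
    "Time Period,Tour Delay Minutes,Engine Failures\n"
    "2018/19-P08,11,2\n"
    "Time Period,Tour Delay Minutes,Engine Failures\n"
    "2018/19-P09,6.31,1.25\n"
)
x = pd.read_csv(io.StringIO(csv_text))

# Same filter as above; .copy() so the columns can be reassigned without warnings.
y = x[~x["Time Period"].str.contains("Time Period")].copy()

# Convert each numeric column in place of saving and reloading the file.
cols = ["Tour Delay Minutes", "Engine Failures"]
for col in cols:
    y[col] = pd.to_numeric(y[col])

print(y.dtypes)
```

If the 'Time Period' column can contain missing values, str.contains accepts na=False so that those rows are treated as non-matching.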

How to drop rows from pandas data frame that contains a particular string in a particular column?
