Pandas reading file with inconsistent tabs in header

Question

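I am reading a dataset in which columns have been added to the historic files over time, and I want to read all of these files consistently. The problem is that the old files are missing a column name: the number of tabs in the header is not correct, which causes the first column to be read as the index.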

bad.csv

Col1    Col2    Col3    Col4    Col5
6   2   3           
5   2   4

A good.csv that loads correctly:

Col1    Col2    Col3    Col4    Col5    Col6
6   2   3           
5   2   4

I am reading the CSV files with:

df = pd.read_csv('bad.csv', sep='\t')

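I can detect whether a file is bad by looking at the index. How can I correct a bad file so that it loads without Col1 becoming part of the index? I tried df.shift(1, axis=1), but that does not include the index. I could set it after shifting, but I am worried this may cause further problems. For example: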

df = df.shift(1,axis=1)
df.Col1 = df.index

Is there a better way?

Answer 1

According to the docs:

Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

So I made sure every line ends with \t.

bad.csv:

col1    col2    col3    col4    col5
2   4   6   8   10
3   5   8   10  13
4   8   12  16  20  24
15  13  11  9   7   5
1   1   2   3   5   8

Then:

df = pd.read_csv('bad.csv', sep='\t', index_col=False)

Result:

   col1  col2  col3  col4  col5  Unnamed: 5
0     2     4     6     8    10         NaN
1     3     5     8    10    13         NaN
2     4     8    12    16    20        24.0
3    15    13    11     9     7         5.0
4     1     1     2     3     5         8.0
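To try this without touching files on disk, here is a minimal sketch that compares the default read with index_col=False. The in-memory BAD_CSV string is a hypothetical stand-in for the question's bad.csv, and it assumes every data row ends with trailing tabs so that each row has exactly one more field than the header:

```python
import io

import pandas as pd

# Hypothetical stand-in for bad.csv: five header names, but every data row
# carries six tab-separated fields because of the trailing tabs.
BAD_CSV = (
    "Col1\tCol2\tCol3\tCol4\tCol5\n"
    "6\t2\t3\t\t\t\n"
    "5\t2\t4\t\t\t\n"
)

# Default behaviour: each row has one more field than the header has names,
# so pandas treats the first field of every row as the index.
default = pd.read_csv(io.StringIO(BAD_CSV), sep="\t")

# index_col=False: the first field stays in Col1 and the surplus
# trailing field is dropped.
fixed = pd.read_csv(io.StringIO(BAD_CSV), sep="\t", index_col=False)

print(default)
print(fixed)
```

In the default read, the values 6 and 5 end up in the index and the remaining values shift one column to the left. With index_col=False they stay in Col1 and the frame gets an ordinary integer index. If the dropped trailing field held real data rather than being empty, this approach would lose it, so it seems best suited to files where the surplus field is only a stray delimiter.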

csv

pandas