pandas: reading a malformed CSV

Question

I received a CSV file in which , is the delimiter that separates fields, but unfortunately it is also used as the decimal mark (German notation).

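As a result, some rows have a different number of columns. Oddly, Excel parses and reads the file just fine. Is it possible to read such a file in pandas as well? So far I only get something like: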

Error tokenizing data. C error: Expected 97 fields in line 3, saw 98

Edit

Here is a minimal example:

pd.read_csv(os.path.expanduser('~/Downloads/foo.csv'), sep=',', decimal=',')

where the file ~/Downloads/foo.csv contains

first, number, third
some, 1, other
foo, 1.5, bar
baz, 1,5, some

When I load the data in R, I get:

See spec(...) for full column specifications.
Warnung: 1538 parsing failures.
row col   expected      actual
  1  -- 93 columns 97 columns 
  2  -- 93 columns 98 columns 
  3  -- 93 columns 97 columns 
  4  -- 93 columns 102 columns
  5  -- 93 columns 99 columns

Does pandas have a similarly tolerant mode?

Answer 1

First, check whether your file uses a quote character that you should be declaring to read_csv.
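As a rough illustration of the quoting case, here is a minimal sketch. It assumes a hypothetical variant of the file in which the decimal-comma numbers are wrapped in double quotes; the sample data is made up and is not the asker's file.

```python
import io

import pandas as pd

# Hypothetical sample: decimal-comma values are enclosed in double quotes,
# so the comma inside them is not treated as a field separator.
raw = 'first,number,third\nfoo,"1,5",bar\nbaz,"2,0",some\n'

# quotechar tells the parser which character encloses a field;
# decimal="," makes "1,5" parse as the float 1.5.
df = pd.read_csv(io.StringIO(raw), sep=",", quotechar='"', decimal=",")
print(df)
```

If the file has no quoting at all, as in the question's example, this does not help and the rows stay ambiguous.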

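If your file is ill-formed, there is mathematically no deterministic algorithm that can decide whether a run of characters containing a comma is two separate fields or a single number written with a decimal comma.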

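You will have to write a preprocessor that cleans up the ill-formed data using an ad-hoc algorithm that stays close to the reality of your file. This can get nasty, with rules like "I assume that a number followed by a comma followed by 3 digits is actually one field", and any other variation of such fixes.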
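A minimal sketch of such a preprocessor follows, using only the standard library. The function name repair_decimal_commas and its rule are hypothetical: it assumes that a row with exactly one field too many contains one split decimal number, and that the first pair of adjacent all-digit fields is that number. The sample is the foo.csv content from the question.

```python
import csv
import io


def repair_decimal_commas(lines, expected_fields):
    """Ad-hoc heuristic (an assumption, not a general solution): when a row
    has exactly one field too many, merge the first pair of adjacent
    all-digit fields into a single number with a decimal point."""
    repaired = []
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) == expected_fields + 1:
            for i in range(len(row) - 1):
                if row[i].isdigit() and row[i + 1].isdigit():
                    # "1", "5" becomes "1.5"
                    row[i:i + 2] = [f"{row[i]}.{row[i + 1]}"]
                    break
        repaired.append(row)
    return repaired


raw = """first, number, third
some, 1, other
foo, 1.5, bar
baz, 1,5, some
"""

rows = repair_decimal_commas(io.StringIO(raw), expected_fields=3)
for row in rows:
    print(row)
```

The repaired rows can then be written back out with csv.writer and loaded with read_csv. Note that the rule guesses wrong when two genuinely separate integer columns sit next to each other, which is exactly the ambiguity described above.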

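You may also hit cases where even that is not conclusive. Then you have no choice but to go back to the data source and ask for the data to be fixed or delivered in another file format.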

To drop the bad lines and load the rest, these parameters from the documentation will help:

error_bad_lines : boolean, default True Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. (Only valid with C parser)

warn_bad_lines : boolean, default True If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output. (Only valid with C parser).
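Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in pandas 2.0; their replacement is the on_bad_lines parameter. A minimal sketch, assuming pandas 1.3 or newer and using a space-free version of the question's sample data:

```python
import io

import pandas as pd

# The last line has four fields instead of three because of the decimal comma.
raw = "first,number,third\nsome,1,other\nfoo,1.5,bar\nbaz,1,5,some\n"

# on_bad_lines="skip" silently drops rows with too many fields;
# on_bad_lines="warn" drops them as well but emits a warning for each one.
df = pd.read_csv(io.StringIO(raw), sep=",", on_bad_lines="skip")
print(df)
```

This only discards the offending rows; it does not recover the values in them, so it is useful mainly when losing those rows is acceptable.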
