Python pandas 加载 csv 文件时缺少一些行尾字符

Question

我有一个用德语编写的 csv 文件（制表符分隔）。我没有创建文件。我尝试使用 Python 的 pandas 包读取该文件。我执行以下操作：

import pandas as pd
trn_file ="data/train.csv"
pd_train = pd.read_csv(trn_file,delimiter='\t',encoding='utf-8',header=None)
# pd_train is [1153 rows x 12 columns]
# the first  couple of rows of pd_train can be seen below:
>>> pd_train
        0                                                  1                                     2    3           4   5   6                                                7                                                8                      9     10    11
0       35  Auch in Großbritannien, wo 19 Atomreaktoren in...                              Ausstieg -1.0  2011-03-13  10  10                                     Sunday Times                                     Sunday Times           Sunday Times   NaN     1
1      117  Deswegen sollte Deutschland nicht für [...] we...                              Ausstieg  1.0  2011-04-11  60  62                                 Dietram Hoffmann                                 Dietram Hoffmann                    NaN   NaN   121

当我调查数据框时，我意识到文件没有正确解析。我的意思是，即使它们之间有换行符，我也看到看起来合并的行。例如，下面的示例显示了一个句子，但实际上它包含 4 个句子。（它们应该在数据框中的单独行中）：

>>> pd_train[1][483]
'Wer keine Brücke will, kann auch keine Brückenmaut verlangen. Eine Klage gegen die Kernbrennstoffsteuer schließe ich nicht aus.\tKonsens/Einigkeit\t-1.0\t2011-05-03\t90\t91\tEon\tJohannes Teyssen\tEon\t\t558\n3\tEin solches schicksalhaftes Langzeitprojekt ist für einen kurzsichtigen Profilierungswettstreit der Parteien ungeeignet. Deshalb müssen wir einen Konsens finden, der von einer breiten Mehrheit auf Dauer getragen wird.\tKonsens/Einigkeit\t1.0\t2011-05-10\t50\t55\tAlois Glück\tAlois Glück\tZentralkomitee der Katholiken\t31.0\t576\n1459\tWir brauchen jetzt keine Kommissionen, sondern einen neuen, breiten Konsens, der dann wirklich hält.\tKonsens/Einigkeit\t1.0\t2011-04-12\t30\t30\tClaudia Roth\tClaudia Roth\tGrüne\t34.0\t671\n1745\tDie Parteispitze zeigt sich offen für einen Konsens. Das würde die Richtigkeit des Atomausstiegs und des grünen Kurses besiegeln", sagt Steffi Lemke, politische Geschäftsführerin der Grünen.'

我该如何解决这个问题？

如果我需要提供更多信息，请告诉我。

编辑我尝试了@abby 的建议。当我给出完整路径时，没有任何改变，当我删除分隔符和编码参数时，我得到以下错误：

pd.read_csv(trn_file,header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 11 fields in line 14, saw 12

Answer 1

问题是某些文本条目包含引号字符。它们掩盖了分隔符和换行符。通过指定 quoting = csv.QUOTE_NONE 您可以关闭这种对引号字符的特殊处理。所以使用

pd_train = pd.read_csv(trn_file,delimiter='\t',encoding='utf-8',header=None,quoting = csv.QUOTE_NONE)

读取偶尔带有引号字符的文件。见 https://docs.python.org/3/library/csv.html:

csv.QUOTE_NONE

Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character. If escapechar is not set, the writer will raise Error if any characters that require escaping are encountered.

Instructs reader to perform no special processing of quote characters.

Answer 2

pd.read_csv(”train.csv", quotechar='"',skipinitialspace=True)
quotechar=‘”’ --  Any commas between these characters shouldn’t be treated as new columns.

skipinitialspace=True --  Skip spaces after delimiter.

Python pandas 加载 csv 文件时缺少一些行尾字符

Python pandas misses some end of line characters while loading csv file

python

file-processing

pandas