Pandas: Error tokenizing data when using glob.glob

Question

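I am using the following code to concatenate several files (candidate master files) that I downloaded from here; they can also be found at these links: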

https://github.com/108michael/ms_thesis/blob/master/cn06.txt
https://github.com/108michael/ms_thesis/blob/master/cn08.txt
https://github.com/108michael/ms_thesis/blob/master/cn10.txt
https://github.com/108michael/ms_thesis/blob/master/cn12.txt
https://github.com/108michael/ms_thesis/blob/master/cn14.txt

import numpy as np
import pandas as pd
import glob


df = pd.concat((pd.read_csv(f, header=None, names=['feccandid','candname',\
'party','date', 'state', 'chamber', 'district', 'incumb.challeng', \
'cand_status', '1', '2','3','4', '5', '6'  ], usecols=['feccandid', \
'party', 'date', 'state', 'chamber'])for f in glob.glob\
        ('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt')))

I get the following error:

CParserError: Error tokenizing data. C error: Expected 2 fields in line 58, saw 4

Does anyone have a clue what is causing this?

Answer 1

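The default delimiter for pd.read_csv is the comma. The files are pipe-delimited, but every candidate name is written in "Last, First" format, so pandas splits each row at the comma inside the name and sees two columns: everything before the comma and everything after it. In one of the files, a row contains additional commas, which leads pandas to find more fields than it expected (4 instead of 2 on line 58). That is the parser error.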
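The field counts in the error message can be reproduced with plain string splitting. The following is a minimal sketch; the two sample rows are invented for illustration (they only imitate the "Last, First" layout) and are not taken from the actual FEC files.

```python
# Hypothetical pipe-delimited rows; the values are invented for illustration.
sample_lines = [
    "H0XX00001|DOE, JANE|DEM|2014|AK|H",
    "S0XX00002|ROE, RICHARD, JR, III|REP|2014|AL|S",
]


def count_fields(line, sep):
    """Return how many fields a naive split on `sep` produces."""
    return len(line.split(sep))


# Splitting on the comma gives a different count per row,
# because the only commas are the ones inside the name.
comma_counts = [count_fields(line, ",") for line in sample_lines]

# Splitting on the pipe gives the same count for every row.
pipe_counts = [count_fields(line, "|") for line in sample_lines]

print(comma_counts)  # [2, 4]
print(pipe_counts)   # [6, 6]
```

The mismatch between 2 and 4 fields mirrors the "Expected 2 fields ... saw 4" message, while the pipe split yields a consistent column count.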

To use | as the delimiter instead of the comma, just change your code to pass the keyword delimiter="|" or sep="|". According to the docs, delimiter is an alias for sep.

New code:

df = pd.concat((pd.read_csv(f, header=None, delimiter="|", names=['feccandid','candname',\
'party','date', 'state', 'chamber', 'district', 'incumb.challeng', \
'cand_status', '1', '2','3','4', '5', '6'  ], usecols=['feccandid', \
'party', 'date', 'state', 'chamber'])for f in glob.glob\
    ('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt')))
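For readability, the same idea can be split into a small helper. This is only a sketch under stated assumptions: read_candidate_files, NAMES, and KEEP are names made up here, the demo files are invented and written to a temporary directory, and ignore_index=True is an extra choice (the original code keeps each file's own row index).

```python
import glob
import os
import tempfile

import pandas as pd

# Column names copied from the question; '1'..'6' are the asker's placeholders.
NAMES = ['feccandid', 'candname', 'party', 'date', 'state', 'chamber',
         'district', 'incumb.challeng', 'cand_status',
         '1', '2', '3', '4', '5', '6']
KEEP = ['feccandid', 'party', 'date', 'state', 'chamber']


def read_candidate_files(pattern):
    """Read every file matching `pattern` as pipe-delimited and stack the rows."""
    frames = [
        pd.read_csv(path, header=None, sep='|', names=NAMES, usecols=KEEP)
        for path in sorted(glob.glob(pattern))
    ]
    # pd.concat raises ValueError if no file matched the pattern.
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny invented files (hypothetical data, 15 fields per row).
with tempfile.TemporaryDirectory() as tmp:
    rows = {
        'cn06.txt': 'H0XX00001|DOE, JANE|DEM|2006|AK|H|00|C|N|a|b|c|d|e|f\n',
        'cn08.txt': 'S0XX00002|ROE, RICHARD, JR|REP|2008|AL|S|00|I|C|a|b|c|d|e|f\n',
    }
    for name, text in rows.items():
        with open(os.path.join(tmp, name), 'w') as fh:
            fh.write(text)
    df = read_candidate_files(os.path.join(tmp, 'cn*.txt'))

print(df.shape)
```

With the real data, the call would presumably take the pattern from the question instead of the temporary path. Without recursive=True, the ** in cn**.txt should match the same files as a single *.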

Answer 2

import numpy as np
import pandas as pd
import glob


df = pd.concat((pd.read_csv(f, header=None, names=['feccandid','candname', \
    'party','date', 'state', 'chamber', 'district', 'incumb.challeng', \
    'cand_status', '1', '2','3','4', '5', '6'  ],sep='|', \
    usecols=['feccandid', 'party', 'date', 'state', 'chamber'] \
    )for f in glob.glob\
    ('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt')))
print(len(df))
