IndexError while removing unwanted rows from csv

Question

I don't know why I am getting this error. I want to skip rows that contain "?" as a column value.

Sample of the csv file:

59, Private, 109015, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K
56, Local-gov, 216851, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K
19, Private, 168294, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K
54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K
39, Private, 367260, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K
49, Private, 193366, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K
23, Local-gov, 190709, Assoc-acdm, 12, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 52, United-States, <=50K
20, Private, 266015, Some-college, 10, Never-married, Sales, Own-child, Black, Male, 0, 0, 44, United-States, <=50K
45, Private, 386940, Bachelors, 13, Divorced, Exec-managerial, Own-child, White, Male, 0, 1408, 40, United-States, <=50K
30, Federal-gov, 59951, Some-college, 10, Married-civ-spouse, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K
18, Private, 226956, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, ?, <=50K

I am using Python, and here is my code:


# Load the adult dataset
import csv
f = open("./adult_data.csv")
records = csv.reader(f, delimiter = ',')

# We define a header ourselves since the dataset contains only the raw numbers.
dataset = []
header = ['Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num', 'Marital-status', 'Occupation',
  'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native-country', 'Salary'
]

for line in records:
    question_mark = True
    for i in range(len(header)):
        if (line[i] == ' ?'):
            question_mark = False
    if (question_mark):
        d = dict(zip(header, line))
        d['Age'] = int(d['Age'])
        d['Fnlwgt'] = int(d['Fnlwgt'])
        d['Education-num'] = int(d['Education-num'])
        d['Capital-gain'] = int(d['Capital-gain'])
        d['Capital-loss'] = int(d['Capital-loss'])
        d['Hours-per-week'] = int(d['Hours-per-week'])
        dataset.append(d)

This is my output:

 Output
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-6-a6f851085aed> in <module>
      12     question_mark = True
      13     for i in range(len(header)):
 ---> 14         if(line[i] == ' ?'):
      15             question_mark = False
      16     if(question_mark):

 IndexError: list index out of range

Answer 1

The lines

for i in range(len(header)):
  if (line[i] == ' ?'):

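will raise an IndexError if, for example, there is an empty line at the end of the file, or if a row does not contain the expected number of cells.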
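To see why a blank line triggers the error, here is a minimal sketch. It uses an in-memory `io.StringIO` with made-up sample text instead of the real `adult_data.csv` (an assumption for illustration only). `csv.reader` yields an empty list for a blank line, so any index into that row is out of range.

```python
import csv
import io

# Hypothetical sample: one shortened data row followed by a blank line,
# as often found at the end of a file.
sample = io.StringIO("59, Private, 109015\n\n")
rows = list(csv.reader(sample, delimiter=','))

# The blank line becomes an empty list, so it has no cell 0.
row_lengths = [len(row) for row in rows]

try:
    rows[1][0]
    raised = False
except IndexError:
    raised = True

print(row_lengths, raised)
```

A row with fewer cells than `header` fails the same way as soon as `i` reaches the row's length.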

You can solve this by iterating over the row directly instead of accessing its items by index (which some would consider poor style anyway).

for cell in line:
    if cell == ' ?':
        ...

As Furas pointed out, the code can be simplified further to

question_mark = (' ?' not in line)  # True when the row has no ' ?' cell, matching the flag in the question
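Putting the pieces together, the following is one possible sketch rather than a definitive implementation. The function name `load_clean_rows` and the `INT_COLUMNS` list are hypothetical names introduced here. It also assumes two choices that are not in the original answer: rows whose cell count differs from the header are skipped (this covers blank lines), and `skipinitialspace=True` is passed to `csv.reader` so that cells read `'?'` rather than `' ?'`.

```python
import csv
import io

HEADER = ['Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num',
          'Marital-status', 'Occupation', 'Relationship', 'Race', 'Sex',
          'Capital-gain', 'Capital-loss', 'Hours-per-week',
          'Native-country', 'Salary']
# Hypothetical helper list: the columns the question converts to int.
INT_COLUMNS = ['Age', 'Fnlwgt', 'Education-num',
               'Capital-gain', 'Capital-loss', 'Hours-per-week']


def load_clean_rows(lines):
    """Return a list of dicts for rows that are complete and contain no '?'."""
    dataset = []
    # skipinitialspace=True drops the space after each comma (an assumption
    # about the desired output; the question compares against ' ?' instead).
    for row in csv.reader(lines, delimiter=',', skipinitialspace=True):
        if len(row) != len(HEADER):
            continue  # blank line or malformed row
        if '?' in row:
            continue  # row has a missing value
        d = dict(zip(HEADER, row))
        for col in INT_COLUMNS:
            d[col] = int(d[col])
        dataset.append(d)
    return dataset


# Usage with two rows from the question's sample plus a trailing blank line.
sample = io.StringIO(
    "59, Private, 109015, HS-grad, 9, Divorced, Tech-support, Unmarried, "
    "White, Female, 0, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, "
    "Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
    "\n"
)
cleaned = load_clean_rows(sample)
print(len(cleaned))
```

With the real file, the same function could be called as `load_clean_rows(f)` inside a `with open("./adult_data.csv", newline='') as f:` block.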
