Not reading all rows while importing csv into pandas dataframe
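I am trying the Kaggle challenge here, and unfortunately I am stuck at a very basic step. I blame my limited Python knowledge.
I am trying to read the datasets into a pandas dataframe by executing the following command: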
test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")
The problem is that this file has over 300,000 records, but I am only reading 7945 rows and 21 columns.
print (test.shape)
(7945, 21)
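I have double-checked the file and I cannot find anything special about line number 7945. Any pointers as to why this could be happening? It looks like a very ordinary situation, so I hope someone who has run into this error can help me out.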
I think it is better to use the function read_csv with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False.
import pandas as pd
import csv
# error_bad_lines=False skips malformed rows; pandas 1.3+ spells this on_bad_lines="skip"
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
print(test.shape)
# (381422, 22)
But some of the data (the problematic rows) will be skipped.
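One possible reason for the gap between the file's line count and the row count, offered as an assumption rather than something confirmed in the question: with default quoting, a newline inside a quoted field stays inside that field, so one record can span many physical lines. The sketch below uses a made-up three-column sample (the column names and values are hypothetical) to show how the same text yields different row counts with and without quoting=csv.QUOTE_NONE.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the second record's Body spans three physical lines.
raw = (
    'Id,Subject,Body\n'
    '1,hello,"short body"\n'
    '2,update,"line one\n'
    'line two\n'
    'line three"\n'
    '3,bye,"done"\n'
)

# Number of physical lines in the text, header included.
physical_lines = raw.count("\n")

# Default quoting: newlines inside quotes belong to one field, so one record.
quoted = pd.read_csv(io.StringIO(raw))

# QUOTE_NONE: quotes are ordinary characters, so every physical line is a row.
unquoted = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE)

print(physical_lines, quoted.shape, unquoted.shape)
```

If this is what is happening in the real file, the larger row count from QUOTE_NONE includes fragments of multi-line fields, so it is worth inspecting a few rows before relying on it.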
If you want to skip the email body data, you can use:
import pandas as pd
import csv
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, sep=',', error_bad_lines=False, header=None,
names=["Id","DocNumber","MetadataSubject","MetadataTo","MetadataFrom","SenderPersonId","MetadataDateSent","MetadataDateReleased","MetadataPdfLink","MetadataCaseNumber","MetadataDocumentClass","ExtractedSubject","ExtractedTo","ExtractedFrom","ExtractedCc","ExtractedDateSent","ExtractedCaseNumber","ExtractedDocNumber","ExtractedDateReleased","ExtractedReleaseInPartOrFull","ExtractedBodyText","RawText"])
print(test.shape)
# drop rows with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])
# drop header rows that were read in as data
test = test[test.MetadataFrom != 'MetadataFrom']
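To see what the two cleanup steps do, here is a minimal sketch on a made-up two-column sample (the data values are hypothetical; only the MetadataFrom column name comes from the code above). Because header=None is passed together with names, the file's own header line is read as an ordinary row and has to be filtered out afterwards.

```python
import io

import pandas as pd

# Hypothetical sample: a header line, one row with an empty MetadataFrom,
# and two ordinary rows.
raw = "Id,MetadataFrom\n1,H\n2,\n3,abedin\n"

# header=None plus names: the header line becomes the first data row.
sample = pd.read_csv(io.StringIO(raw), header=None, names=["Id", "MetadataFrom"])
rows_read = len(sample)

# Drop rows where MetadataFrom is missing.
sample = sample.dropna(subset=["MetadataFrom"])

# Drop the header line that was read in as data.
sample = sample[sample.MetadataFrom != "MetadataFrom"]

print(rows_read, sample.shape)
```

Note that the Id column ends up holding strings here, since the header text was parsed into it alongside the numbers.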