ParserError while parsing a large txt file with Pandas
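I am trying to parse a large .txt file with Pandas. The file is 1.6 GB. You can download the file here (it is the GeoNames database dump of all countries and settlements).
I consulted existing answers about loading and parsing files in Pandas, and this is my code: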
import pandas as pd

for chunk in pd.read_csv(
    "allCountries.txt",
    header=None,
    engine="python",
    sep=r"\s{1,}",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    chunksize=1000,
):
    print(chunk[0])  # just printing out the first row
If I run the code above, I get the following error:
ParserError: Expected 20 fields in line 1, saw 25. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I don't know what is wrong here. Can anyone tell me what went wrong and how I can fix it?
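Your separator is wrong: some columns (such as name) contain spaces, so splitting on whitespace breaks those values apart:
2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 0 2860 Europe/Andorra 2014-11-05
This line is therefore parsed incorrectly.
This code worked for me: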
import pandas as pd

for chunk in pd.read_csv(
    "allCountries.txt",
    header=None,
    engine="python",
    sep=r"\t+",  # split on tabs instead of on any whitespace
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    chunksize=1000,
):
    print(chunk)
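To see where the 25 fields in the error message come from, the sample record can be split three ways. This is a minimal sketch using only the standard library. The tab positions and the empty fields in `fields` are an assumption, reconstructed from the 19 column names, because the tabs are not visible in the line as quoted above.

```python
import re

# Hypothetical reconstruction of the sample record: 19 tab-separated fields,
# five of which (cc2, admin2-4, elevation) are assumed to be empty.
fields = [
    "2986043",
    "Pic de Font Blanca",
    "Pic de Font Blanca",
    "Pic de Font Blanca,Pic du Port",
    "42.64991",
    "1.53335",
    "T",
    "PK",
    "AD",
    "",      # cc2
    "00",    # admin1 code
    "",      # admin2 code
    "",      # admin3 code
    "",      # admin4 code
    "0",     # population
    "",      # elevation
    "2860",  # dem
    "Europe/Andorra",
    "2014-11-05",
]
line = "\t".join(fields)

by_whitespace = re.split(r"\s+", line)  # the question's separator
by_tab_run = re.split(r"\t+", line)     # runs of tabs count as one separator
by_single_tab = line.split("\t")        # exactly one tab per separator

print(len(by_whitespace), len(by_tab_run), len(by_single_tab))
```

Under these assumptions, splitting on whitespace cuts the multi-word names into separate tokens, which is what the error message reports. Note also that `\t+` treats a run of tabs as a single separator, so empty columns disappear and later values may end up under the wrong column names; a single `\t` keeps every field in place.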
Opening the first 10 lines of the file in LibreOffice with tab as the delimiter works fine, so the same delimiter can be used in Pandas:
import csv

import pandas as pd

for chunk in pd.read_csv(
    "allCountries.txt",
    header=None,
    engine="python",
    sep="\t",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    quoting=csv.QUOTE_NONE,
    chunksize=1000,
):
    print(chunk.iloc[0])  # just printing out the first row
The file also contains the characters ' and ". By default pandas treats " as a quote character, which causes errors, but setting quoting to csv.QUOTE_NONE fixed it.
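The effect of a stray quote character can be illustrated with the standard-library csv module, whose QUOTE_* constants are the ones passed to the pandas `quoting` parameter. This is only a sketch on made-up data (the two rows in `data` are hypothetical, not taken from the GeoNames file), and it shows the csv module's behavior, not necessarily what each pandas engine does with the same input.

```python
import csv
import io

# Hypothetical two-row sample; the first row has an unbalanced double quote
# at the start of its second field.
data = '1\t"Alpha Hotel\tAD\n2\tBeta\tFR\n'

# Default quoting: the opening quote starts a quoted field that is never
# closed, so the tabs and the newline that follow are swallowed into it.
default_rows = list(csv.reader(io.StringIO(data), delimiter="\t"))

# QUOTE_NONE: the quote is kept as an ordinary character.
literal_rows = list(
    csv.reader(io.StringIO(data), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows), "row(s) with default quoting")
print(len(literal_rows), "row(s) with QUOTE_NONE")
print(literal_rows[0])
```

With QUOTE_NONE both rows come back with three fields each, and the quote stays in the data as a literal character.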