ParserError while parsing a large txt file with Pandas

I am trying to parse a large .txt file with Pandas. The file is 1.6 GB. You can download the file here (it is the GeoNames database dump of all countries and settlements).

I consulted this answer and the one here on loading and parsing files in Pandas, and this is my code:

import pandas as pd

for chunk in pd.read_csv(
    "allCountries.txt",
    header=None,
    engine="python",
    sep=r"\s{1,}",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    chunksize=1000,
):
    print(chunk[0])  # just printing out the first row

When I run the code above, I get the following error:

ParserError: Expected 20 fields in line 1, saw 25. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

I don't know what is going wrong here. Can anyone tell me what the problem is and how to fix it?

Your separator is wrong: sep=r"\s{1,}" splits on every run of whitespace, and one of your columns (name) contains spaces:

2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 0 2860 Europe/Andorra 2014-11-05

This row is parsed incorrectly.
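To see where the extra fields come from, here is a minimal sketch that rebuilds the row above as tab-separated text and splits it both ways. The FIELDS list is a reconstruction: which fields are empty is an assumption based on the column order in the question, and the check uses plain re.split rather than pandas.

```python
import re

# Reconstruction of the sample row as 19 tab-separated fields.
# Assumption: the positions of the empty fields are guessed from the column order.
FIELDS = [
    "2986043", "Pic de Font Blanca", "Pic de Font Blanca",
    "Pic de Font Blanca,Pic du Port", "42.64991", "1.53335",
    "T", "PK", "AD", "", "00", "", "", "", "0", "", "2860",
    "Europe/Andorra", "2014-11-05",
]
line = "\t".join(FIELDS)

by_whitespace = re.split(r"\s{1,}", line)  # the separator used in the question
by_tab = line.split("\t")                  # one field per tab

print(len(by_whitespace))  # 25
print(len(by_tab))         # 19
print(by_whitespace[1:5])  # ['Pic', 'de', 'Font', 'Blanca']
```

Splitting on whitespace breaks each multi-word name into separate fields, which gives the 25 fields reported in the error message, while splitting on tabs gives the 19 columns listed in names.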

This code works for me:

for chunk in pd.read_csv(
    "allCountries.txt",
    header=None,
    engine="python",
    sep=r"\t+",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    chunksize=1000,
):
    print(chunk)
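One thing that may be worth checking with this separator: the pattern \t+ treats a run of consecutive tabs as a single delimiter, so empty fields disappear and later values can shift into the wrong columns. The sketch below illustrates this with re.split on a reconstructed row; the FIELDS list and the positions of its empty fields are assumptions, and it shows the behavior of the regex itself, not of pandas on the real file.

```python
import re

NAMES = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude",
    "longitude", "feature class", "feature code", "country code", "cc2",
    "admin1 code", "admin2 code", "admin3 code", "admin4 code",
    "population", "elevation", "dem", "timezone", "modification date",
]

# Assumption: a reconstructed row in which cc2, admin2-4 and elevation are empty.
FIELDS = [
    "2986043", "Pic de Font Blanca", "Pic de Font Blanca",
    "Pic de Font Blanca,Pic du Port", "42.64991", "1.53335",
    "T", "PK", "AD", "", "00", "", "", "", "0", "", "2860",
    "Europe/Andorra", "2014-11-05",
]
line = "\t".join(FIELDS)

# r"\t+" merges consecutive tabs, so the five empty fields vanish.
collapsed = dict(zip(NAMES, re.split(r"\t+", line)))
# A single "\t" keeps every field, including the empty ones.
exact = dict(zip(NAMES, line.split("\t")))

print(len(collapsed), len(exact))                  # 14 19
print(repr(collapsed["cc2"]), repr(exact["cc2"]))  # '00' ''
print(exact["timezone"])                           # Europe/Andorra
```

If the real file contains empty fields like these, a single-tab separator (sep="\t") keeps the columns aligned.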

Opening the first 10 lines of the file in LibreOffice with tab as the separator works fine:

import csv
import pandas as pd

for chunk in pd.read_csv(
    'allCountries.txt',
    header=None,
    engine="python",
    sep="\t",
    names=[
        "geonameid",
        "name",
        "asciiname",
        "alternatenames",
        "latitude",
        "longitude",
        "feature class",
        "feature code",
        "country code",
        "cc2",
        "admin1 code",
        "admin2 code",
        "admin3 code",
        "admin4 code",
        "population",
        "elevation",
        "dem",
        "timezone",
        "modification date",
    ],
    quoting=csv.QUOTE_NONE,
    chunksize=1000
):
    print(chunk.iloc[0])  # just printing out the first row

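The file also contains the characters ' and ". By default pandas treats " as the quote character, which leads to parse errors; setting quoting to csv.QUOTE_NONE fixed it.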
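The effect of a stray quote can be sketched with the standard-library csv module, whose QUOTE_NONE constant is the same one passed to the quoting parameter of pd.read_csv. The two-row data below is made up for illustration, and pandas may report the problem differently (for example by raising an error instead of merging rows), so treat this as an illustration of the mechanism only.

```python
import csv
import io

# Made-up data: the name in the first row starts with a double quote
# that is never closed.
data = '1\t"Foo\tx\n2\tBar\ty\n'

# Default quoting: the opening quote starts a quoted field that swallows
# the following tabs and newlines.
default_rows = list(csv.reader(io.StringIO(data), delimiter="\t"))

# QUOTE_NONE: quote characters are treated as ordinary text.
literal_rows = list(
    csv.reader(io.StringIO(data), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows))  # 1
print(literal_rows)       # [['1', '"Foo', 'x'], ['2', 'Bar', 'y']]
```

With quoting disabled, both rows come back with three fields each and the quote stays part of the name.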