All column values of a Pandas DataFrame row are in the first column: how to extract them into the correct columns

I am working with a messy CSV file that has several columns. Some rows have all of their column values in the first column, like this:


    City Edition Sport Discipline Athlete   NOC Gender  Event Event_gender  Medal
330 Paris,1900,Cricket,Cricket,"ROQUES, F.",FRA,Men,cricket,M,Silver    NaN NaN NaN NaN NaN NaN NaN NaN NaN
331 Paris,1900,Cricket,Cricket,"SCHNEIDAU, A.J.",FRA,Men,cricket,M,Silver   NaN NaN NaN NaN NaN NaN NaN NaN NaN
332 Paris,1900,Cricket,Cricket,"TERRY, Henry John",FRA,Men,cricket,M,Silver NaN NaN NaN NaN NaN NaN NaN NaN NaN
333 Paris,1900,Cricket,Cricket,"TOMALIN, P.H.",FRA,Men,cricket,M,Silver NaN NaN NaN NaN NaN NaN NaN NaN NaN
334 Paris   1900.0  Croquet Croquet AUMOITTE    FRA Men double  M   Gold

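The first four rows hold all of their values under the City column, while the last row has the correct value in each column. There are several hundred thousand rows, and almost all of them have this column problem. I have to keep every row that already has correct values exactly as it is.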

Edit

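My read_csv call is fine; the problem is in the file itself. The rows causing the problem are enclosed in quotation marks, the athlete name inside them is also in quotation marks, and the surname and first name are separated by a comma. So I think the best approach is to write a function that opens the file and strips the extra characters. That may be hard to do without using a lot of memory, though, since there are 300k rows...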

Try the usecols parameter of pandas.read_csv:

 pd.read_csv(data, usecols=['City', 'Edition', 'Sport', 'Discipline', 'Athlete','NOC','Gender','Event','Event_gender','Medal'])

Your data needs a fairly complicated regular expression to process, as follows:

# conda install -c conda-forge regex
import regex as re
from io import StringIO
import pandas as pd
if __name__ == '__main__':
    input_path = "data/mixed_csv.csv"
    # Python's built-in re module does not support variable-width lookbehind, hence the third-party regex module
    pat = re.compile(r'\s+(?=(?:"[^"]*?(?: [^"]*)*))|\s+(?=[^",]+(?:,|$))|,(?=(?:"[^"]*?(?: [^"]*)*))|,(?=[^",]+(?:,|$))')
    refined_lines = ""
    with open(input_path, "r") as fin:
        for line in fin:
            tokens = pat.split(line)
            refined_lines += ",".join(tokens)
    df = pd.read_csv(StringIO(refined_lines), sep=",", index_col=0)
    print(df)

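Basically, you need to understand lookahead and lookbehind regex patterns.

  1. regex1(?=(regex2)) : Positive Lookahead : matches regex1 only if it is followed by regex2
  2. regex1(?!(regex2)) : Negative Lookahead : matches regex1 only if it is not followed by regex2
  3. (?<=(regex2))regex1 : Positive Lookbehind : matches regex1 only if it is preceded by regex2
  4. (?<!(regex2))regex1 : Negative Lookbehind : matches regex1 only if it is not preceded by regex2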
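As a small illustration of the four lookaround forms, here is a sketch using Python's built-in re module, which is enough here because every lookbehind is fixed-width. The sample strings and the comma-splitting pattern are made up for this example; they are not the pattern used in the answer's code.

```python
import re

text = "price: 100 USD, 200 EUR, cost 300"

# Positive lookahead: a number only if " USD" follows it.
followed_by_usd = re.findall(r"\d+(?= USD)", text)

# Negative lookahead: a whole number only if " USD" does not follow it.
# The \b anchors stop the engine from matching part of a number.
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", text)

# Positive lookbehind: a number only if "cost " precedes it.
preceded_by_cost = re.findall(r"(?<=cost )\d+", text)

# Negative lookbehind: a whole number only if "cost " does not precede it.
not_preceded_by_cost = re.findall(r"(?<!cost )\b\d+\b", text)

# Lookahead applied to CSV-like text: split on a comma only when an even
# number of double quotes remains after it, i.e. the comma is outside quotes.
line = 'Paris,1900,Cricket,Cricket,"ROQUES, F.",FRA'
fields = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', line)

print(followed_by_usd, not_followed_by_usd, preceded_by_cost, not_preceded_by_cost)
print(fields)
```

The lookaround parts never consume characters, which is why the split keeps the quoted athlete name in one piece while still removing the commas between fields.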

Result:

      City  Edition    Sport Discipline            Athlete  NOC Gender    Event Event_gender   Medal   0   1   2   3   4   5   6   7   8   9
id                                                                                                                                          
330  Paris   1900.0  Cricket    Cricket         ROQUES, F.  FRA    Men  cricket            M  Silver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
331  Paris   1900.0  Cricket    Cricket    SCHNEIDAU, A.J.  FRA    Men  cricket            M  Silver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
332  Paris   1900.0  Cricket    Cricket  TERRY, Henry John  FRA    Men  cricket            M  Silver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
333  Paris   1900.0  Cricket    Cricket      TOMALIN, P.H.  FRA    Men  cricket            M  Silver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
334  Paris   1900.0  Croquet    Croquet           AUMOITTE  FRA    Men   double            M    Gold NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

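Notes:

  1. You should add column names to the first line for the index (id) and for the trailing NaN columns (0 to 9 in the result above).
  2. If you run out of memory, you can process the tokens of each line as you read it instead of building a data frame.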
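The line-by-line idea in the second note could look roughly like the sketch below. It does not use the regex from the answer; it swaps in the standard csv module. It assumes that each broken row, once split, has the whole record packed into its first cell (as in the question's data frame) with the remaining cells empty or NaN. The function name repair_rows and the sample rows are hypothetical.

```python
import csv
from typing import Iterable, Iterator, List


def repair_rows(rows: Iterable[List[str]]) -> Iterator[List[str]]:
    """Yield rows one at a time, unpacking rows whose first cell holds a whole record."""
    for row in rows:
        if not row:
            continue  # skip blank lines
        first, rest = row[0], row[1:]
        # Assumption: a packed row has a comma in its first cell and nothing else filled in.
        packed = "," in first and all(cell in ("", "NaN") for cell in rest)
        if packed:
            # Re-parse the packed cell as one CSV record; the csv module keeps
            # the quoted "SURNAME, Name" together as a single field.
            yield next(csv.reader([first]))
        else:
            yield row  # already correct, keep as it is


# Sample rows shaped like the data in the question.
sample = [
    ['Paris,1900,Cricket,Cricket,"ROQUES, F.",FRA,Men,cricket,M,Silver'] + [""] * 9,
    ["Paris", "1900", "Croquet", "Croquet", "AUMOITTE", "FRA", "Men", "double", "M", "Gold"],
]
fixed = list(repair_rows(sample))
print(fixed)
```

Because repair_rows is a generator, it can wrap csv.reader over an open file and feed csv.writer.writerows directly, so only one row is held in memory at a time.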