How to read csv

I have data stored in a csv file in the following format:

892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S

The data type of each column:

1. int        6. int
2. int        7. int
3. String     8. float
4. String     9. float
5. float      10. String
              11. String

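The first column (892, 893, ... 897) should be stored in an array as int. The third column, e.g. "Wilkes, Mrs. James (Ellen Needs)", should be stored as a string. However, the length of the strings in the third column is not fixed, i.e. I do not know the maximum number of characters stored in this column.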

What I have done so far:

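 import csv
 import numpy as np

 data = []
 # Python 3: open in text mode with newline='', as the csv module recommends
 with open('trainData.csv', newline='') as f:
     csv_file_object = csv.reader(f)
     header = next(csv_file_object)  # skip the header row
     for row in csv_file_object:
         data.append(row)

 data = np.array(data)  # every element is still a string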

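However, the code above reads every column as a string, even though many of the columns hold numbers. On the other hand, if I use genfromtxt, the third column is a problem because it contains commas inside double quotes.

I want each column stored with its own data type, i.e. the first column should be stored as int.

The array I expect: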

892 3 "Kelly, Mr. James" male 34.5 0 0 330911 7.8292 NaN Q
893 3 "Wilkes, Mrs. James (Ellen Needs)" female 47 1 0 363272 7 NaN S
894 2 "Myles, Mr. Thomas Francis" male 62 0 0 240276 9.6875 NaN Q
895 3 "Wirz, Mr. Albert" male 27 0 0 315154 8.6625 NaN S
896 3 "Hirvonen, Mrs. Alexander (Helga E Lindqvist)" female 22 1 1 3101298 12.2875 NaN S
897 3 "Svensson, Mr. Johan Cervin" male 14 0 0 7538 9.225 NaN S

As you can see, if a value is not available, NaN or something equivalent should be stored.

How should I read this csv file?

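I am not sure I understood you correctly, but I think this will work for you.

I implemented two helper functions that decide whether a string is a float or an integer.

If the string is empty, I store None; you can change that to whatever you like.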

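import csv

def isfloat(x):
    """Return True if the string x can be parsed as a float."""
    try:
        float(x)
    except ValueError:
        return False
    else:
        return True

def isint(x):
    """Return True if the string x is a number with no fractional part."""
    try:
        a = float(x)
        b = int(a)
    except (ValueError, OverflowError):  # int() rejects nan and inf
        return False
    else:
        return a == b


data = []
# Python 3: open in text mode with newline='', as the csv module recommends
with open('trainData.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    # header = next(csv_file_object)  # uncomment if the file starts with a header row
    for row in csv_file_object:
        for index, cell in enumerate(row):
            if not cell:  # cell == ''
                row[index] = None  # you can change the value to whatever you like.
            elif isint(cell):
                row[index] = int(float(cell))  # also handles cells such as '7.0'
            elif isfloat(cell):
                row[index] = float(cell)
        data.append(row)

print(data)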

Output:

[[892, 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, 330911, 7.8292, None, 'Q'],
 [893, 3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', 47, 1, 0, 363272, 7, None, 'S'],
 [894, 2, 'Myles, Mr. Thomas Francis', 'male', 62, 0, 0, 240276, 9.6875, None, 'Q'],
 [895, 3, 'Wirz, Mr. Albert', 'male', 27, 0, 0, 315154, 8.6625, None, 'S'],
 [896, 3, 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)', 'female', 22, 1, 1, 3101298, 12.2875, None, 'S'],
 [897, 3, 'Svensson, Mr. Johan Cervin', 'male', 14, 0, 0, 7538, 9.225, None, 'S']]
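
If you would rather fix the type of each column up front than guess it cell by cell, the same idea can be sketched with one converter per column. This is only an illustration under a few assumptions: Python 3, the column types listed in the question, NaN for every empty cell, and two of the sample rows inlined through io.StringIO so that the snippet runs without trainData.csv. The names SAMPLE, COLUMN_TYPES and convert_row are made up for this example.

```python
import csv
import io

# Two rows copied from the question, inlined so the sketch runs without a file.
SAMPLE = (
    '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S\n'
)

# One converter per column, following the type list in the question.
COLUMN_TYPES = [int, int, str, str, float, int, int, float, float, str, str]


def convert_row(row, missing=float('nan')):
    """Convert each cell with its column's type; empty cells become `missing`."""
    return [conv(cell) if cell else missing
            for conv, cell in zip(COLUMN_TYPES, row)]


# csv.reader understands the double quotes, so the comma inside a name
# does not split the third column.
rows = [convert_row(row) for row in csv.reader(io.StringIO(SAMPLE))]

for row in rows:
    print(row)
```

Unlike the per-cell guessing above, a value such as 47 in a float column comes out as 47.0, so every column ends up with a single type.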

You can do this more easily with the pandas library, as follows:

import pandas as pd

df = pd.read_csv("trainData.csv", dtype={'col1': int, 'col2': int, 'col3': str, 'col4': str, 'col5': float, 'col6': int,
                                  'col7': int, 'col8': float, 'col9': float, 'col10': str, 'col11': str})
data = df.values.tolist()  # one plain Python list per row
print(data)

Output:

[[892, 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, 330911.0, 7.8292, nan, 'Q'],
 [893, 3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', 47.0, 1, 0, 363272.0, 7.0, nan, 'S'],
 [894, 2, 'Myles, Mr. Thomas Francis', 'male', 62.0, 0, 0, 240276.0, 9.6875, nan, 'Q'],
 [895, 3, 'Wirz, Mr. Albert', 'male', 27.0, 0, 0, 315154.0, 8.6625, nan, 'S'],
 [896, 3, 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)', 'female', 22.0, 1, 1, 3101298.0, 12.2875, nan, 'S'],
 [897, 3, 'Svensson, Mr. Johan Cervin', 'male', 14.0, 0, 0, 7538.0, 9.225, nan, 'S']]

The csv file should look like this, because the first row holds the column names:

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
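
If your file has no header row, as in the sample from the question, you do not have to edit it: read_csv can be given the column names directly. A minimal sketch, assuming Python 3 and a reasonably recent pandas; col1 to col11 are the same placeholder names as above, and two sample rows are inlined through io.StringIO in place of trainData.csv.

```python
import io

import pandas as pd

# Two rows copied from the question, inlined so the sketch runs without a file.
SAMPLE = (
    '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S\n'
)

# Placeholder column names col1 ... col11.
names = ['col%d' % i for i in range(1, 12)]

df = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,  # the data starts on the first line, there is no header row
    names=names,
    dtype={'col1': int, 'col2': int, 'col3': str, 'col4': str,
           'col5': float, 'col6': int, 'col7': int,
           'col8': float, 'col9': float, 'col10': str, 'col11': str},
)

print(df.dtypes)
print(df.values.tolist())
```

The quoted names stay in one column, and the empty tenth column is read as missing values.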

You can read more about pandas here: http://pandas.pydata.org/pandas-docs/stable/tutorials.html

I assume you are using pandas, since the question is tagged pandas. Read the file like this:

import pandas as pd

df = pd.read_csv('test.txt', skiprows=0, index_col=0,
            names='city_type name sex weight has_cat has_dog bank_balance body_fat_index car_mileage car_type'.split())

This gives you a data frame. I took the liberty of naming the columns.

Once the data is in a data frame, you can work all kinds of magic with it; have a look at the pandas tutorials (they are great). Here is an example:

df.bank_balance.describe()

count          6.000000
mean      726408.166667
std      1170522.652019
min         7538.000000
25%       258995.500000
50%       323032.500000
75%       355181.750000
max      3101298.000000
Name: bank_balance, dtype: float64
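
Since the question asks for an array in which every column keeps its own type, one option is to convert the data frame afterwards. The following is a sketch rather than a definitive recipe, assuming Python 3 and that DataFrame.to_records fits your needs; the column name passenger_id is invented here for the first column, and two sample rows are inlined through io.StringIO in place of test.txt.

```python
import io

import pandas as pd

# Two rows copied from the question, inlined so the sketch runs without a file.
SAMPLE = (
    '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S\n'
)

# 'passenger_id' is a hypothetical name; the rest are the names used above.
names = ('passenger_id city_type name sex weight has_cat has_dog '
         'bank_balance body_fat_index car_mileage car_type').split()

df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=names)

# One NumPy record array; each field keeps the dtype pandas inferred.
records = df.to_records(index=False)

print(records.dtype)
print(records['bank_balance'])
```

Each field of records can be accessed by name, so the numeric columns stay numeric while the names remain Python strings.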