How to read a CSV file
I have data stored in a CSV file in the following format:
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
The data type of each column:
1. int
2. int
3. String
4. String
5. float
6. int
7. int
8. float
9. float
10. String
11. String
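The first column (892, 893, ... 897) should be stored in the array as int. The third column, e.g. "Wilkes, Mrs. James (Ellen Needs)", should be stored as a string. However, the length of the strings in the third column is not fixed, i.e. I do not know the maximum number of characters stored in this column.
What I have done so far: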
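import csv
import numpy as np

with open('trainData.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)  # first row holds the column names
    data = []
    for row in csv_file_object:
        data.append(row)  # every cell is still a string here
data = np.array(data)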
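However, the code above reads every column as a string, even though many of the columns do not hold string data. On the other hand, if I use genfromtxt, the third column is a problem because it contains a comma inside double quotes.
I want each column stored with its own data type, e.g. the first column should be stored as int.
The array I expect: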
892 3 "Kelly, Mr. James" male 34.5 0 0 330911 7.8292 NaN Q
893 3 "Wilkes, Mrs. James (Ellen Needs)" female 47 1 0 363272 7 NaN S
894 2 "Myles, Mr. Thomas Francis" male 62 0 0 240276 9.6875 NaN Q
895 3 "Wirz, Mr. Albert" male 27 0 0 315154 8.6625 NaN S
896 3 "Hirvonen, Mrs. Alexander (Helga E Lindqvist)" female 22 1 1 3101298 12.2875 NaN S
897 3 "Svensson, Mr. Johan Cervin" male 14 0 0 7538 9.225 NaN S
As you can see, where data is not available, NaN (or something equivalent) should be entered.
How should I read this CSV file?
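I'm not sure I understand exactly what you mean, but I think this will work for you.
I implemented two helper functions that decide whether a string holds a float or an integer.
If the string is empty I store None; you can change that to whatever you like.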
import csv

def isfloat(x):
    """Return True if the string x parses as a float."""
    try:
        float(x)
    except ValueError:
        return False
    return True

def isint(x):
    """Return True if the string x parses as a whole number."""
    try:
        a = float(x)
        b = int(a)
    except (ValueError, OverflowError):  # e.g. '', 'abc', 'nan', 'inf'
        return False
    return a == b

data = []
with open('trainData.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)  # skip the header row; drop this line if the file has none
    for row in csv_file_object:
        for index, cell in enumerate(row):
            if isint(cell):
                row[index] = int(float(cell))  # also accepts cells such as '7.0'
            elif isfloat(cell):
                row[index] = float(cell)
            elif not cell:  # cell == ''
                row[index] = None  # you can change the value to whatever you like
        data.append(row)
print(data)
Output:
[[892, 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, 330911, 7.8292, None, 'Q'],
[893, 3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', 47, 1, 0, 363272, 7, None, 'S'],
[894, 2, 'Myles, Mr. Thomas Francis', 'male', 62, 0, 0, 240276, 9.6875, None, 'Q'],
[895, 3, 'Wirz, Mr. Albert', 'male', 27, 0, 0, 315154, 8.6625, None, 'S'],
[896, 3, 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)', 'female', 22, 1, 1, 3101298, 12.2875, None, 'S'],
[897, 3, 'Svensson, Mr. Johan Cervin', 'male', 14, 0, 0, 7538, 9.225, None, 'S']]
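The question asks for an array rather than a list of lists, with variable-length strings in some columns. One possible way to get there, sketched below, is a NumPy structured array whose text columns use the object dtype, so no maximum string length is needed. The field names c1 to c11 and the helpers convert and to_structured are made up for this illustration, and the sketch assumes the integer columns never contain empty cells.

```python
import csv
import io

import numpy as np

# Two sample rows from the question, kept in memory so the sketch is self-contained.
SAMPLE = '''892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
'''

# Hypothetical field names; the object dtype ("O") holds strings of any length.
DTYPE = np.dtype([
    ("c1", "i8"), ("c2", "i8"), ("c3", "O"), ("c4", "O"), ("c5", "f8"),
    ("c6", "i8"), ("c7", "i8"), ("c8", "f8"), ("c9", "f8"),
    ("c10", "O"), ("c11", "O"),
])


def convert(cell, kind):
    """Convert one CSV cell according to the dtype kind of its column."""
    if kind == "i":
        return int(cell)  # assumption: integer columns have no empty cells
    if not cell:
        return np.nan     # missing float or text value
    return float(cell) if kind == "f" else cell


def to_structured(text):
    """Parse CSV text (no header row) into a structured array."""
    kinds = [DTYPE[name].kind for name in DTYPE.names]
    rows = [tuple(convert(c, k) for c, k in zip(row, kinds))
            for row in csv.reader(io.StringIO(text))]
    return np.array(rows, dtype=DTYPE)


table = to_structured(SAMPLE)
print(table["c1"])  # integer column
print(table["c3"])  # names, with the embedded commas intact
```

The csv module takes care of the commas inside double quotes, which is the part genfromtxt struggles with; the structured dtype then gives each column its own type.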
You can do this more easily with the pandas library, as follows:
import pandas as pd

df = pd.read_csv("trainData.csv",
                 dtype={'col1': int, 'col2': int, 'col3': str, 'col4': str,
                        'col5': float, 'col6': int, 'col7': int, 'col8': float,
                        'col9': float, 'col10': str, 'col11': str})
rows = df.values.tolist()  # convert the frame to a list of row lists
print(rows)
Output:
[[892, 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, 330911.0, 7.8292, nan, 'Q'],
[893, 3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', 47.0, 1, 0, 363272.0, 7.0, nan, 'S'],
[894, 2, 'Myles, Mr. Thomas Francis', 'male', 62.0, 0, 0, 240276.0, 9.6875, nan, 'Q'],
[895, 3, 'Wirz, Mr. Albert', 'male', 27.0, 0, 0, 315154.0, 8.6625, nan, 'S'],
[896, 3, 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)', 'female', 22.0, 1, 1, 3101298.0, 12.2875, nan, 'S'],
[897, 3, 'Svensson, Mr. Johan Cervin', 'male', 14.0, 0, 0, 7538.0, 9.225, nan, 'S']]
The CSV file should look like this, because the first row holds the column names:
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
You can read more about pandas here: http://pandas.pydata.org/pandas-docs/stable/tutorials.html
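If you want to try this without creating a file, here is a small sketch that feeds two of the sample rows to read_csv through io.StringIO. It relies on pandas' type inference for the numeric columns and passes dtype only for the columns it overrides; the choice of which columns to override is just an assumption for the example.

```python
import io

import pandas as pd

# Header row plus two sample rows from the question, held in memory.
SAMPLE = '''col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
'''

# Numeric columns are inferred; dtype is given only where we want to
# force a type (text for the names, float for col8).
df = pd.read_csv(io.StringIO(SAMPLE),
                 dtype={"col3": str, "col4": str, "col8": float})

print(df.dtypes)                  # one dtype per column
print(df["col10"].isna().all())   # empty cells are read as NaN
```

The quoted names keep their commas, and the empty tenth column comes back as NaN without any extra handling.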
I'm assuming you are using pandas, since the question is tagged pandas. Read the file like this:
import pandas as pd

df = pd.read_csv('test.txt', skiprows=0, index_col=0,
                 names='city_type name sex weight has_cat has_dog bank_balance body_fat_index car_mileage car_type'.split())
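This gives you a data frame indexed by the first column. I took the liberty of naming the columns.
Once the data is in a data frame, you can work all kinds of magic with it; have a look at the pandas tutorials (they are great). Here is an example: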
df.bank_balance.describe()
count 6.000000
mean 726408.166667
std 1170522.652019
min 7538.000000
25% 258995.500000
50% 323032.500000
75% 355181.750000
max 3101298.000000
Name: bank_balance, dtype: float64
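For anyone who wants to reproduce those numbers, here is a self-contained sketch with the six sample rows held in memory. Unlike the call above, it names all eleven columns explicitly; "passenger_id" is a made-up name for the first column, and the other names are the same invented ones as before.

```python
import io

import pandas as pd

# The six sample rows from the question (no header row).
SAMPLE = '''892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
'''

# "passenger_id" is a hypothetical name for the first column.
NAMES = ("passenger_id city_type name sex weight has_cat has_dog "
         "bank_balance body_fat_index car_mileage car_type").split()

# header=None tells pandas the data has no header row; names supplies one.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=NAMES,
                 index_col="passenger_id")

stats = df["bank_balance"].describe()
print(stats)
```

Giving every column a name avoids relying on how pandas lines up a shorter names list with the index column.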