Python data wrangling issues
I'm currently stuck on some basic issues with a small data set. Here are the first three rows to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00", 35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00", 200
Issues I ran into after creating the DataFrame with read_csv:
Commas in some of the column values (e.g. Prize_Pool) cause Python to treat those entries as strings. I need to convert them to floats to do certain calculations. I've used Python's replace() function to strip out the commas, but that's as far as I've gotten.
The Contest_Date_EST column contains timestamps, but some of them are duplicates. I'd like to subset the whole data set down to one with only unique timestamps. It would be nice to have the option of either dropping the duplicate entries or removing them entirely, but for now I just want to be able to filter the data down to unique timestamps.
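Roughly what I have so far for the comma problem (just a sketch; 'data.csv' is a placeholder file name):

import pandas as pd

d = pd.read_csv('data.csv')
# strip the thousands separator, then cast the column to float
d['Prize_Pool'] = d['Prize_Pool'].str.replace(',', '').astype(float)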
For numbers that contain commas, use the thousands=',' parameter:
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check that Prize_Pool is numeric:
In [3]: type(d.loc[0, 'Prize_Pool'])
Out[3]: numpy.float64
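If it helps with the timestamp side of things, read_csv can also parse Contest_Date_EST into real datetimes at load time (just an option, not required for the steps below):

In [4]: d = read_csv('data.csv', thousands=',', parse_dates=['Contest_Date_EST'])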
To drop the duplicate rows, keeping the first observation (you can also keep the last):
In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
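A couple of variations that may also be useful (my additions, not part of the original snippet): keep='last' keeps the last occurrence instead, and duplicated() gives a boolean mask you can filter with. Note that drop_duplicates returns a new frame, so assign the result back (e.g. d = d.drop_duplicates(...)) if you want to keep it.

In [8]: d.drop_duplicates('Contest_Date_EST', keep='last')
In [9]: d[~d.duplicated('Contest_Date_EST', keep=False)]  # only rows whose timestamp never repeats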
Edit: Just realized you're using pandas - should have looked at that. I'll leave this here for now in case it's applicable, but if it gets downvoted I'll take it down by virtue of peer pressure :) I'll try to update it to use pandas later tonight.
It looks like itertools.groupby() is the tool for this job; is this the sort of thing you're after?
import csv
import itertools

class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from the CSV file, grouped by timestamp
        rows = self.readCsv(filename)
        for key in rows.keys():
            print "\nKey: " + key
            i = 1
            for value in rows[key]:
                print "\nValue {index} : {value}".format(index=i, value=value)
                i += 1

    def readCsv(self, fileName):
        with open(fileName, 'rU') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may be pulled in with extra padding by DictReader();
            # this maps stripped keys back to the original padded keys if needed
            keys = {key.strip(): key for key in reader.fieldnames}
            # groupby() only groups consecutive rows with the same key,
            # so sort by timestamp first to collect all duplicates together
            sortedRows = sorted(reader, key=lambda x: x["Contest_Date_EST"])
            groupedRows = {}
            for k, g in itertools.groupby(sortedRows, lambda x: x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(v) for v in g]
            return groupedRows

    def normalizeRow(self, row):
        # dict values have no guaranteed order, so convert by column name
        row["Prize_Pool"] = float(row["Prize_Pool"].replace(',', ''))
        # ... and so on for the other comma-formatted columns
        return row

if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
Output:
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
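One caveat worth adding (my note, not part of the answer above): itertools.groupby() only groups consecutive items that share a key, so the rows have to be sorted by the grouping key first, as done in readCsv above. A tiny standalone illustration:

import itertools

data = [('13:00', 'a'), ('14:00', 'b'), ('13:00', 'c')]
# without sorted(), the two '13:00' items would land in separate groups
for k, g in itertools.groupby(sorted(data), key=lambda t: t[0]):
    print k, list(g)
# 13:00 [('13:00', 'a'), ('13:00', 'c')]
# 14:00 [('14:00', 'b')]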