Using pandas to process a massive CSV file
I am reading a bulk CSV file. The code below works fine when the file has around 5 million rows, but it fails when I try to run it on a massive file of around 300 million rows. Is there any way to improve the code, or a chunking approach that would improve the response time?
import pandas as pd
import timeit

df = pd.read_csv('/home/mahmoudod/Desktop/to_dict/text1.txt',
                 dtype='unicode',
                 index_col=False,
                 error_bad_lines=False,
                 sep=';',
                 low_memory=False,
                 names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP',
                        'LINKDOWN', 'COUNT', 'CONNECTION'])
#df.DATE = pd.to_datetime(df.DATE)
group = df.groupby(['IMSI', 'WEBSITE']).agg({'DATE': ['min', 'max'],
                                             'LINKUP': 'sum',
                                             'LINKDOWN': 'sum',
                                             'COUNT': 'max',
                                             'CONNECTION': 'sum'})
group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')
dask.dataframe provides a solution by chunking internally:

import dask.dataframe as dd

df = dd.read_csv(...)
group = df.groupby(...).aggregate({...}).compute()
group.to_csv('output.txt')

This is untested. I suggest you read the documentation to get familiar with the syntax. The key point to understand is that dd.read_csv does not read the entire file into memory, and no operations are processed until compute is called, at which point dask works through the data in chunks in constant memory.
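The same chunk-and-re-aggregate idea can also be sketched in plain pandas using read_csv's chunksize parameter, without dask. This is an untested sketch: the in-memory synthetic data and its values are hypothetical stand-ins for the real file, and it relies on the fact that min, max, and sum can safely be re-aggregated across partial chunk results:

```python
import io
import pandas as pd

# Hypothetical synthetic data standing in for the real 300M-row file.
csv_data = io.StringIO(
    "2019-01-01;111;site_a;10;20;1;5\n"
    "2019-01-02;111;site_a;30;40;2;5\n"
    "2019-01-01;222;site_b;5;5;1;1\n"
)

names = ['DATE', 'IMSI', 'WEBSITE', 'LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION']

# Read the file in chunks and aggregate each chunk separately, keeping
# only the small per-group partial results in memory.
partials = []
for chunk in pd.read_csv(csv_data, sep=';', names=names, chunksize=2):
    partials.append(
        chunk.groupby(['IMSI', 'WEBSITE']).agg(
            DATE_min=('DATE', 'min'),
            DATE_max=('DATE', 'max'),
            LINKUP=('LINKUP', 'sum'),
            LINKDOWN=('LINKDOWN', 'sum'),
            COUNT=('COUNT', 'max'),
            CONNECTION=('CONNECTION', 'sum'),
        )
    )

# Re-aggregate the per-chunk results; min of mins, max of maxes and
# sum of sums give the same answer as one pass over the whole file.
result = pd.concat(partials).groupby(['IMSI', 'WEBSITE']).agg(
    DATE_min=('DATE_min', 'min'),
    DATE_max=('DATE_max', 'max'),
    LINKUP=('LINKUP', 'sum'),
    LINKDOWN=('LINKDOWN', 'sum'),
    COUNT=('COUNT', 'max'),
    CONNECTION=('CONNECTION', 'sum'),
)
```

Note that this only works because every aggregation here is decomposable across chunks; something like a mean or median would need extra bookkeeping.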