Optimize reading from .gz files and CPU utilization in Python
Sample input file:
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202677@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,INVITE,0,10.166.224.34,1518814163,[sip:1202687@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,INVITE,0,10.166.224.34,1518814163,[sip:1202677@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202678@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202687@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
Sample output file:
1202677 REGISTER,INVITE
1202687 INVITE,REGISTER
1202678 REGISTER
Code sample:
filesList=glob.glob("%s/*.gz" %(sys.argv[1]))
for file in filesList:
    try:
        fp = gzip.open(file, 'rb')
        f=fp.readlines()
        fp.close()
        for line in f:
            line = line.split(',')
            if line[0] == '83':
                str=line[5].split("[sip:")
                if len(str) > 1:
                    str=str[1].split("@")
                    if dict.has_key(str[0].strip()):
                        dict[str[0].strip()] = dict.get(str[0].strip())+','+line[1]
                    else:
                        dict[str[0].strip()] = line[1]
    except:
        print "Unexpected Error: ", sys.exc_info()[0]
try:
    with open(sys.argv[2],'w') as s:
        for num in dict:
            print >> s, num,dict[num]
except:
    print "Unexpected error:", sys.exc_info()[0]
When I run the script above on 2.1 GB of data (430 files), loading and processing takes about 13 minutes, and CPU utilization is about 100%.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12586 root 20 0 156m 134m 1808 R 99.8 0.2 0:40.17 script
Please tell me how I can optimize the code above to reduce the execution time. Thanks.
Try pandas. If that is still too slow, you can use a tool such as dask.dataframe, which may improve efficiency further.
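import glob, sys
import pandas as pd

files = glob.glob("%s/*.gz" % sys.argv[1])  # pandas infers gzip from the .gz extension
# Read only the method (column 1) and the caller URI (column 5) from each file.
df = pd.concat([pd.read_csv(f, header=None, usecols=[1, 5]) for f in files])
# "[sip:1202677@mobile.com]" -> "1202677"
df[5] = df[5].str.split(':|@').apply(lambda x: x[1])
result = df.groupby(5)[1].apply(list)
# 5
# 1202677    [REGISTER, INVITE]
# 1202678             [REGISTER]
# 1202687    [INVITE, REGISTER]
# Name: 1, dtype: object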
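The snippet above keeps every row and returns Python lists, whereas the question only processes rows whose first field is 83 and writes the methods as a comma-joined string. A minimal sketch of how those two details might be added is shown here. The helper name `summarize` and the `number` column are hypothetical, and the question's sample input is fed through `io.StringIO` in place of real .gz files so the example is self-contained.

```python
import io

import pandas as pd


def summarize(frames):
    """Group request methods by caller number, keeping only rows whose first field is 83."""
    df = pd.concat(frames, ignore_index=True)
    df = df[df[0].astype(str) == "83"]
    # Capture the text between "[sip:" and "@"; rows without that pattern become NaN.
    number = df[5].str.extract(r"\[sip:([^@]+)@", expand=False).str.strip()
    df = df.assign(number=number).dropna(subset=[1, "number"])
    # sort=False keeps the numbers in order of first appearance.
    return df.groupby("number", sort=False)[1].agg(",".join)


sample = """\
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202677@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,INVITE,0,10.166.224.34,1518814163,[sip:1202687@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,INVITE,0,10.166.224.34,1518814163,[sip:1202677@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202678@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202687@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
"""

# Column 0 is read as well so the "83" filter can be applied.
frames = [pd.read_csv(io.StringIO(sample), header=None, usecols=[0, 1, 5])]
result = summarize(frames)
for number, methods in result.items():
    print(number, methods)
```

With real data, `frames` would instead be built by calling `pd.read_csv(path, header=None, usecols=[0, 1, 5])` on each path returned by `glob.glob`, and the loop would write to the output file rather than print.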
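If pandas is not an option, a standard-library approach is another possibility. This is a different technique from the pandas suggestion, not part of it. The sketch below, written for Python 3, streams each gzip file line by line instead of calling `readlines()`, collects the methods in lists instead of concatenating strings, and can optionally spread files over several processes, since the question shows one core at about 100%. The function names `parse_file`, `merge`, and `process` are hypothetical, and the UTF-8 encoding is an assumption about the data.

```python
import gzip
import os
import tempfile
from collections import defaultdict
from multiprocessing import Pool


def parse_file(path):
    """Return {number: [methods]} for one gzip file, reading it line by line."""
    found = defaultdict(list)
    # Assumption: the files are UTF-8 (or ASCII) text; undecodable bytes are replaced.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fp:
        for line in fp:
            fields = line.split(",")
            if len(fields) < 6 or fields[0] != "83":
                continue
            _, sep, rest = fields[5].partition("[sip:")
            if not sep:
                continue  # no "[sip:" marker in the caller field
            number = rest.split("@", 1)[0].strip()
            found[number].append(fields[1])
    return dict(found)


def merge(partials):
    """Combine per-file dictionaries into one {number: [methods]} mapping."""
    merged = defaultdict(list)
    for partial in partials:
        for number, methods in partial.items():
            merged[number].extend(methods)
    return merged


def process(paths, workers=None):
    """Parse files in parallel; workers=None uses one process per CPU."""
    with Pool(workers) as pool:
        return merge(pool.imap(parse_file, paths))


# Small serial demonstration on the question's sample input.
sample = """\
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202677@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,INVITE,0,10.166.224.34,1518814163,[sip:1202687@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,INVITE,0,10.166.224.34,1518814163,[sip:1202677@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202678@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202687@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fp:
        fp.write(sample)
    result = merge(map(parse_file, [path]))

for number, methods in result.items():
    print(number, ",".join(methods))
```

For the full data set, `process(glob.glob(...))` would replace the serial `merge(map(...))` call, and it should be invoked under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Whether the parallel version actually helps depends on the number of cores and on disk throughput, so it is worth timing both variants.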