Python script killed without error
I am running a script that downloads xls files which contain html tags and strips the tags out to create a clean csv file.
Code:
#!/usr/bin/env python
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
import sys
#from pympler.asizeof import asizeof
from pympler import muppy
from pympler import summary
f = urlopen('http://localhost/Classes/sample.xls') #This is 75KB
#f = urlopen('http://supplier.com/xmlfeed/products.xls') #This is 75MB
soup = BeautifulSoup(f)
stable = soup.find('table')
print 'table found'
rows = []
for row in stable.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('th')])
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])
#print sys.getsizeof(rows)
#print asizeof(rows)
print 'row list created'
soup.decompose()
print 'soup decomposed'
f.close()
print 'file closed'
with open('output_file.csv', 'wb') as file:
    writer = csv.writer(file)
    print 'writer started'
    #writer.writerow(headers)
    writer.writerows(row for row in rows if row)
all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)
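The above code works perfectly for the 75KB file, but for the 75MB file the process gets killed without any error.
I am new to Beautiful Soup and Python, so please help me identify the problem. The script is running on a machine with 3GB of RAM.
The output for the small file is: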
table found
row list created
soup decomposed
file closed
writer started
types | # objects | total size
===================================== | =========== | ============
dict | 5615 | 4.56 MB
str | 8457 | 713.23 KB
list | 3525 | 375.51 KB
<class 'bs4.element.NavigableString | 1810 | 335.76 KB
code | 1874 | 234.25 KB
<class 'bs4.element.Tag | 3097 | 193.56 KB
unicode | 3102 | 182.65 KB
type | 137 | 120.95 KB
wrapper_descriptor | 1060 | 82.81 KB
builtin_function_or_method | 718 | 50.48 KB
method_descriptor | 580 | 40.78 KB
weakref | 416 | 35.75 KB
set | 137 | 35.04 KB
tuple | 431 | 31.56 KB
<class 'abc.ABCMeta | 20 | 17.66 KB
I don't understand what "dict" refers to here, or why it takes up so much more memory than everything else for a 75KB file.
Thank you,
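It is hard to say without having the actual file to work with, but what you can do is avoid creating the intermediate list of rows and write directly to the open csv file.
Also, you can let BeautifulSoup use lxml.html under the hood (lxml must be installed).
Improved code: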
#!/usr/bin/env python
from urllib2 import urlopen
import csv
from bs4 import BeautifulSoup
f = urlopen('http://localhost/Classes/sample.xls')
soup = BeautifulSoup(f, 'lxml')  # requires lxml to be installed
with open('output_file.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in soup.select('table tr'):
        # write header cells and data cells as separate csv rows, as in the question
        for tag in ('th', 'td'):
            cells = [val.text.encode('utf8') for val in row.find_all(tag)]
            if cells:  # skip rows that have no cells of this kind
                writer.writerow(cells)
soup.decompose()
f.close()
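If the 75MB file still exhausts memory with the code above, the remaining cost is most likely the parse tree itself, since BeautifulSoup builds the whole document in memory before any row is written. The following is only a rough sketch of a different technique, not part of the answer above: it uses the standard library's html.parser on Python 3 (the code in this thread is Python 2) to write each row as soon as it has been read, so no tree is ever built. The names TableToCsv, html_table_to_csv and chunk_size are made up for illustration, and the sketch assumes a single, non-nested table.

```python
import csv
from html.parser import HTMLParser


class TableToCsv(HTMLParser):
    """Hypothetical helper: writes every <tr> to a csv writer as it is parsed."""

    def __init__(self, writer):
        super().__init__(convert_charrefs=True)
        self.writer = writer
        self.row = None   # cells of the row being read, None outside a row
        self.cell = None  # text fragments of the open cell, None outside a cell

    def _end_cell(self):
        if self.cell is not None:
            self.row.append(''.join(self.cell).strip())
            self.cell = None

    def _end_row(self):
        self._end_cell()
        if self.row:  # skip rows without any cells
            self.writer.writerow(self.row)
        self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._end_row()  # also closes a previous row left unclosed
            self.row = []
        elif tag in ('th', 'td') and self.row is not None:
            self._end_cell()  # also closes a previous cell left unclosed
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self._end_cell()
        elif tag in ('tr', 'table'):
            self._end_row()


def html_table_to_csv(source, out, chunk_size=64 * 1024):
    """Read text from `source` in chunks and write csv rows to `out`."""
    parser = TableToCsv(csv.writer(out))
    for chunk in iter(lambda: source.read(chunk_size), ''):
        parser.feed(chunk)
    parser.close()
```

To try it, save the download to disk first, open it in text mode with the encoding the supplier actually uses, open the output file with newline='' as the csv module expects on Python 3, and pass both file objects to html_table_to_csv. Unlike the code above, this puts th and td cells of the same tr into one csv row.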