No such file or directory error (Python)
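In my code I set the path for the txt files to the script's directory, but for some reason, after the program has written txt files for some of the links, it throws this error: "FileNotFoundError: [Errno 2] No such file or directory:". I don't understand why some links work while others seem unable to find the directory.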
from lxml import html
import requests, os.path

spath = os.path.dirname(__file__) ## finds path of script
main_pg = requests.get("http://www.nytimes.com/") ## input site here
with open(os.path.join(spath, "Main.txt"), "w", encoding='utf-8') as doc:
    doc.write(main_pg.text)
tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[starts-with(@href, "http:") or starts-with(@href,"https:") or starts-with(@href,"ftp:")]/@href') ## To avoid non-absolute hrefs
for href in hrefs:
    link_pg = requests.get(href)
    tree2 = html.fromstring(link_pg.content)
    doc_title = tree2.xpath('//html/head/title/text()') ## selects title of text from each link
    with open(os.path.join(spath, "%s.txt" % doc_title), "w", encoding='utf-8') as href_doc:
        href_doc.write(link_pg.text)
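I found a few errors. Above all, you need to sanitize the file name before using it. doc_title is a list (that is what xpath returns), so formatting it directly produces an invalid file name; use the join function to get a string from the list. Once you have the string, remove the invalid file-name characters from it and use the result as the file name.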
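To make the two problems concrete, here is a small sketch. The title value and the "out" directory are made up for illustration; they only stand in for whatever xpath returns for a given link.

```python
import os

# Hypothetical stand-in for the list that tree2.xpath(...) returns.
doc_title = ["Politics / World News - The New York Times"]

# Formatting the list itself puts the brackets and quotes into the name.
raw_name = "%s.txt" % doc_title

# Joining the list first gives the plain title text.
joined_name = "%s.txt" % "".join(doc_title)

# Even after joining, the "/" in the title is treated as a path separator,
# so open() would look for a sub-directory that does not exist.
parent = os.path.dirname(os.path.join("out", joined_name))

print(raw_name)
print(joined_name)
print(parent)
```

This would explain why only some links fail: titles without a separator character give a usable (if ugly) name, while titles containing one point into a directory that was never created.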
Try the following (Python 2.7):
import os, sys, codecs, re
import requests
from lxml import html

spath = os.path.dirname(__file__) ## finds path of script
#spath = os.path.dirname(sys.argv[0]) ## or use this
main_pg = requests.get("http://www.nytimes.com/") ## input site here
with codecs.open(os.path.join(spath, "Main.txt"), "w", encoding='utf-8') as doc:
    doc.write(main_pg.text)
tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[starts-with(@href, "http:") or starts-with(@href,"https:") or starts-with(@href,"ftp:")]/@href') ## To avoid non-absolute hrefs
for href in hrefs:
    link_pg = requests.get(href)
    tree2 = html.fromstring(link_pg.content)
    doc_title = tree2.xpath('//html/head/title/text()') ## selects title of text from each link
    # Now remove invalid characters from the file name - for invalid chars see https://en.wikipedia.org/wiki/Filename#Reserved%5Fcharacters%5Fand%5Fwords
    # The character class covers \ / ? % * : | " < > ; join turns the xpath list into one string.
    # Note: the ur'' prefix is Python 2 only; under Python 3 write r'' instead.
    file_name = re.sub(ur'[\\/?%*:|"<>]', ur'', ''.join(doc_title))
    with codecs.open(os.path.join(spath, "%s.txt" % file_name), "w", encoding='utf-8') as href_doc:
        href_doc.write(link_pg.text)
I just used a regex to remove the invalid file-name characters; you could use the replace function instead. For details on the regular expression I used, see the LIVE DEMO.
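Since the question is running Python 3, here is a minimal sketch of the same clean-up in Python 3, once with a regex and once with replace. The function names, the fallback value, and the exact set of characters are my own assumptions for illustration, not part of any library.

```python
import re

# Assumed set of characters to strip: \ / ? % * : | " < >
INVALID_CHARS = re.compile(r'[\\/?%*:|"<>]')


def safe_file_name(title_parts, fallback="untitled"):
    """Join the xpath result into one string and drop reserved characters."""
    title = "".join(title_parts)
    cleaned = INVALID_CHARS.sub("", title)
    # Collapse runs of whitespace (titles often contain newlines or tabs).
    cleaned = " ".join(cleaned.split())
    # Pages without a <title> give an empty list, hence the fallback.
    return cleaned or fallback


def safe_file_name_replace(title_parts, fallback="untitled"):
    """Same idea using str.replace instead of a regex."""
    title = "".join(title_parts)
    for ch in '\\/?%*:|"<>':
        title = title.replace(ch, "")
    title = " ".join(title.split())
    return title or fallback


print(safe_file_name(["Politics / World: News?"]))
print(safe_file_name_replace(["Politics / World: News?"]))
```

In the loop you would then build the path with os.path.join(spath, safe_file_name(doc_title) + ".txt"). Note that two pages with the same title would still overwrite each other; handling that is outside this sketch.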