How to save multiple outputs to multiple files, where each file's name comes from an object in Python?
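I am scraping an RSS feed from a website (http://www.gfrvitale.altervista.org/index.php/autismo-in?format=feed&type=rss).
I have written a script that extracts and cleans the text of each feed item. My main problem is saving the text of each item in a separate file; I also need to name each file after the title extracted from that item.
My code is: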
for item in myFeed["items"]:
    time_structure = item["published_parsed"]
    dt = datetime.fromtimestamp(mktime(time_structure))
    if dt > t:
        link = item["link"]
        response = requests.get(link)
        doc = Document(response.text)
        doc.summary(html_partial=False)
        # extracting text
        h = html2text.HTML2Text()
        # converting
        h.ignore_links = True  # ignore links
        h.skip_internal_links = True  # skip internal links
        h.inline_links = True
        h.ignore_images = True  # ignore image links
        h.ignore_emphasis = True
        h.ignore_anchors = True
        h.ignore_tables = True
        testo = h.handle(doc.summary())  # extracted text
        s = doc.title() + "." + " " + testo  # content to write to the final file
        tit = item["title"]
        # save each file with its proper title
        with codecs.open("testo_%s", %tit "w", encoding="utf-8") as f:
            f.write(s)
            f.close()
The error is:
  File "<ipython-input-57-cd683dec157f>", line 34
    with codecs.open("testo_%s", %tit "w", encoding="utf-8") as f:
                                 ^
SyntaxError: invalid syntax
The comma needs to come after %tit, not before it.
It should be:
# save each file with its proper title
with codecs.open("testo_%s" % tit, "w", encoding="utf-8") as f:
    f.write(s)
    f.close()
However, if the file name contains invalid characters, this will raise an error (e.g. [Errno 22]).
You can try this code:
...
tit = item["title"]
# Not the best way, but it could help for now
# (it would be better to create a list of stop characters)
tit = tit.replace(' ', '').replace("'", "").replace('?', '')
with codecs.open("testo_%s" % tit, "w", encoding="utf-8") as f:
    f.write(s)
    f.close()
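The "list of stop characters" idea mentioned in the comment above could look roughly like the sketch below. The helper name `strip_invalid_chars`, the `INVALID_CHARS` set (the characters Windows rejects in file names), and the sample title are assumptions for illustration, not part of the original script.

```python
# Hypothetical helper: the character set below is an assumption,
# chosen to match the characters Windows rejects in file names.
INVALID_CHARS = '<>:"/\\|?*'


def strip_invalid_chars(title):
    """Drop characters that commonly trigger [Errno 22] in file names."""
    cleaned = "".join(
        ch for ch in title
        if ch not in INVALID_CHARS and ord(ch) >= 32  # also drop control chars
    )
    # Windows also rejects names that end in a dot or a space.
    return cleaned.strip(" .")


# Made-up title standing in for item["title"].
print(strip_invalid_chars("Autismo: che cos'è?"))  # prints: Autismo che cos'è
```

Unlike the chained `replace` calls, this keeps spaces and apostrophes, which are valid in file names on the common platforms; extend the set if your file system is stricter.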
Another way, using nltk:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tit = item["title"]
tit = tokenizer.tokenize(tit)
tit = ''.join(tit)
with codecs.open("testo_%s" % tit, "w", encoding="utf-8") as f:
    f.write(s)
    f.close()
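If installing nltk only for this step feels heavy, the standard-library `re` module should give the same result for the `\w+` pattern. This is a minimal sketch; the function name `title_to_token_string` and the sample title are made up for illustration.

```python
import re


def title_to_token_string(title):
    """Keep only runs of word characters and join them together."""
    # re.findall returns every non-overlapping match of \w+ as a list of strings.
    return "".join(re.findall(r"\w+", title))


# Made-up title standing in for item["title"].
print(title_to_token_string("Autismo: che cos'è?"))  # prints: Autismochecosè
```

In Python 3, `\w` matches Unicode letters by default, so accented characters in Italian titles are kept.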
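First, you misplaced the comma: it should come after %tit, not before it.
Second, you don't need to close the file, because the with statement you are using closes it for you automatically. Where does codecs come from? I don't see it anywhere else... Anyway, the correct with statement should be: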
with open("testo_%s" % tit, "w", encoding="utf-8") as f:
    f.write(s)
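Putting the pieces together, one way to write every item to its own title-named file might look like the sketch below. It assumes Python 3 (where the built-in `open` accepts `encoding`); the helper names `safe_name` and `save_items`, the `.txt` extension, and the sample data are all assumptions standing in for the `item["title"]` and `s` values built in the feed loop.

```python
import re
import tempfile
from pathlib import Path


def safe_name(title):
    """Replace every run of non-word characters with a single underscore."""
    return re.sub(r"\W+", "_", title).strip("_") or "untitled"


def save_items(items, out_dir):
    """Write each (title, text) pair to its own UTF-8 file; return the paths."""
    out_dir = Path(out_dir)
    written = []
    for title, text in items:
        path = out_dir / ("testo_%s.txt" % safe_name(title))
        # The with statement closes the file, so no explicit f.close() is needed.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        written.append(path)
    return written


# Made-up sample data standing in for (item["title"], s) from the feed loop.
sample = [("Primo articolo?", "Testo uno."), ("Secondo: articolo", "Testo due.")]
with tempfile.TemporaryDirectory() as tmp:
    for p in save_items(sample, tmp):
        print(p.name)
```

Note that two titles that clean up to the same name would overwrite each other in this sketch; appending the publication date or a counter to the name is one way around that.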