Parsing a txt file and scraping an image from each link on each line, with Python
I'm trying to open a txt file that has an http link on each line, have Python go to each link, find a specific image, and then print out a direct link to that image for every page listed in the txt file.
However, I have no idea what I'm doing. (I started Python a few days ago.)
Here is my current code, which doesn't work...
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
txt = open('links.txt').read().splitlines()
page = urlopen(txt)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
Update 1:
OK, to be a bit more specific about what I need: I have a script that prints a lot of links into a txt file, each link on its own line, i.e.
http://link.com/1
http://link.com/2
etc.
What I'm trying to accomplish now is to open that text file containing those links, run the regex I already posted against each page (link.com/1 and so on), and print the image links it finds to another text file, which should look something like
http://link.com/1/image.jpg
http://link.com/2/image.jpg
etc.
After that, I don't need any further help, since I already have a Python script that downloads the images from that txt file.
Update 2: Basically, all I need is this script:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = 'http://staff.tumblr.com'
page = urlopen(url)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
but instead of looking for one specific url in the url variable, it would crawl all of the urls in a text file I specify, and then print out the results.
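A minimal Python 3 sketch of that loop might look like the following. The filenames links.txt and image_links.txt are assumptions on my part, and the regex is the one from the question with the character after src made an explicit `="`:

```python
import re
from urllib.request import urlopen

# Regex from the question, anchored on src="..." and matching tumblr jpg URLs
IMG_RE = re.compile(r'src="(\S*?media\.tumblr\S*?tumblr_\S*?\.jpg)"')

def extract_image_links(html):
    """Return every tumblr image URL found in an HTML string."""
    return IMG_RE.findall(html)

def scrape_all(in_path='links.txt', out_path='image_links.txt'):
    """Fetch each URL listed in in_path and write found image links to out_path."""
    with open(in_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(out_path, 'w') as out:
        for url in urls:
            html = urlopen(url).read().decode('utf-8', errors='replace')
            for link in extract_image_links(html):
                out.write(link + '\n')

# scrape_all()  # uncomment to run against a real links.txt
```

The original code failed because it passed the whole list of lines to urlopen at once; fetching one URL per loop iteration, as above, is the fix.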
I suggest you use a Scrapy spider.
Here is an example:
from scrapy import log
from scrapy.item import Item
from scrapy.http import Request
from scrapy.contrib.spiders import XMLFeedSpider

def NextURL():
    # Generator that yields one URL per line of the file
    with open("URLFilename") as f:
        for line in f:
            yield line.strip()

class YourScrapingSpider(XMLFeedSpider):

    name = "imagespider"
    allowed_domains = []
    url = NextURL()
    start_urls = []

    def start_requests(self):
        start_url = self.url.next()
        request = Request(start_url, dont_filter=True)
        yield request

    def parse(self, response, node):
        scraped_item = Item()
        yield scraped_item

        # Queue up the next URL from the file
        next_url = self.url.next()
        yield Request(next_url)
I am creating the spider so that it reads the URLs from the file, makes the requests and downloads the images.
For this we have to use the ImagesPipeline.
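A sketch of the wiring that ImagesPipeline needs, using the import path from current Scrapy versions (the project file names and the IMAGES_STORE directory are my own choices):

```python
# settings.py -- enable the built-in images pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'downloaded_images'   # directory where images get saved

# items.py -- ImagesPipeline looks for these two fields on each item
import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()  # URLs the pipeline should download
    images = scrapy.Field()      # filled in by the pipeline after download
```

The spider then yields an ImageItem whose image_urls field holds the scraped links, and the pipeline downloads them automatically.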
It will be difficult at the start, but I suggest you learn Scrapy. Scrapy is a web crawling framework in Python.
Update:
import re
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)
    print(soup)

    for tag in soup.findAll('img'):
        print(tag)

def main():
    url = "https://www.organicfacts.net/health-benefits/fruit/health-benefits-of-grapes.html"
    process(url)

if __name__ == "__main__":
    main()
Output:
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1430-35x35.jpg" title="Coconut Oil for Skin" alt="Coconut Oil for Skin" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1427-35x35.jpg" title="Coconut Oil for Hair" alt="Coconut Oil for Hair" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/335-35x35.jpg" title="Health Benefits of Cranberry Juice" alt="Health Benefits of Cranberry Juice" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/59-35x35.jpg"
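If you want only the direct image links rather than the whole tags, one way is to collect just the src attributes. This is a small sketch using the stdlib html.parser, so it also works without BeautifulSoup installed:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to it."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)

def image_sources(html):
    """Return the src of every <img> tag in an HTML string, in order."""
    parser = ImgSrcCollector()
    parser.feed(html)
    return parser.sources
```

Writing each returned link on its own line then gives exactly the output file the question describes.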
Update 2:
with open(the_filename, 'w') as f:
    for s in image_links:
        f.write(s + '\n')