Parsing txt file, to web scrape an image from each link on each line, with python

I'm trying to open a txt file that has an http link on each line, then have Python go to each link, find a specific image, and print out a direct link to that image for every page listed in the txt file.

However, I have no idea what I'm doing (I started Python a few days ago).

Here's my current code, which doesn't work...

from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

txt = open('links.txt').read().splitlines()
page = urlopen(txt)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links

Update 1:

OK, here's a bit more detail on what I need. I have a script that prints a lot of links into a txt file, each link on its own line, i.e.

http://link.com/1
http://link.com/2
etc.
etc.

What I'm trying to accomplish now is to open that text file containing those links, run the regex I already posted against each page, and print the image links it finds on link.com/1 etc. into another text file, which should then look something like

http://link.com/1/image.jpg
http://link.com/2/image.jpg

etc.

After that I won't need any more help, since I already have a Python script that downloads the images from that txt file.

Update 2: Basically, what I need is this script

from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

url = 'http://staff.tumblr.com'
page = urlopen(url)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)

print image_links

but instead of looking for one specific url in the url variable, it would crawl all the urls in a text file I specify and then print out the results.

I suggest you use a Scrapy spider.

Here is an example:

from scrapy.item import Item
from scrapy.http import Request
from scrapy.contrib.spiders import XMLFeedSpider


def NextURL():
    """Yield the URLs from the file one at a time."""
    with open("URLFilename") as f:
        for line in f:
            yield line.strip()


class YourScrapingSpider(XMLFeedSpider):

    name = "imagespider"

    allowed_domains = []

    url = NextURL()

    start_urls = []

    def start_requests(self):
        # Take the first URL from the file to kick things off.
        start_url = next(self.url)
        request = Request(start_url, dont_filter=True)
        yield request

    def parse(self, response, node):
        scraped_item = Item()
        yield scraped_item
        # Queue up the next URL from the file.
        next_url = next(self.url)
        yield Request(next_url)

Here I'm creating a spider that reads the URLs from the file, makes a request for each one, and downloads the images.

For that we have to use ImagesPipeline.
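As a rough sketch of the wiring (the pipeline path is the standard Scrapy one; the `IMAGES_STORE` value is just an example directory, and your item would additionally need the conventional `image_urls`/`images` fields), the project settings fragment looks like:

```python
# settings.py fragment: enable the built-in image-downloading pipeline.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are stored (example path, adjust as needed).
IMAGES_STORE = "downloaded_images"
```

With this enabled, any item yielded with a populated `image_urls` field gets its images fetched and saved under `IMAGES_STORE` automatically.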

It will be somewhat difficult in the beginning, but I recommend learning Scrapy. Scrapy is a web crawling framework in Python.

Update:

import urllib
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)

    # Print every <img> tag found on the page.
    for tag in soup.findAll('img'):
        print(tag)
# process(url)

def main():
    url = "https://www.organicfacts.net/health-benefits/fruit/health-benefits-of-grapes.html"
    process(url)


if __name__ == "__main__":
    main()

Output:

<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1430-35x35.jpg" title="Coconut Oil for Skin" alt="Coconut Oil for Skin" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1427-35x35.jpg" title="Coconut Oil for Hair" alt="Coconut Oil for Hair" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/335-35x35.jpg" title="Health Benefits of Cranberry Juice" alt="Health Benefits of Cranberry Juice" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/59-35x35.jpg"
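If you only need the src attributes rather than the whole tags, here is a dependency-free variant using the standard library's `html.parser` instead of BeautifulSoup (the class and function names are mine, just for illustration):

```python
from html.parser import HTMLParser  # Python 3 standard library


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.image_links.append(value)


def collect_img_srcs(html):
    """Return the list of <img> src values found in an HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    return parser.image_links
```

The resulting list can then be written out one link per line, as in the snippet in Update 2 below.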

Update 2:

with open(the_filename, 'w') as f:
    for s in image_links:
        f.write(s + '\n')
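Putting the pieces together for the original question, a minimal sketch in Python 3 (so `urllib2.urlopen` becomes `urllib.request.urlopen`; the file names and the cleaned-up regex below are assumptions based on the snippets above):

```python
import re
from urllib.request import urlopen

# Cleaned-up variant of the regex from the question: capture tumblr
# media jpg URLs out of src="..." attributes.
TUMBLR_IMG = re.compile(r'src="(\S*?media\.tumblr\S*?tumblr_\S*?\.jpg)"')


def extract_image_links(html):
    """Return every matching image URL found in an HTML string."""
    return TUMBLR_IMG.findall(html)


def scrape_links(in_path='links.txt', out_path='image_links.txt'):
    """Read one page URL per line, write every found image URL per line."""
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            url = line.strip()
            if not url:
                continue
            html = urlopen(url).read().decode('utf-8', errors='replace')
            for image_link in extract_image_links(html):
                outfile.write(image_link + '\n')
```

The key fix over the code in the question is that each URL is opened individually inside the loop, rather than passing the whole list of lines to `urlopen` at once.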