How can I get a list of image URLs from a Markdown file in Python?

Question

I'm looking for something like this:

data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''

print(get_images_url_from_markdown(data))

which returns a list of the image URLs in the text:

['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']

Is there anything available, or do I have to scrape the Markdown myself with BeautifulSoup?

Answer 1

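Python-Markdown has an extensive Extension API. In fact, the Table of Contents extension does essentially what you want, only with headings instead of images, plus a bunch of other things you don't need (like adding unique id attributes and building a nested list for the TOC).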

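After the document is parsed, it is held in an ElementTree object, and you can use a treeprocessor to extract the data you want before the tree is serialized to text. Be aware that if you have included any images as raw HTML, this will fail to find those images (in that case you would need to parse the HTML output and extract them from there).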
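For the raw-HTML case, one possible approach is to scan the converted HTML rather than the tree. The following is a minimal sketch using only the standard library's html.parser; the names ImgSrcCollector and extract_img_srcs are made up for illustration, and it assumes you pass in the HTML string that md.convert() returned.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.images.append(src)


def extract_img_srcs(html):
    """Return the src of each <img> in an HTML string, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.images
```

Because this works on the serialized output, it should pick up images written in Markdown syntax and images written as raw HTML alike, at the cost of parsing the document a second time.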

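Follow this tutorial to get started, except that you will need to create a treeprocessor rather than an inline Pattern. You should end up with something like this: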

import markdown
from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension

# First create the treeprocessor

class ImgExtractor(Treeprocessor):
    def run(self, doc):
        "Find all images and store their URLs on the Markdown instance as md.images."
        self.md.images = []
        for image in doc.findall('.//img'):
            self.md.images.append(image.get('src'))

# Then tell markdown about it

class ImgExtExtension(Extension):
    def extendMarkdown(self, md):
        img_ext = ImgExtractor(md)
        # Python-Markdown 3.x registry API: priority 15 runs this after the
        # built-in 'inline' treeprocessor (priority 20), which creates the img elements
        md.treeprocessors.register(img_ext, 'imgext', 15)

# Finally create an instance of the Markdown class with the new extension

md = markdown.Markdown(extensions=[ImgExtExtension()])

# Now let's test it out:

data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
html = md.convert(data)
print(md.images)

The above outputs:

['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']

If you really want a function which returns that list, just wrap all of that up in one and you're good to go.

python

markdown