Links in Python

Question

I have a regular expression that works for finding hyperlinks in a page's source.

When I run this code:

import sys,re
import webpage_get

def print_links(page):

    print '[+] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+',page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'
    a = open(r'C:\Users\noh\Desktop\ApplicationDevelopment\Second Course work\result.txt','w')
    for link in links:
        a.write(link)
    a.close()

def main():
    sys.argv.append('http://socrdlvideo.napier.ac.uk/~csn11118/CSN08115/index.html')
##    sys.argv.append('http://www.napier.ac.uk/Pages/home.aspx')

    if len(sys.argv) != 2:
        print '[-] usage: webpage_getlinks URL'
        return

    page = webpage_get.wget(sys.argv[1])
    print_links(page)

if __name__ == '__main__':
    main()

the result looks something like this:

href="http://www.rottentomatoes.com/m/star_wars/trailer/">Star Wars Trailer</a>

What I actually need is just the link itself, without the extra text on either side, for example:

http://www.rottentomatoes.com/m/star_wars/trailer/

It would be great if you could tell me how to strip the extra text from both sides.

Answer 1

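The problem is that your pattern has no capturing group, so re.findall returns the whole match instead of just the URL.

Use the pattern href\=.*(http\:.+)\" in place of <a.*href\=.*http\:.+

You can try the pattern out here: https://regex101.com/r/WT1AQ7/1

PS: Use () to group the part you actually want.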
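To make the effect of the parentheses concrete, here is a minimal sketch. The sample lines are made up for illustration (the second one adds a rel attribute that is not in the question), and the last pattern is a non-greedy variant rather than the exact pattern from this answer:

```python
import re

# Sample input lines, assumed for illustration only.
line = '<a href="http://www.rottentomatoes.com/m/star_wars/trailer/">Star Wars Trailer</a>'
line2 = '<a href="http://example.com/" rel="license">attribution</a>'

# No group: findall returns the entire matched text.
whole = re.findall(r'<a.*href=.*http:.+', line)

# One group: findall returns only what the parentheses captured.
greedy = re.findall(r'href=.*(http:.+)"', line)

# Greedy .+ runs to the LAST double quote on the line, so a second
# attribute after href ends up inside the capture.
greedy2 = re.findall(r'href=.*(http:.+)"', line2)

# Non-greedy .+? stops at the first closing quote instead.
lazy2 = re.findall(r'href="(http:.+?)"', line2)

print(whole)
print(greedy)
print(greedy2)
print(lazy2)
```

With a single attribute the greedy pattern happens to work, but once the tag carries more quoted attributes the non-greedy form is the safer choice.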

Answer 2

re.findall(r'href="(http.+?)"', string=your_string)

Use () to capture the part you need.

import re

# Sample HTML fragment containing two links
html = '''
 href="https://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a> 
                with <a href="http://blog.whosebug.com/2009/06/attribution-required/" rel="license">attribution required</a> '''
# findall returns only the text captured by the parentheses
matches = re.findall(r'href="(http.+?)"', string=html)
for link in matches:
    print(link)

Output:

https://creativecommons.org/licenses/by-sa/3.0/
http://blog.whosebug.com/2009/06/attribution-required/

The code works!! Please learn regular expressions instead of just copying them!
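As a rough sketch of how this pattern could slot into the question's code: the function names extract_links and write_links below are hypothetical, and the code is written for Python 3 rather than the Python 2 used in the question. It also writes one link per line, which the original loop did not do.

```python
import re

def extract_links(page):
    """Return the sorted http/https URLs found in href="..." attributes."""
    return sorted(re.findall(r'href="(http.+?)"', page))

def write_links(links, path):
    """Write one link per line to the file at path."""
    with open(path, 'w') as out:
        for link in links:
            out.write(link + '\n')

# Made-up page source for illustration.
sample = ('<a href="http://b.example/">B</a> '
          '<a href="https://a.example/page">A</a>')
links = extract_links(sample)
print('[+]', len(links), 'HyperLinks Found:')
for link in links:
    print(link)
```

Keeping the extraction separate from the file writing makes it easy to check the regex on a small string before pointing it at a real page.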

Answer 3

Try this regular expression:

(?<=href=\")http.*?\/(?=\")

Or this one:

http.*?\/(?=\")

Demo: https://regex101.com/r/YXw2y5/1

So, in your code, change this line:

links = re.findall(r'\<a.*href\=.*http\:.+',page)

to this:

links = re.findall(r'(?<=href=\")http.*?\/(?=\")',page)
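One thing worth checking before relying on the lookaround version: it only matches URLs whose last character before the closing quote is a slash. The sketch below uses a made-up two-line page to show this, and compares it with a capture-group pattern (an alternative, not part of this answer) that accepts any characters up to the closing quote:

```python
import re

# Made-up page source: the first URL ends in "/", the second does not.
page = (
    '<a href="http://www.rottentomatoes.com/m/star_wars/trailer/">Star Wars Trailer</a>\n'
    '<a href="http://example.com/index.html">Home</a>\n'
)

# Lookbehind/lookahead: the match itself is the URL, but it must end in "/".
lookaround = re.findall(r'(?<=href=")http.*?/(?=")', page)

# Capture group: everything from http up to the next double quote.
captured = re.findall(r'href="(http[^"]+)"', page)

print(lookaround)
print(captured)
```

If every link on the target page ends with a slash, the two give the same result; otherwise the capture-group form finds links the lookaround form skips.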
