Python: script is not writing the links from variable
Below is my script...
I think I'm missing a line of code to make it work properly. I'm using Reddit as a test source to scrape sports links.
# import libraries
import bs4
from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.reddit.com/r/BoxingStreams/comments/6w2vdu/mayweather_vs_mcgregor_archive_footage/'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
hyperli = page_soup.findAll("form")
filename = "sportstreams.csv"
f = open(filename, "w")
headers = "Sport Links\n"
f.write(headers)

for containli in hyperli:
    link = containli.a["href"]
    print(link)
    f.write(str(link) + '\n')

f.close()
Everything works, except that it only grabs the link from the first row [0]. If I leave out the ["href"] part, it adds all the (a href) links, but it also writes the word None to the CSV file. Using ["href"] should (I hoped) add only the http links and avoid writing the word None.
What am I missing here?
As the documentation explains under Navigating using tag names:

Using a tag name as an attribute will give you only the first tag by that name
...
If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as find_all():
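To see that first-tag behaviour concretely, here is a minimal sketch against a small hand-written HTML snippet (the markup and URLs are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links inside one form.
html = '<form><a href="http://one">1</a><a href="http://two">2</a></form>'
doc = BeautifulSoup(html, "html.parser")

# Attribute-style access returns only the FIRST <a> tag:
print(doc.form.a["href"])                           # http://one
# find_all() returns every <a> tag inside the form:
print([a["href"] for a in doc.form.find_all("a")])  # ['http://one', 'http://two']
```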
In your case, you can use page_soup.select("form a[href]") to find all links inside forms that have an href attribute.
links = page_soup.select("form a[href]")
for link in links:
    href = link["href"]
    print(href)
    f.write(href + "\n")
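Putting it together, here is a self-contained sketch of the fixed loop. It runs against a small invented HTML snippet rather than the live Reddit page so the behaviour is easy to verify (the markup and stream URLs are made up for illustration):

```python
from bs4 import BeautifulSoup

# Invented markup: the middle <form> has no link, which is exactly the
# case that produced a None row when taking containli.a on every form.
html = ('<form><a href="http://stream1">s1</a></form>'
        '<form>no link here</form>'
        '<form><a href="http://stream2">s2</a></form>')

page_soup = BeautifulSoup(html, "html.parser")

# The CSS selector matches only <a> tags that actually carry an href,
# so link-less forms are skipped and no None ever reaches the file.
hrefs = [link["href"] for link in page_soup.select("form a[href]")]

with open("sportstreams.csv", "w") as f:
    f.write("Sport Links\n")
    for href in hrefs:
        f.write(href + "\n")

print(hrefs)  # ['http://stream1', 'http://stream2']
```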