How to iterate through a list and extract text between quotation marks using Python 2.7.10
I'm trying to iterate through a long list (let's call it url_list), where each item looks like this:
<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>,
<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>,
<a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>,
<a href="https://www.example.com/2nd-february-2018/" itemprop="url">2nd February 2018</a>,
and so on. I want to iterate through the list and keep only the text between the first two quotation marks, discarding the rest, i.e.:
https://www.example.com/5th-february-2018/,
https://www.example.com/4th-february-2018/,
https://www.example.com/3rd-february-2018/,
https://www.example.com/2nd-february-2018/,
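So basically I'm trying to return a nice clean list of URLs. I'm not having much luck iterating through the list and splitting on the quotes. Is there a better way? Is there a way to discard everything after the itemprop= string?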
Have you tried using the split function to split at " and then taking the second entry of the resulting list?
urls = []
for url_entry in url_list:
    # Splitting on the double quote puts the href value at index 1.
    url = url_entry.split('"')[1]
    urls.append(url)
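If some entries might not contain a quoted value at all, indexing with [1] would raise an IndexError. Here is a minimal sketch of a more defensive variant. The helper name first_quoted and the sample data (including the entry without quotes) are made up for illustration and are not from the question.

```python
# Hypothetical sample data standing in for the asker's url_list.
url_list = [
    '<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>',
    '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>',
    'an entry with no quotes',
]


def first_quoted(entry):
    """Return the text between the first two double quotes, or None."""
    parts = entry.split('"')
    # Two quotes produce at least three pieces; fewer means no quoted text.
    return parts[1] if len(parts) >= 3 else None


# Keep only the entries that actually contained a quoted value.
urls = [u for u in (first_quoted(e) for e in url_list) if u is not None]
print(urls)
```

This relies on the href attribute being the first quoted attribute in each item, which holds for the examples in the question but may not hold for arbitrary HTML.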
Using a regular expression:
import re
url_list = ['<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>', '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>']
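for i in url_list:
    # The greedy match backtracks to the last "/", so the trailing slash is not part of the captured URL.
    print re.search(r"(?P<url>https?://[^\s]+)/", i).group("url")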
Output:
https://www.example.com/5th-february-2018
https://www.example.com/4th-february-2018
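Note that the output above has lost the trailing slash that the question asked to keep. If you want the exact attribute value, one option is to match the quotes around href instead. This is a sketch under the assumption that every item uses double quotes around the href value. The names HREF_RE and extract_href are mine, not from any library.

```python
import re

# Capture everything between the double quotes that follow href=.
HREF_RE = re.compile(r'href="([^"]*)"')


def extract_href(entry):
    """Return the href value of the first match, or None if there is none."""
    match = HREF_RE.search(entry)
    return match.group(1) if match else None


url_list = [
    '<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>',
    '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>',
]

urls = [extract_href(entry) for entry in url_list]
for url in urls:
    print(url)
```

Because the pattern stops at the closing quote rather than at whitespace, the trailing slash is kept and nothing after itemprop= can leak into the result.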
This sounds a bit like an XY problem.
If you used (or are using) BeautifulSoup to parse your HTML, this becomes much easier:
from bs4 import BeautifulSoup
html_text = '''<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>
<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>
<a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>
<a href="https://www.example.com/2nd-february-2018/" itemprop="url">2nd February 2018</a>'''
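# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup(html_text, "html.parser")
# Collect the href attribute of every <a> tag.
urls = [x['href'] for x in soup.find_all("a")]
for url in urls:
    print(url)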
# https://www.example.com/5th-february-2018/
# https://www.example.com/4th-february-2018/
# https://www.example.com/3rd-february-2018/
# https://www.example.com/2nd-february-2018/
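If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's HTMLParser. This is a swapped-in technique, not what the answer above uses. The class name HrefCollector is hypothetical, and the try/except import is only there so the sketch runs on both Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag that is fed in."""

    def __init__(self):
        # Call the base initializer directly; on Python 2.7 the base is an
        # old-style class, so super() would not work there.
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


html_text = '''<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>
<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>
<a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>
<a href="https://www.example.com/2nd-february-2018/" itemprop="url">2nd February 2018</a>'''

collector = HrefCollector()
collector.feed(html_text)
collector.close()

for url in collector.hrefs:
    print(url)
```

Like the BeautifulSoup version, this reads the attribute from parsed tags instead of counting quotes, so it does not depend on href being the first attribute.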