Trying to extract data under a specific div and sub div
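I'm trying to get this to print the book title and its chapters, but only for one book at a time.
So basically:
"The First Book of Jacob"
Chapters 1-7
rather than iterating through all the books.
Here is the page layout (the URL is included in the Python code):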
<dl>
<dt>Title</dt>
<dd>
<dl>
<dt>Sub Title</dt>
</dl>
</dd>
<dt>Title 2</dt>
<dd>
<dl>
<dt>Sub Title 2</dt>
</dl>
</dd>
</dl>
#this continues for Title 3, Sub title 3, etc etc
Here is the Python code:
import requests
import bs4
scripture_url = 'http://scriptures.nephi.org/docbook/bom/'
response = requests.get(scripture_url)
soup = bs4.BeautifulSoup(response.text)
links = soup.select('dl dd dt')
for item in links:
    # Drop the first space-separated token and keep the rest of the label
    title = str(item.get_text()).split(' ', 1)[1]
    print(title)
Here is the output:
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Chapter 12
Chapter 13
Chapter 14
Chapter 15
Chapter 16
Chapter 17
Chapter 18
Chapter 19
Chapter 20
Chapter 21
Chapter 22
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Chapter 12
Chapter 13
Chapter 14
Chapter 15
Chapter 16
Chapter 17
Chapter 18
Chapter 19
Chapter 20
Chapter 21
Chapter 22
Chapter 23
Chapter 24
Chapter 25
Chapter 26
Chapter 27
Chapter 28
Chapter 29
Chapter 30
Chapter 31
Chapter 32
Chapter 33
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 1
Chapter 1
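Just slice off the last two entries of the list. You can't get much finer-grained control than that, because the HTML tags don't have any id or name attributes.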
links = soup.select('dl dd dt')
for item in links[:-2]:
    title = str(item.get_text()).split(' ', 1)[1]
    print(title)
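To make clear what the slice does, here is a minimal sketch using a hypothetical list of labels in place of the scraped elements. It shows that links[:-2] trims by position only, so it helps only when the unwanted entries happen to be the last two.

```python
# Hypothetical labels standing in for the elements returned by soup.select()
links = ["Chapter 1", "Chapter 2", "Chapter 3", "Chapter 1", "Chapter 1"]

# A negative stop index counts from the end: keep everything except the last two
trimmed = links[:-2]
print(trimmed)
```

The slice knows nothing about which book an entry belongs to, so it cannot pick out the chapters of a book in the middle of the list.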
Assuming you know they will always be the first and second values, you can index into the list:
title = links[0]
subtitle = links[1]
You could try something like this. First, find a book, for example the one titled "The Book of Jacob":
book_title = 'The Book of Jacob'
book = soup.find('a', text=book_title)
print(book.text)
Then select the <dd> that is the direct sibling of the book title, and find all the corresponding chapters inside that <dd> element:
links = book.parent.select('+ dd > dl > dt')
for item in links:
    title = str(item.get_text()).split(' ', 1)[1]
    print(title)
Output:
The Book of Jacob
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
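The same idea can be sketched with explicit navigation instead of a selector that starts with a combinator, which may not be accepted by every Beautiful Soup version. This is only a sketch under assumptions: SAMPLE_HTML is a made-up stand-in that follows the layout in the question (the real page's markup may differ), and chapters_for is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up sample following the <dl>/<dt>/<dd> layout from the question
SAMPLE_HTML = """
<dl>
  <dt><a href="#enos">The Book of Enos</a></dt>
  <dd><dl>
    <dt><a href="#enos1">Chapter 1</a></dt>
  </dl></dd>
  <dt><a href="#jacob">The Book of Jacob</a></dt>
  <dd><dl>
    <dt><a href="#jacob1">Chapter 1</a></dt>
    <dt><a href="#jacob2">Chapter 2</a></dt>
  </dl></dd>
</dl>
"""


def chapters_for(soup, book_title):
    """Return the chapter labels listed under one book (hypothetical helper)."""
    # Locate the link whose text is exactly the book title
    link = soup.find("a", string=book_title)
    if link is None:
        return []
    # Step up to the <dt> that holds the title
    dt = link.find_parent("dt")
    if dt is None:
        return []
    # Look at the next <dt>/<dd> sibling; only a <dd> holds this book's chapters
    nxt = dt.find_next_sibling(["dt", "dd"])
    if nxt is None or nxt.name != "dd":
        return []
    return [chapter.get_text(strip=True) for chapter in nxt.find_all("dt")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
chapters = chapters_for(soup, "The Book of Jacob")
print(chapters)
```

Checking that the next sibling really is a <dd> avoids picking up the chapters of the following book when a title has no chapter list of its own.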