How to return text from HTML without tags using Python and BeautifulSoup?

Question

I have been trying to return text from a website. Specifically, I am trying to return ownerId and unitId from the sample below. Any help is greatly appreciated.

<script>
    h1.config.days = "7";
    h1.config.hours = "24";
    h1.config.color = "blue";
    h1.config.ownerId = 7321;
    h1.config.locationId = 1258;
    h1.config.unitId = "164";
</script>

Answer 1

You can use Beautiful Soup like this:

#!/usr/bin/env python

from bs4 import BeautifulSoup

html = '''
<script>
    h1.config.days = "7";
    h1.config.hours = "24";
    h1.config.color = "blue";
    h1.config.ownerId = 7321;
    h1.config.locationId = 1258;
    h1.config.unitId = "164";
</script>
'''

soup = BeautifulSoup(html, "html.parser")
jsinfo = soup.find("script")

d = {}
for line in jsinfo.text.split('\n'):
    try:
        d[line.split('=')[0].strip().replace('h1.config.','')] = line.split('=')[1].lstrip().rstrip(';')
    except IndexError:
        pass

print('OwnerId:  {}'.format(d['ownerId']))
print('UnitId:   {}'.format(d['unitId']))

This produces the following output:

OwnerId:  7321
UnitId:   "164"

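In the same way, you can access any other variable with d['variable']. Note that string values keep their surrounding quotes, as unitId does above.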
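
The dictionary above stores each value exactly as it appears in the JavaScript, so unitId still carries its double quotes. Below is a minimal sketch of one way to strip them, assuming every assignment sits on its own line in the form h1.config.name = value;. The helper name parse_config and the regular expression are illustrative and not part of the original answer; it works on the script text alone, so it needs only the standard library.

```python
import re

# Assumed shape of each line: h1.config.<name> = <value>;
ASSIGNMENT = re.compile(r'^\s*h1\.config\.(\w+)\s*=\s*(.+?)\s*;\s*$')


def parse_config(script_text):
    """Return {name: value} for every h1.config assignment in script_text."""
    config = {}
    for line in script_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match is None:
            continue  # skip blank lines and anything that is not an assignment
        name, value = match.groups()
        config[name] = value.strip('"\'')  # drop surrounding quotes, if any
    return config


script_text = '''
    h1.config.ownerId = 7321;
    h1.config.unitId = "164";
'''
config = parse_config(script_text)
print(config['ownerId'], config['unitId'])
```

All values come back as strings; convert with int() where a number is needed.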

Update

If you have to deal with multiple <script> tags, you can collect all of them and then iterate over the result:

jsinfo = soup.find_all("script")

Now jsinfo is of type <class 'bs4.element.ResultSet'>, which you can iterate over like a normal list.
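
As a rough sketch of that iteration, the loop below walks the ResultSet and picks the first <script> whose body contains a given marker, which avoids depending on the tag's position in the page. The function name find_script_containing and the two-script sample HTML are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical page with an unrelated script ahead of the one we want.
html = '''
<script>var unrelated = 1;</script>
<script>
    h1.config.ownerId = 7321;
    h1.config.unitId = "164";
</script>
'''

soup = BeautifulSoup(html, "html.parser")


def find_script_containing(soup, marker):
    """Return the text of the first <script> whose body contains marker, else None."""
    for script in soup.find_all("script"):
        text = script.string or ""  # .string is None for an empty tag
        if marker in text:
            return text
    return None


config_js = find_script_containing(soup, "h1.config.")
print(config_js is not None)
```

Searching by content is more robust than a fixed index when the site adds or removes script tags, at the cost of scanning every script on the page.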

To extract lat and lon from the live page, you can do the following:

#!/usr/bin/env python

from bs4 import BeautifulSoup
import requests

url = 'https://www.your_url'
# the user-agent you specified in the comments
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'}

html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, "html.parser")
jsinfo = soup.find_all("script")

list_of_interest = ['hl.config.lat', 'hl.config.lon']

d = {}
for line in jsinfo[9].text.split('\n'):
    if any(word in line for word in list_of_interest):
        k,v = line.strip().replace('hl.config.','').split(' = ')
        d[k] = v.strip(';')

print('Lat => {}'.format(d['lat']))
print('Lon => {}'.format(d['lon']))

This produces the following output:

Lat => "28.06794"
Lon => "-81.754349"

By appending more values to list_of_interest, you can access other variables as well.

python

urllib

beautifulsoup