Trouble Getting the Phone Number While Parsing the Data Inside a Script

Question

Code

import requests
from bs4 import BeautifulSoup as bs

my_url = 'https://www.olx.com.pk/item/oppo-f17-pro8128-iid-1034320813'

with requests.session() as s:
    r = s.get(my_url)
    page_html = bs(r.content, 'html.parser')
    safe = page_html.find_all('script')
    print("The Length of Script is {0}:".format(len(safe)))
    for i in safe:
        # Print every <script> tag whose text contains the country code
        if "+92" in str(i):
            print(i)

Query

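I want to use a Python script to get the phone number, which is actually present in window.state, but I don't know how to parse window.state. I would be very thankful if you could help me solve this. Thanks in advance!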

Answer 1

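As I mentioned in the comments, window.state appears inside the 7th <script> tag.

I extracted the contents of that script tag, searched the string for phoneNumber, found its index, and was able to get the data you need.

Extracting the data from JSON would have been easier, but the data is not in JSON format.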

import bs4 as bs
import requests

url = 'https://www.olx.com.pk/item/oppo-f17-pro8128-iid-1034320813'
resp = requests.get(url)

# Convert the response text to HTML soup object
soup = bs.BeautifulSoup(resp.text, 'html.parser')

# Select the 7th script tag (that is where the data you need is present)
s = soup.findAll('script')[6]

# Extract the contents of script. This will be a string type.
f = s.contents[0]

# Find the index of substring "phoneNumber" - the data that you need.
idx = f.index('phoneNumber')

# Since you need the phone number, use string slicing and extract the data.
print(f[idx-1: idx + 28])

# Output

"phoneNumber":"+923077250739"

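The slice above uses a fixed width (idx + 28), so it only fits a number of exactly this length. Here is a minimal sketch of the same find-the-index-and-slice idea that stops at the closing quote instead. The helper name slice_quoted_value and the sample string are made up for illustration; the real window.state contents will look different, and this assumes the value is written as "phoneNumber":"..." with no spaces, as in the output shown.

```python
def slice_quoted_value(script_text, key="phoneNumber"):
    """Return the quoted value that follows "key":" in script_text, or None."""
    marker = '"' + key + '":"'
    start = script_text.find(marker)
    if start == -1:
        return None  # key not present in this script
    value_start = start + len(marker)
    # The value ends at the next double quote after the marker
    value_end = script_text.find('"', value_start)
    if value_end == -1:
        return None  # unterminated value
    return script_text[value_start:value_end]


# Hypothetical stand-in for the contents of the <script> tag
sample = 'window.state = {"ad":{"title":"Oppo F17 Pro","phoneNumber":"+923077250739","id":1034320813}};'
print(slice_quoted_value(sample))
```

In the code above you would pass f (the script contents) instead of sample. Returning None rather than raising makes it easy to loop over all script tags and keep the first one that yields a value, instead of relying on the tag being the 7th.
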
Answer 2

I would probably just use a simple regex to target the string inside the double quotes that follow phoneNumber.

import re
import requests

r = requests.get('https://www.olx.com.pk/item/oppo-f17-pro8128-iid-1034320813')
# Non-greedy group captures everything up to the closing quote of the value
print(re.search(r'phoneNumber":"(.*?)"', r.text).group(1))
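One thing to watch for: re.search returns None when the pattern is not found (for example if the page layout changes or the request is blocked), and calling .group(1) on None raises AttributeError. A small sketch of the same regex wrapped in a function that handles that case and also tolerates whitespace around the colon. The names PHONE_RE and find_phone_number and the sample string are my own, not anything from the site.

```python
import re

# Same idea as the one-liner, but anchored on the opening quote of the key
# and tolerant of optional whitespace around the colon.
PHONE_RE = re.compile(r'"phoneNumber"\s*:\s*"([^"]*)"')


def find_phone_number(page_text):
    """Return the first phoneNumber value found in page_text, or None."""
    match = PHONE_RE.search(page_text)
    return match.group(1) if match else None


# Hypothetical stand-in for r.text
sample_page = '<script>window.state = {"phoneNumber" : "+923077250739"};</script>'
print(find_phone_number(sample_page))
print(find_phone_number('<script>window.state = {};</script>'))
```

With a live response you would call find_phone_number(r.text) and check for None before using the result.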

