Web scraping an encoded price
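While web scraping an article, I found that the price appears in the rendered element but not in the page source. In its place, the source contains the following encoded text: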
<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value) {
return base64UTF8Codec.decode(arguments[0])
};
replaceWith(
document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'),
f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA=')
);
</script>
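How can I decode the text into the price?
The text is base64-encoded. If you can locate the right <script> tag with BeautifulSoup, you can use the re module to pull out the encoded string and then decode it: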
import re
import base64
from bs4 import BeautifulSoup
txt = '''<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value){return base64UTF8Codec.decode(arguments[0])};
replaceWith(document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA='));
</script>'''
soup = BeautifulSoup(txt, 'html.parser')
# 1. locate the right <script> tag
script = soup.script
# 2. get coded text from the script tag
coded_text = re.findall(r".*\('(.*?)'\)\);", script.text)[0]
# 3. decode the text
decoded_text = base64.b64decode(coded_text) # b'\n <span class="pull-right"> 2.590,- </span>\n '
# 4. get the price from the decoded text
soup2 = BeautifulSoup(decoded_text, 'html.parser')
print(soup2.span.get_text(strip=True))
Prints:
2.590,-
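On a real page there are usually many <script> tags, so `soup.script` (which returns only the first one) may not be the one you want. The following is a minimal sketch of the same decode steps applied to every script on the page. The names `PAYLOAD_RE` and `extract_encoded_fragments` are hypothetical, and the regular expression assumes every obfuscated script follows the `getElementById('...'), someFunction('...')` shape of the single example above, which may not hold for the whole site.

```python
import base64
import re

from bs4 import BeautifulSoup

# Assumed pattern, inferred only from the one script shown in the question:
#   getElementById('<element id>'), <any function name>('<base64 payload>')
PAYLOAD_RE = re.compile(
    r"getElementById\('([^']+)'\)\s*,\s*\w+\('([A-Za-z0-9+/=]+)'\)"
)


def extract_encoded_fragments(html):
    """Return {element_id: decoded text} for every matching <script> tag."""
    soup = BeautifulSoup(html, "html.parser")
    fragments = {}
    for script in soup.find_all("script"):
        # script.string is None for empty or external scripts
        source = script.string or ""
        for element_id, payload in PAYLOAD_RE.findall(source):
            # The JavaScript helper is named base64UTF8Codec, so UTF-8 is assumed here
            decoded_html = base64.b64decode(payload).decode("utf-8")
            fragment = BeautifulSoup(decoded_html, "html.parser")
            fragments[element_id] = fragment.get_text(strip=True)
    return fragments
```

Calling `extract_encoded_fragments(page_html)` returns a dictionary keyed by the id of the placeholder element, so you can match each decoded price back to the element it replaces in the page.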