How to get all visible text in a web page (not html source)?

Question

For example, I want to get the text displayed on "www.google.com", just as if I had opened it in Chrome and pressed Ctrl+A and Ctrl+C:

..   
Google PrivacyTermsSettingsAdvertisingBusinessAboutHow Search works

rather than:

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><meta content="origin" name="referrer"><title>Google</title><script nonce="kYKSVIWLPxNkDhoVCq276A==">(function(){window.google={kEI:'ZqUZXruXDNfT-
...

I have tried the requests_html module, like below:

import requests_html
s = requests_html.HTMLSession()
page = s.get('https://www.google.com')
print(page.html.text)

But it still shows HTML-like content (inline script source), as below:

Google
(function(){window.google={kEI:'y6cZXu3LJ8SkwAPWz6KIBA',kEXPI:'31',authuser:0,kGL:'ZZ',kBL:'JGpW'};google.sn='webhp';google.kHL='en';google.jsfs='Ffpdje';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||google.kEI};google.getLEI=function(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b};
...

So, how can I get all the text displayed on the page, as if I had pressed Ctrl+A and Ctrl+C?

Thanks.

Answer 1

There are several ways to do this, but the one I usually use is:

from bs4 import BeautifulSoup as bs
import requests_html

s = requests_html.HTMLSession()
page = s.get('https://www.google.com')
# Parse the raw HTML with the lxml parser, then collect only the text nodes
soup = bs(page.text, 'lxml')
print(soup.get_text())

Output:

About Store GmailImagesSign in Remove Report inappropriate predictions PrivacyTermsSettingsSearch settingsAdvanced searchYour data in SearchHistorySearch HelpSend feedbackAdvertisingBusiness How Search works
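
If installing an extra parser is not an option, the same idea can be sketched with only the standard library. This is a rough illustration rather than a drop-in replacement for the snippet above: `VisibleTextParser`, `visible_text`, `HIDDEN_TAGS` and the `sample` markup are names made up for this example, and the list of skipped tags is an assumption about what a browser leaves out of a Ctrl+A copy. It runs on a hard-coded string, so no network access is needed.

```python
from html.parser import HTMLParser

# Assumed set of elements whose text a browser does not show as page text.
HIDDEN_TAGS = {"script", "style", "title", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not nested inside a hidden element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0  # > 0 while inside any tag from HIDDEN_TAGS
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when we are outside every hidden element.
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html, separator=" "):
    """Return the text chunks of `html` joined by `separator`."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


# Hypothetical markup standing in for a downloaded page.
sample = (
    "<html><head><title>Demo</title>"
    "<script>var hidden = 1;</script></head>"
    "<body><a>Privacy</a><a>Terms</a>"
    "<style>a { color: red; }</style></body></html>"
)
print(visible_text(sample))  # Privacy Terms
```

Passing a separator keeps neighbouring links from running together, as in "PrivacyTermsSettings" above. Note that this only filters by tag name: elements hidden through CSS, and text that JavaScript inserts after the page loads, are not handled by this sketch.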

python

python-3.x

python-requests

python-requests-html