Find the canonical link in a FILE type file - BeautifulSoup
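I have many FILE type files (the kind saved on your system without any extension). These files contain the parsed HTML content of news websites. I need to find the canonical link (URL) hidden in there. I am using this code to test one of the files first -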
from bs4 import BeautifulSoup

with open(file, 'r') as f:
    html_text = f.read()
soup = BeautifulSoup(html_text, 'html.parser')
link = soup.find('link', rel = 'canonical')
But I am getting a NoneType object error. I also tried these variations
# Variation 1
link = soup.find('link', {'rel':'canonical'})
# Variation 2
link = soup.find('link', rel = 'canonical')['href']
# Variation 3
link = soup.find('link', {'rel':'canonical'}).get['href']
# Variation 4
link = soup.find('link', {'rel':'canonical'})['href']
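I also tried the soup.find_all variants, but those failed too. (Errors: NoneType object is not subscriptable / NoneType object has no attribute href)
I checked my file manually by opening it in Notepad, and I found this snippet in it: <link rel=\"canonical\" href=\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"/>
This proves that the canonical link is indeed there and is not a NoneType object.
This seems like such a simple problem, but there appears to be some mistake I cannot catch. I went through many questions on Stack Overflow dealing with similar problems and tried their solutions (hence the variations). Any help is appreciated.
Edit - adding the file contents as requested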
"<!DOCTYPE html>\n<!--\n ______ __ ______ __ __\n/\__ _\/\ \ /\__ _\ /\ \__ /\ \__\n\/_/\ \/\ \ \___ __\/_/\ \/ ___\ \ ,_\ __ _ __ ___ __ _____\ \ ,_\\n \ \ \ \ \ _ `\ /'__`\ \ \ \ /' _ `\ \ \/ /'__`\/\`'__\/'___\ /'__`\/\ '__`\ \ \/\n \ \ \ \ \ \ \ \/\ __/ \_\ \__/\ \/\ \ \ \_/\ __/\ \ \//\ \__//\ __/\ \ \L\ \ \ \_\n \ \_\ \ \_\ \_\ \____\ /\_____\ \_\ \_\ \__\ \____\\ \_\\ \____\ \____\\ \ ,__/\ \__\\n \/_/ \/_/\/_/\/____/ \/_____/\/_/\/_/\/__/\/____/ \/_/ \/____/\/____/ \ \ \/ \/__/\n \ \_\\n \/_/\n-->\n<html lang=\"en\">\n <head>\n <title>The Intercept</title>\n <meta charset=\"utf-8\">\n <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no\">\n <meta name=\"msapplication-TileColor\" content=\"#000000\">\n <meta name=\"msapplication-TileImage\" content=\"/static/mstile-144x144.png\">\n <meta name=\"msapplication-config\" content=\"/static/browserconfig.xml\">\n <meta name=\"theme-color\" content=\"#ffffff\">\n <meta property=\"og:url\" content=\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\">\n <link rel=\"apple-touch-icon\" sizes=\"57x57\" href=\"/static/apple-touch-icon-57x57.png\">\n <link rel=\"apple-touch-icon\" sizes=\"60x60\" href=\"/static/apple-touch-icon-60x60.png\">\n <link rel=\"apple-touch-icon\" sizes=\"72x72\" href=\"/static/apple-touch-icon-72x72.png\">\n <link rel=\"apple-touch-icon\" sizes=\"76x76\" href=\"/static/apple-touch-icon-76x76.png\">\n <link rel=\"apple-touch-icon\" sizes=\"114x114\" href=\"/static/apple-touch-icon-114x114.png\">\n <link rel=\"apple-touch-icon\" sizes=\"120x120\" href=\"/static/apple-touch-icon-120x120.png\">\n <link rel=\"apple-touch-icon\" sizes=\"144x144\" href=\"/static/apple-touch-icon-144x144.png\">\n <link rel=\"apple-touch-icon\" sizes=\"152x152\" href=\"/static/apple-touch-icon-152x152.png\">\n <link rel=\"apple-touch-icon\" sizes=\"180x180\" 
href=\"/static/apple-touch-icon-180x180.png\">\n <link rel=\"icon\" type=\"image/png\" href=\"/static/favicon-32x32.png\" sizes=\"32x32\">\n <link rel=\"icon\" type=\"image/png\" href=\"/static/android-chrome-192x192.png\" sizes=\"192x192\">\n <link rel=\"icon\" type=\"image/png\" href=\"/static/favicon-96x96.png\" sizes=\"96x96\">\n <link rel=\"icon\" type=\"image/png\" href=\"/static/favicon-16x16.png\" sizes=\"16x16\">\n <link rel=\"manifest\" href=\"/static/manifest.json\">\n <link rel=\"shortcut icon\" href=\"/static/favicon.ico\">\n <link rel=\"canonical\" href=\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"/>\n \n \n \n \n <!--[if !IE]><!--><link rel=\"stylesheet\" type=\"text/css\" href=\"/assets/app42e762a729b53f810f04.css\"><!--<![endif]-->\n <!--[if gte IE 9]><link rel=\"stylesheet\" type=\"text/css\" href=\"/assets/app42e762a729b53f810f04.css\"><![endif]-->\n <!--[if lte IE 8]><link rel=\"stylesheet\" type=\"text/css\" href=\"/assets/ie842e762a729b53f810f04.css\"><![endif]-->\n \n <!--[if lte IE 8]>\n <script>\n document.createElement('header');\n document.createElement('nav');\n document.createElement('section');\n document.createElement('article');\n document.createElement('aside');\n document.createElement('footer');\n document.createElement('hgroup');\n document.createElement('picture');\n </script>\n <![endif]-->\n <script id=\"ad-block-test\" src=\"/ads.js\" data-blocked=\"true\"></script>\n </head>\n <body>\n <script src=\"/assets/sniffer42e762a729b53f810f04.js\"></script>\n <div id=\"Root\"><div class=\"InterceptWrapper\" data-reactroot=\"\" data-reactid=\"1\" data-react-checksum=\"1442884202\"><div data-reactid=\"2\"><!-- react-empty: 3 --><!-- react-empty: 4 --></div><div class=\"Header Header--en Header--route-theintercept\" data-reactid=\"5\"><span data-reactid=\"6\"><div class=\"Header-hamburger\" data-reactid=\"7\"><a class=\"Header-hamburger-link\" style=\"color:;\" 
href=\"/2020/10/27/senator-perdue-ossoff-china/feed/?menu=1\" data-reactid=\"8\"><span class=\"Icon Icon--Menu icon-TI_Menu\" data-reactid=\"9\"></span></a></div><nav class=\"Header-menu\" data-reactid=\"10\"><div class=\"Logo\" data-reactid=\"11\"><div class=\"Logo-bg-block\" data-reactid=\"12\"><div class=\"GridContainer\" data-reactid=\"13\"><div class=\"GridRow\" data-reactid=\"14\"><div class=\"Logo-bg\" data-reactid=\"15\"></div></div></div></div><div class=\"Logo-block\" data-reactid=\"16\"><a href=\"/\" data-reactid=\"17\"><span class=\"Logo-fallback\" style=\"color:#111;\" data-reactid=\"18\"><!-- react-text: 19 -->The<!-- /react-text --><br data-reactid=\"20\"/><!-- react-text: 21 -->Intercept_<!-- /react-text --><span data-reactid=\"22\"><br data-reactid=\"23\"/><!-- react-text: 24 --><!-- /react-text --></span></span><svg class=\"Logo-svg\" height=\"50px\" version=\"1.1\" viewBox=\"0 0 140 50\" width=\"140px\" data-reactid=\"25\"><g data-reactid=\"26\"><path class=\"Logo-path\" d=\"M51.731,30.458c1.246,0,2.264,1.425,2.264,3.206l-4.605,0.56C49.517,31.781,50.28,30.458,51.731,30.458 M40.789,8.601 c1.247,0,2.265,1.424,2.265,3.206l-4.606,0.559C38.575,9.924,39.339,8.601,40.789,8.601 M92.774,30.458 c1.247,0,2.264,1.425,2.264,3.206l-4.605,0.56C90.56,31.781,91.323,30.458,92.774,30.458 M128.295,46.463H140v-2.188h-11.705 V46.463z M106.642,31.679c0.279-0.101,0.61-0.178,1.272-0.178c2.544,0,4.275,1.68,4.275,5.42c0,3.104-1.705,5.216-4.173,5.216 c-0.408,0-0.891-0.076-1.374-0.229V31.679z M68.652,33.206h3.18v-4.097c-0.61-0.254-1.017-0.356-1.603-0.356 c-0.992,0-2.188,0.662-3.435,1.654l-0.916,0.713v-2.367h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h9.313v-0.865 l-2.367-0.763v-9.39c0.611-0.229,1.68-0.407,1.934-0.407L68.652,33.206z M80.484,28.753c-3.995,0-8.372,2.494-8.372,8.066 c0,4.631,3.079,6.82,6.412,6.82c1.146,0,2.469-0.153,4.402-0.967l2.341-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.545,0.382 
c-3.155,0-5.547-1.883-5.547-6.132c0-2.952,1.705-4.555,3.257-4.555c0.153,0,0.331,0.025,0.484,0.102l1.399,2.646h2.926v-3.613 C83.741,29.16,82.138,28.753,80.484,28.753 M123.792,26.489h-1.171l-5.521,3.18v1.069l1.857,0.178v10.076 c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636l-0.178-0.941c-0.33,0.076-0.916,0.101-1.246,0.101 c-1.858,0-3.079-0.687-3.079-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M40.128,26.489h-1.171l-5.522,3.18v1.069l1.858,0.178 v10.076c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636L44.453,41.4 c-0.331,0.076-0.916,0.101-1.247,0.101c-1.857,0-3.078-0.687-3.078-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M58.626,35.293 v-0.407c0-2.189-1.476-6.133-6.437-6.133c-3.766,0-7.558,2.494-7.558,8.066c0,4.555,3.334,6.82,6.972,6.82 c1.069,0,2.621-0.178,4.428-0.967l2.264-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.596,0.382c-2.799,0-5.878-1.628-6.005-5.852 H58.626z M99.67,35.293v-0.407c0-2.189-1.476-6.133-6.438-6.133c-3.766,0-7.557,2.494-7.557,8.066c0,4.555,3.333,6.82,6.972,6.82 c1.068,0,2.62-0.178,4.427-0.967l2.265-0.993l-0.331-0.916c-0.585,0.229-1.4,0.382-2.596,0.382c-2.798,0-5.877-1.628-6.005-5.852 H99.67z M47.685,13.435v-0.407c0-2.188-1.476-6.132-6.438-6.132c-3.766,0-7.557,2.493-7.557,8.066c0,4.555,3.333,6.819,6.972,6.819 c1.069,0,2.621-0.178,4.427-0.967l2.265-0.992l-0.331-0.916c-0.585,0.229-1.399,0.382-2.595,0.382 c-2.799,0-5.878-1.629-6.005-5.853H47.685z M6.438,25.598v15.725L3.97,42.265v0.992h10.865v-0.992l-2.468-0.942V25.598l2.468-0.942 v-0.992H3.97v0.992L6.438,25.598z M31.781,41.629V33.74c0-2.926-1.094-4.987-4.045-4.987c-1.222,0-2.138,0.382-3.334,0.916 l-2.163,0.942v-1.858h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h8.804v-0.865l-1.858-0.763v-9.924 c0.458-0.127,1.247-0.204,1.833-0.204c1.577,0,2.875,0.865,2.875,3.461v6.667l-1.858,0.763v0.865h8.804v-0.865L31.781,41.629z M106.642,28.753h-0.662l-6.285,1.272v0.967l2.112,0.637v15.598l-2.112,0.763v0.865h9.567V47.99l-2.62-0.763v-3.588h0.992 
c5.954,0,9.287-3.614,9.287-8.626c0-4.301-3.002-6.26-5.241-6.26c-0.891,0-2.443,0.535-3.868,1.247l-0.967,0.483h-0.203V28.753z M31.527,19.771v-7.888c0-2.926-1.094-4.987-4.046-4.987c-1.221,0-2.137,0.381-3.333,0.916l-2.163,0.941V0h-0.662l-4.911,1.807H0 v4.911h2.316L3.257,3.69h3.181v15.776L3.97,20.407V21.4h10.61v-0.993l-2.213-0.941V3.69h4.783v16.081l-1.857,0.763V21.4h8.55 v-0.866l-1.858-0.763V9.873c0.458-0.127,1.247-0.229,1.832-0.229c1.578,0,2.875,0.865,2.875,3.46v6.667l-1.857,0.763V21.4h8.804 v-0.866L31.527,19.771z\" fill=\"#111\" data-reactid=\"27\"></path></g></svg></a></div></div><ul class=\"Header-language-list\" data-reactid=\"28\"><li class=\"Header-language-list-item Header-language-list-item--active\" data-reactid=\"29\"><a class=\"Header-language-link\" href=\"/\" data-reactid=\"30\">English</a></li><li class=\"Header-language-list-item\" data-reactid=\"31\"><a class=\"Header-language-link\" href=\"/brasil/\" data-reactid=\"32\">Portugu\u00eas</a></li></ul><div class=\"Header-search-block\" data-reactid=\"33\"><form action=\"/search\" type=\"get\" data-reactid=\"34\"><label class=\"Header-search-label\" for=\"search\" data-reactid=\"35\"><span class=\"Icon Icon--Search icon-TI_Search\" data-reactid=\"36\"></span></label><input id=\"search\" class=\"Header-search-input\" name=\"s\" data-reactid=\"37\"/></form></div><div class=\"Header-menu-mission-block\" data-reactid=\"38\"><ul class=\"Header-menu-list Header-menu-list--collection-items\" data-reactid=\"39\"><li class=\"Header-menu-list-item\" data-reactid=\"40\"><a class=\"Header-menu-link\" href=\"/politics/\" data-reactid=\"41\">Politics</a></li><li class=\"Header-menu-list-item\" data-reactid=\"42\"><a class=\"Header-menu-link\" href=\"/justice/\" data-reactid=\"43\">Justice</a></li><li class=\"Header-menu-list-item\" data-reactid=\"44\"><a class=\"Header-menu-link\" href=\"/national-security/\" data-reactid=\"45\">National Security</a></li><li class=\"Header-menu-list-item\" data-reactid=\"46\"><a 
class=\"Header-menu-link\" href=\"/world/\" data-reactid=\"47\">World</a></li><li class=\"Header-menu-list-item\" data-reactid=\"48\"><a class=\"Header-menu-link\" href=\"/technology/\" data-reactid=\"49\">Technology</a></li><li class=\"Header-menu-list-item\" data-reactid=\"50\"><a class=\"Header-menu-link\" href=\"/environment/\" data-reactid=\"51\">Environment</a></li></ul></div><div class=\"Header-menu-mission-block\" data-reactid=\"52\"><ul class=\"Header-menu-list Header-menu-list--mission-items\" data-reactid=\"53\"><li class=\"Header-menu-list-item\" data-reactid=\"54\"><a class=\"Header-menu-link\" href=\"/special-investigations/\" data-reactid=\"55\">Special Investigations</a></li><li class=\"Header-menu-list-item\" data-reactid=\"56\"><a class=\"Header-menu-link\" href=\"/voices/\" data-reactid=\"57\">Voices</a></li><li class=\"Header-menu-list-item\" data-reactid=\"58\"><a class=\"Header-menu-link\" href=\"/podcasts/\" data-reactid=\"59\">Podcasts</a></li><li class=\"Header-menu-list-item\" data-reactid=\"60\"><a class=\"Header-menu-link\" href=\"/videos/\" data-reactid=\"61\">Videos</a></li><li class=\"Header-menu-list-item\" data-reactid=\"62\"><a class=\"Header-menu-link\" href=\"/documents/\" data-reactid=\"63\">Documents</a></li><li class=\"Header-menu-list-item\" data-reactid=\"64\"><a class=\"Header-menu-link\" href=\"https://join.theintercept.com/donate/now?source=web_intercept_20200601_hamburger\" data-reactid=\"65\"><div class=\"Header-menu-list-item-button\" data-reactid=\"66\"><!-- react-text: 67 -->Become A Member<!-- /react-text --><span class=\"Icon Icon--Arrow_02_Right icon-TI_Arrow_02_Right\" data-reactid=\"68\"></span></div></a></li></ul></div><ul class=\"Header-menu-list Header-menu-list--content-items\" data-reactid=\"69\"><li class=\"Header-menu-list-item\" data-reactid=\"70\"><a class=\"Header-menu-link\" href=\"/about/\" data-reactid=\"71\">About</a></li><li class=\"Header-menu-list-item\" data-reactid=\"72\"><a 
class=\"Header-menu-link\" href=\"/policies/\" data-reactid=\"73\">Editorial Policies</a></li><li class=\"Header-menu-list-item\" data-reactid=\"74\"><a class=\"Header-menu-link\" href=\"/source/\" data-reactid=\"75\">Become a Source</a></li><li class=\"Header-menu-list-item\" data-reactid=\"76\"><a class=\"Header-menu-link\" href=\"/newsletter/?source=web_hamburger\" data-reactid=\"77\">Join Newsletter</a></li></ul><div class=\"Header-footer\" data-reactid=\"78\"><div class=\"Header-social-links\" data-reactid=\"79\"><a class=\"Header-social-link\" target=\"_blank\" data-label=\"facebook\" href=\"https://www.facebook.com/theinterceptflm\" data-reactid=\"80\"><span class=\"Icon Icon--Facebook icon-TI_Facebook\" data-reactid=\"81\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"twitter\" href=\"https://twitter.com/theintercept\" data-reactid=\"82\"><span class=\"Icon Icon--Twitter icon-TI_Twitter\" data-reactid=\"83\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"instagram\" href=\"https://www.instagram.com/theintercept/\" data-reactid=\"84\"><span class=\"Icon Icon--Instagram icon-TI_Instagram\" data-reactid=\"85\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"tumblr\" href=\"https://the-intercept.tumblr.com\" data-reactid=\"86\"><span class=\"Icon Icon--Tumblr icon-TI_Tumblr\" data-reactid=\"87\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"snapchat\" href=\"https://www.snapchat.com/add/theintercept\" data-reactid=\"88\"><span class=\"Icon Icon--Snapchat icon-TI_Snapchat\" data-reactid=\"89\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"flipboard\" href=\"https://flipboard.com/@TheIntercept\" data-reactid=\"90\"><span class=\"Icon Icon--Flipboard icon-TI_Flipboard\" data-reactid=\"91\"></span></a><a class=\"Header-social-link\" data-label=\"rss\" href=\"/feeds/\" data-reactid=\"92\"><span class=\"Icon Icon--RSS icon-TI_RSS\" 
data-reactid=\"93\"></span></a></div><img class=\"Header-FLM-svg\" src=\"/static/FLM.svg\" alt=\"First Look Media logo\" data-reactid=\"94\"/><p class=\"Header-TM\" data-reactid=\"95\">The Intercept is a First Look Media Company.</p><cite class=\"Header-copyright\" data-reactid=\"96\"><!-- react-text: 97 -->\u00a9 First Look Media. <!-- /react-text --><span data-reactid=\"98\">All rights reserved</span></cite><ul class=\"Header-footer-list\" data-reactid=\"99\"><li class=\"Header-footer-list-item\" data-reactid=\"100\"><a class=\"Header-footer-link\" data-label=\"terms of use\" href=\"/terms-use/\" data-reactid=\"101\"><span data-reactid=\"102\">Terms of use</span></a></li><li class=\"Header-footer-list-item\" data-reactid=\"103\"><a class=\"Header-footer-link\" data-label=\"privacy policy\" href=\"/privacy-policy/\" data-reactid=\"104\"><span data-reactid=\"105\">Privacy</span></a></li></ul></div></nav></span></div><div class=\"ErrorPage\" data-reactid=\"106\"><div class=\"Logo\" data-reactid=\"107\"><div class=\"Logo-block\" data-reactid=\"108\"><a href=\"/\" data-reactid=\"109\"><span class=\"Logo-fallback\" style=\"color:#fff;\" data-reactid=\"110\"><!-- react-text: 111 -->The<!-- /react-text --><br data-reactid=\"112\"/><!-- react-text: 113 -->Intercept_<!-- /react-text --><span data-reactid=\"114\"><br data-reactid=\"115\"/><!-- react-text: 116 --><!-- /react-text --></span></span><svg class=\"Logo-svg\" height=\"50px\" version=\"1.1\" viewBox=\"0 0 140 50\" width=\"140px\" data-reactid=\"117\"><g data-reactid=\"118\"><path class=\"Logo-path\" d=\"M51.731,30.458c1.246,0,2.264,1.425,2.264,3.206l-4.605,0.56C49.517,31.781,50.28,30.458,51.731,30.458 M40.789,8.601 c1.247,0,2.265,1.424,2.265,3.206l-4.606,0.559C38.575,9.924,39.339,8.601,40.789,8.601 M92.774,30.458 c1.247,0,2.264,1.425,2.264,3.206l-4.605,0.56C90.56,31.781,91.323,30.458,92.774,30.458 M128.295,46.463H140v-2.188h-11.705 V46.463z 
M106.642,31.679c0.279-0.101,0.61-0.178,1.272-0.178c2.544,0,4.275,1.68,4.275,5.42c0,3.104-1.705,5.216-4.173,5.216 c-0.408,0-0.891-0.076-1.374-0.229V31.679z M68.652,33.206h3.18v-4.097c-0.61-0.254-1.017-0.356-1.603-0.356 c-0.992,0-2.188,0.662-3.435,1.654l-0.916,0.713v-2.367h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h9.313v-0.865 l-2.367-0.763v-9.39c0.611-0.229,1.68-0.407,1.934-0.407L68.652,33.206z M80.484,28.753c-3.995,0-8.372,2.494-8.372,8.066 c0,4.631,3.079,6.82,6.412,6.82c1.146,0,2.469-0.153,4.402-0.967l2.341-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.545,0.382 c-3.155,0-5.547-1.883-5.547-6.132c0-2.952,1.705-4.555,3.257-4.555c0.153,0,0.331,0.025,0.484,0.102l1.399,2.646h2.926v-3.613 C83.741,29.16,82.138,28.753,80.484,28.753 M123.792,26.489h-1.171l-5.521,3.18v1.069l1.857,0.178v10.076 c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636l-0.178-0.941c-0.33,0.076-0.916,0.101-1.246,0.101 c-1.858,0-3.079-0.687-3.079-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M40.128,26.489h-1.171l-5.522,3.18v1.069l1.858,0.178 v10.076c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636L44.453,41.4 c-0.331,0.076-0.916,0.101-1.247,0.101c-1.857,0-3.078-0.687-3.078-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M58.626,35.293 v-0.407c0-2.189-1.476-6.133-6.437-6.133c-3.766,0-7.558,2.494-7.558,8.066c0,4.555,3.334,6.82,6.972,6.82 c1.069,0,2.621-0.178,4.428-0.967l2.264-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.596,0.382c-2.799,0-5.878-1.628-6.005-5.852 H58.626z M99.67,35.293v-0.407c0-2.189-1.476-6.133-6.438-6.133c-3.766,0-7.557,2.494-7.557,8.066c0,4.555,3.333,6.82,6.972,6.82 c1.068,0,2.62-0.178,4.427-0.967l2.265-0.993l-0.331-0.916c-0.585,0.229-1.4,0.382-2.596,0.382c-2.798,0-5.877-1.628-6.005-5.852 H99.67z M47.685,13.435v-0.407c0-2.188-1.476-6.132-6.438-6.132c-3.766,0-7.557,2.493-7.557,8.066c0,4.555,3.333,6.819,6.972,6.819 c1.069,0,2.621-0.178,4.427-0.967l2.265-0.992l-0.331-0.916c-0.585,0.229-1.399,0.382-2.595,0.382 
c-2.799,0-5.878-1.629-6.005-5.853H47.685z M6.438,25.598v15.725L3.97,42.265v0.992h10.865v-0.992l-2.468-0.942V25.598l2.468-0.942 v-0.992H3.97v0.992L6.438,25.598z M31.781,41.629V33.74c0-2.926-1.094-4.987-4.045-4.987c-1.222,0-2.138,0.382-3.334,0.916 l-2.163,0.942v-1.858h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h8.804v-0.865l-1.858-0.763v-9.924 c0.458-0.127,1.247-0.204,1.833-0.204c1.577,0,2.875,0.865,2.875,3.461v6.667l-1.858,0.763v0.865h8.804v-0.865L31.781,41.629z M106.642,28.753h-0.662l-6.285,1.272v0.967l2.112,0.637v15.598l-2.112,0.763v0.865h9.567V47.99l-2.62-0.763v-3.588h0.992 c5.954,0,9.287-3.614,9.287-8.626c0-4.301-3.002-6.26-5.241-6.26c-0.891,0-2.443,0.535-3.868,1.247l-0.967,0.483h-0.203V28.753z M31.527,19.771v-7.888c0-2.926-1.094-4.987-4.046-4.987c-1.221,0-2.137,0.381-3.333,0.916l-2.163,0.941V0h-0.662l-4.911,1.807H0 v4.911h2.316L3.257,3.69h3.181v15.776L3.97,20.407V21.4h10.61v-0.993l-2.213-0.941V3.69h4.783v16.081l-1.857,0.763V21.4h8.55 v-0.866l-1.858-0.763V9.873c0.458-0.127,1.247-0.229,1.832-0.229c1.578,0,2.875,0.865,2.875,3.46v6.667l-1.857,0.763V21.4h8.804 v-0.866L31.527,19.771z\" fill=\"#fff\" data-reactid=\"119\"></path></g></svg></a></div></div><div class=\"GridContainer\" data-reactid=\"120\"><div class=\"GridRow\" data-reactid=\"121\"><div class=\"ErrorPage-container\" data-reactid=\"122\"><h2 class=\"ErrorPage-pagetitle\" data-reactid=\"123\">Error 404</h2><h1 class=\"ErrorPage-title\" data-reactid=\"124\">Page not found</h1><p class=\"ErrorPage-text\" data-reactid=\"125\"><!-- react-text: 126 -->We couldn\u2019t find anything at this address. 
Please check the URL or go to the <!-- /react-text --><a href=\"/\" data-reactid=\"127\">homepage</a><!-- react-text: 128 -->.<!-- /react-text --></p></div></div></div></div><!-- react-empty: 129 --><div style=\"display:none;\" data-reactid=\"130\"><svg\n xmlns=\"http://www.w3.org/2000/svg\"\n xmlns:xlink=\"http://www.w3.org/1999/xlink\"\n height=\"500\"\n width=\"500\"\n viewBox=\"0 0 500 500\"\n aria-labelledby=\"title desc\"\n>\n <title id=\"title\">Filters SVG</title>\n <defs>\n <filter id=\"bleed\" filterUnits=\"objectBoundingBox\">\n <feColorMatrix\n type=\"matrix\"\n values=\"1 0 0 0 0, 0 0.15 0 0 0, 0 0 .20 0 0, 0 0 0 1 0\"\n />\n </filter>\n </defs>\n</svg></div><div class=\"ThirdPartySlot\" id=\"third-party--viewport-top\" data-reactid=\"131\"></div><div class=\"ThirdPartySlot\" id=\"third-party--viewport-takeover\" data-reactid=\"132\"></div><div class=\"ThirdPartySlot\" id=\"third-party--viewport-bottom\" data-reactid=\"133\"></div></div></div>\n <script>\n window.initialStoreTree = {\"bodyClasses\":[],\"categoryPostIDs\":{},\"commentsExpanded\":{},\"contentLanguage\":null,\"dispatcher\":{\"backend\":null,\"node\":null,\"type\":null},\"documentCloud\":{\"document\":{},\"embedUrl\":\"\",\"text\":{}},\"documentIDs\":[],\"error\":{\"message\":\"Page not 
found\",\"status\":404},\"featureIDs\":[],\"featuresLanguage\":{},\"googleAMPUrl\":\"\",\"hamburgerColor\":null,\"host\":\"theintercept.com\",\"inInitialRender\":false,\"languageLanding\":{},\"liveBlogsUpdatesIDs\":{},\"loading\":false,\"mediaPlayer\":null,\"newsletter\":{\"form\":{\"description\":\"\",\"status\":\"\"}},\"podcastPage\":{},\"podcastsHomepage\":{\"speakingIDs\":[]},\"postLanding\":{\"redirect\":null},\"postsMetaIDs\":{},\"resources\":{\"alerts\":{},\"annotationSets\":{},\"annotations\":{},\"categories\":{},\"comments\":{},\"documents\":{},\"liveBlogs\":{},\"platform\":{\"theintercept\":{\"Article\":{},\"Author\":{},\"Document\":{},\"GeoLocation\":{},\"HttpReturn\":{},\"Podcast\":{},\"PodcastEpisode\":{},\"Section\":{},\"bySpeakingID\":{},\"documentArchives\":{},\"documentReleases\":{},\"nodePromos\":{\"default\":{}}},\"theintercept-brasil\":{\"HttpReturn\":{},\"bySpeakingID\":{},\"latestPromos\":[],\"nodePromos\":{\"default\":{}}}},\"postCommentMeta\":{},\"posts\":{},\"promoBanners\":{},\"series\":{},\"seriesDocuments\":{},\"staff\":{},\"taxonomies\":{},\"timeline\":{}},\"reverseChronIDs\":{},\"route\":{\"names\":[],\"params\":{},\"path\":\"\",\"pathname\":\"\",\"query\":{}},\"routed\":false,\"scrollToPostComments\":null,\"searchResultIDs\":[],\"seriesHomepage\":{\"curatedItems\":{\"postIds\":[],\"seriesSlugs\":[]},\"recentItems\":{\"postIds\":[],\"seriesSlugs\":[]}},\"seriesPostIDs\":{},\"sidToday\":{\"search\":{\"lastCursor\":null,\"loading\":false,\"speakingIDs\":[],\"totalCount\":null}},\"sidTodayFilesUpdateReports\":{\"category\":{}},\"specialSeriesItems\":[],\"squirrelDocumentIDs\":[],\"squirrelIDs\":[],\"staffIDs\":[],\"surveillanceCatalogData\":null,\"surveillanceCatalogVendors\":null,\"tocChapters\":{},\"tracking\":{\"currentUrl\":\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\",\"previousUrl\":null}};\n window.config = 
{\"assets\":{\"host\":\"\",\"webpack\":false},\"aws_static\":\"https://static.theintercept.com\",\"coral_talk_api_origin\":\"https://talk.theintercept.com\",\"coral_talk_origin\":\"https://talk.theintercept.com\",\"coral_talk_permalink_cutover_date\":\"2019-06-24T14:00:00.000Z\",\"donation_base_url\":\"https://join.theintercept.com/donate/\",\"donation_base_url_brasil\":\"http://catarse.me/intercept/\",\"env\":{\"NODE_ENV\":\"production\"},\"facebook\":{\"tracking_pixel_id\":\"2151258874911575\"},\"google\":{\"id\":\"UA-79475609-15\"},\"graphql_realm_id\":\"UmVhbG06NTJiOWMwOGEtMjQwYS00NzMxLThlYTAtMjMyY2RiYTYwNzBh\",\"graphql_realm_id__brasil\":\"UmVhbG1Db250ZW50OjE3NTU4MzMyLWUwZTQtNDIyYy1iNDcyLWZkMDQzMmRiOGRhYw==\",\"graphql_url\":\"http://read.usq.flmcloud.local:3002/graphql\",\"hash\":\"42e762a729b53f810f04\",\"host\":\"theintercept.com\",\"imgix\":{\"additional_origins\":[\"https://firstlook.org\"],\"domain\":\"theintercept.imgix.net\"},\"logs\":{\"level\":\"info\"},\"onsite_origins\":[\"https://theintercept.com\"],\"origin\":\"https://theintercept.com\",\"override_private_wp_host\":\"theintercept.com\",\"parsely\":{\"endpoint\":\"https://c.prod.theintercept.com/a\",\"site_id\":\"theintercept.com\"},\"piano\":{\"application_id\":\"hsZyoAWmIE\",\"origin\":\"https://o.prod.theintercept.com\"},\"port\":8080,\"private_wp_origin\":\"https://wp.theintercept.com\",\"public_api_origin\":\"https://theintercept.com\",\"public_wp_origin\":\"https://theintercept.com\",\"request_timeout\":30000,\"set_headers\":{\"Access-Control-Allow-Origin\":\"*\",\"Referrer-Policy\":\"strict-origin-when-cross-origin\",\"Strict-Transport-Security\":\"max-age=63072000; includeSubDomains; preload\",\"Vary\":\"Accept-Encoding\",\"X-Content-Type-Options\":\"nosniff\",\"X-Frame-Options\":\"SAMEORIGIN\",\"X-Xss-Protection\":\"1; mode=block\"},\"site_prefix\":\"\"};\n window.__COUNTRY_CODE__ = \"US\";\n </script>\n \n \n <script type=\"application/ld+json\">\n 
{\"url\":\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"}\n </script>\n \n <div id=\"parsely-root\" style=\"display: none\">\n <div id=\"parsely-cfg\" data-parsely-site=\"theintercept.com\"></div>\n </div>\n \n <script src=\"/assets/app42e762a729b53f810f04.js\"></script>\n </body>\n</html>\n"
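It looks like you are loading the file outside the with block, after the block has already closed, in which case the soup has nothing to work with.
Also, you are dealing with broken HTML: the value of rel is literally \"canonical\", backslashes and quotes included, so an exact match on canonical finds nothing. You have to account for that. Either specify that value explicitly or use * (substring match) in the selector.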
from bs4 import BeautifulSoup

with open('a.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

for i in soup.select('link[rel*=canonical]'):
    print(i['href'])
Output:
\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"/
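Note that the href printed above still carries the escaped quotes and a trailing slash. If the files really do store every quote as the two characters \" (an assumption based on the dump in the question), it may be simpler to undo that escaping before parsing. The sketch below is illustrative only: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the names LinkCollector, collect_links, unescape_quotes and the shortened raw sample are all made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (rel, href) pairs for every <link> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attr = dict(attrs)
            self.links.append((attr.get("rel"), attr.get("href")))


def collect_links(markup):
    """Parse markup and return the list of (rel, href) pairs."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


def unescape_quotes(raw):
    # Assumption: quotes and newlines are stored as the two-character
    # sequences \" and \n, as in the dump shown in the question.
    return raw.replace('\\"', '"').replace("\\n", "\n")


# Shortened, hypothetical stand-in for the contents of one saved file.
raw = (
    r'<head><link rel=\"shortcut icon\" href=\"/static/favicon.ico\">\n'
    r'<link rel=\"canonical\" href=\"https://theintercept.com/2020/10/27/'
    r'senator-perdue-ossoff-china/feed/\"/></head>'
)

# Parsed as-is, each rel value keeps its backslash and quote characters,
# so an exact comparison with "canonical" cannot succeed.
raw_rels = [rel for rel, _ in collect_links(raw)]

# After un-escaping, the attribute values are clean and can be matched exactly.
clean_links = collect_links(unescape_quotes(raw))
canonical_url = next(
    (href for rel, href in clean_links
     if rel and "canonical" in rel.lower().split()),
    None,
)

print(raw_rels)
print(canonical_url)
```

The cleaned string can be passed to BeautifulSoup just as well; with the quotes restored, the plain soup.find('link', rel='canonical') from the question should then match.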
c5.954,0,9.287-3.614,9.287-8.626c0-4.301-3.002-6.26-5.241-6.26c-0.891,0-2.443,0.535-3.868,1.247l-0.967,0.483h-0.203V28.753z M31.527,19.771v-7.888c0-2.926-1.094-4.987-4.046-4.987c-1.221,0-2.137,0.381-3.333,0.916l-2.163,0.941V0h-0.662l-4.911,1.807H0 v4.911h2.316L3.257,3.69h3.181v15.776L3.97,20.407V21.4h10.61v-0.993l-2.213-0.941V3.69h4.783v16.081l-1.857,0.763V21.4h8.55 v-0.866l-1.858-0.763V9.873c0.458-0.127,1.247-0.229,1.832-0.229c1.578,0,2.875,0.865,2.875,3.46v6.667l-1.857,0.763V21.4h8.804 v-0.866L31.527,19.771z\" fill=\"#111\" data-reactid=\"27\"></path></g></svg></a></div></div><ul class=\"Header-language-list\" data-reactid=\"28\"><li class=\"Header-language-list-item Header-language-list-item--active\" data-reactid=\"29\"><a class=\"Header-language-link\" href=\"/\" data-reactid=\"30\">English</a></li><li class=\"Header-language-list-item\" data-reactid=\"31\"><a class=\"Header-language-link\" href=\"/brasil/\" data-reactid=\"32\">Portugu\u00eas</a></li></ul><div class=\"Header-search-block\" data-reactid=\"33\"><form action=\"/search\" type=\"get\" data-reactid=\"34\"><label class=\"Header-search-label\" for=\"search\" data-reactid=\"35\"><span class=\"Icon Icon--Search icon-TI_Search\" data-reactid=\"36\"></span></label><input id=\"search\" class=\"Header-search-input\" name=\"s\" data-reactid=\"37\"/></form></div><div class=\"Header-menu-mission-block\" data-reactid=\"38\"><ul class=\"Header-menu-list Header-menu-list--collection-items\" data-reactid=\"39\"><li class=\"Header-menu-list-item\" data-reactid=\"40\"><a class=\"Header-menu-link\" href=\"/politics/\" data-reactid=\"41\">Politics</a></li><li class=\"Header-menu-list-item\" data-reactid=\"42\"><a class=\"Header-menu-link\" href=\"/justice/\" data-reactid=\"43\">Justice</a></li><li class=\"Header-menu-list-item\" data-reactid=\"44\"><a class=\"Header-menu-link\" href=\"/national-security/\" data-reactid=\"45\">National Security</a></li><li class=\"Header-menu-list-item\" data-reactid=\"46\"><a 
class=\"Header-menu-link\" href=\"/world/\" data-reactid=\"47\">World</a></li><li class=\"Header-menu-list-item\" data-reactid=\"48\"><a class=\"Header-menu-link\" href=\"/technology/\" data-reactid=\"49\">Technology</a></li><li class=\"Header-menu-list-item\" data-reactid=\"50\"><a class=\"Header-menu-link\" href=\"/environment/\" data-reactid=\"51\">Environment</a></li></ul></div><div class=\"Header-menu-mission-block\" data-reactid=\"52\"><ul class=\"Header-menu-list Header-menu-list--mission-items\" data-reactid=\"53\"><li class=\"Header-menu-list-item\" data-reactid=\"54\"><a class=\"Header-menu-link\" href=\"/special-investigations/\" data-reactid=\"55\">Special Investigations</a></li><li class=\"Header-menu-list-item\" data-reactid=\"56\"><a class=\"Header-menu-link\" href=\"/voices/\" data-reactid=\"57\">Voices</a></li><li class=\"Header-menu-list-item\" data-reactid=\"58\"><a class=\"Header-menu-link\" href=\"/podcasts/\" data-reactid=\"59\">Podcasts</a></li><li class=\"Header-menu-list-item\" data-reactid=\"60\"><a class=\"Header-menu-link\" href=\"/videos/\" data-reactid=\"61\">Videos</a></li><li class=\"Header-menu-list-item\" data-reactid=\"62\"><a class=\"Header-menu-link\" href=\"/documents/\" data-reactid=\"63\">Documents</a></li><li class=\"Header-menu-list-item\" data-reactid=\"64\"><a class=\"Header-menu-link\" href=\"https://join.theintercept.com/donate/now?source=web_intercept_20200601_hamburger\" data-reactid=\"65\"><div class=\"Header-menu-list-item-button\" data-reactid=\"66\"><!-- react-text: 67 -->Become A Member<!-- /react-text --><span class=\"Icon Icon--Arrow_02_Right icon-TI_Arrow_02_Right\" data-reactid=\"68\"></span></div></a></li></ul></div><ul class=\"Header-menu-list Header-menu-list--content-items\" data-reactid=\"69\"><li class=\"Header-menu-list-item\" data-reactid=\"70\"><a class=\"Header-menu-link\" href=\"/about/\" data-reactid=\"71\">About</a></li><li class=\"Header-menu-list-item\" data-reactid=\"72\"><a 
class=\"Header-menu-link\" href=\"/policies/\" data-reactid=\"73\">Editorial Policies</a></li><li class=\"Header-menu-list-item\" data-reactid=\"74\"><a class=\"Header-menu-link\" href=\"/source/\" data-reactid=\"75\">Become a Source</a></li><li class=\"Header-menu-list-item\" data-reactid=\"76\"><a class=\"Header-menu-link\" href=\"/newsletter/?source=web_hamburger\" data-reactid=\"77\">Join Newsletter</a></li></ul><div class=\"Header-footer\" data-reactid=\"78\"><div class=\"Header-social-links\" data-reactid=\"79\"><a class=\"Header-social-link\" target=\"_blank\" data-label=\"facebook\" href=\"https://www.facebook.com/theinterceptflm\" data-reactid=\"80\"><span class=\"Icon Icon--Facebook icon-TI_Facebook\" data-reactid=\"81\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"twitter\" href=\"https://twitter.com/theintercept\" data-reactid=\"82\"><span class=\"Icon Icon--Twitter icon-TI_Twitter\" data-reactid=\"83\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"instagram\" href=\"https://www.instagram.com/theintercept/\" data-reactid=\"84\"><span class=\"Icon Icon--Instagram icon-TI_Instagram\" data-reactid=\"85\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"tumblr\" href=\"https://the-intercept.tumblr.com\" data-reactid=\"86\"><span class=\"Icon Icon--Tumblr icon-TI_Tumblr\" data-reactid=\"87\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"snapchat\" href=\"https://www.snapchat.com/add/theintercept\" data-reactid=\"88\"><span class=\"Icon Icon--Snapchat icon-TI_Snapchat\" data-reactid=\"89\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"flipboard\" href=\"https://flipboard.com/@TheIntercept\" data-reactid=\"90\"><span class=\"Icon Icon--Flipboard icon-TI_Flipboard\" data-reactid=\"91\"></span></a><a class=\"Header-social-link\" data-label=\"rss\" href=\"/feeds/\" data-reactid=\"92\"><span class=\"Icon Icon--RSS icon-TI_RSS\" 
data-reactid=\"93\"></span></a></div><img class=\"Header-FLM-svg\" src=\"/static/FLM.svg\" alt=\"First Look Media logo\" data-reactid=\"94\"/><p class=\"Header-TM\" data-reactid=\"95\">The Intercept is a First Look Media Company.</p><cite class=\"Header-copyright\" data-reactid=\"96\"><!-- react-text: 97 -->\u00a9 First Look Media. <!-- /react-text --><span data-reactid=\"98\">All rights reserved</span></cite><ul class=\"Header-footer-list\" data-reactid=\"99\"><li class=\"Header-footer-list-item\" data-reactid=\"100\"><a class=\"Header-footer-link\" data-label=\"terms of use\" href=\"/terms-use/\" data-reactid=\"101\"><span data-reactid=\"102\">Terms of use</span></a></li><li class=\"Header-footer-list-item\" data-reactid=\"103\"><a class=\"Header-footer-link\" data-label=\"privacy policy\" href=\"/privacy-policy/\" data-reactid=\"104\"><span data-reactid=\"105\">Privacy</span></a></li></ul></div></nav></span></div><div class=\"ErrorPage\" data-reactid=\"106\"><div class=\"Logo\" data-reactid=\"107\"><div class=\"Logo-block\" data-reactid=\"108\"><a href=\"/\" data-reactid=\"109\"><span class=\"Logo-fallback\" style=\"color:#fff;\" data-reactid=\"110\"><!-- react-text: 111 -->The<!-- /react-text --><br data-reactid=\"112\"/><!-- react-text: 113 -->Intercept_<!-- /react-text --><span data-reactid=\"114\"><br data-reactid=\"115\"/><!-- react-text: 116 --><!-- /react-text --></span></span><svg class=\"Logo-svg\" height=\"50px\" version=\"1.1\" viewBox=\"0 0 140 50\" width=\"140px\" data-reactid=\"117\"><g data-reactid=\"118\"><path class=\"Logo-path\" d=\"M51.731,30.458c1.246,0,2.264,1.425,2.264,3.206l-4.605,0.56C49.517,31.781,50.28,30.458,51.731,30.458 M40.789,8.601 c1.247,0,2.265,1.424,2.265,3.206l-4.606,0.559C38.575,9.924,39.339,8.601,40.789,8.601 M92.774,30.458 c1.247,0,2.264,1.425,2.264,3.206l-4.605,0.56C90.56,31.781,91.323,30.458,92.774,30.458 M128.295,46.463H140v-2.188h-11.705 V46.463z 
M106.642,31.679c0.279-0.101,0.61-0.178,1.272-0.178c2.544,0,4.275,1.68,4.275,5.42c0,3.104-1.705,5.216-4.173,5.216 c-0.408,0-0.891-0.076-1.374-0.229V31.679z M68.652,33.206h3.18v-4.097c-0.61-0.254-1.017-0.356-1.603-0.356 c-0.992,0-2.188,0.662-3.435,1.654l-0.916,0.713v-2.367h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h9.313v-0.865 l-2.367-0.763v-9.39c0.611-0.229,1.68-0.407,1.934-0.407L68.652,33.206z M80.484,28.753c-3.995,0-8.372,2.494-8.372,8.066 c0,4.631,3.079,6.82,6.412,6.82c1.146,0,2.469-0.153,4.402-0.967l2.341-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.545,0.382 c-3.155,0-5.547-1.883-5.547-6.132c0-2.952,1.705-4.555,3.257-4.555c0.153,0,0.331,0.025,0.484,0.102l1.399,2.646h2.926v-3.613 C83.741,29.16,82.138,28.753,80.484,28.753 M123.792,26.489h-1.171l-5.521,3.18v1.069l1.857,0.178v10.076 c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636l-0.178-0.941c-0.33,0.076-0.916,0.101-1.246,0.101 c-1.858,0-3.079-0.687-3.079-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M40.128,26.489h-1.171l-5.522,3.18v1.069l1.858,0.178 v10.076c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636L44.453,41.4 c-0.331,0.076-0.916,0.101-1.247,0.101c-1.857,0-3.078-0.687-3.078-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M58.626,35.293 v-0.407c0-2.189-1.476-6.133-6.437-6.133c-3.766,0-7.558,2.494-7.558,8.066c0,4.555,3.334,6.82,6.972,6.82 c1.069,0,2.621-0.178,4.428-0.967l2.264-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.596,0.382c-2.799,0-5.878-1.628-6.005-5.852 H58.626z M99.67,35.293v-0.407c0-2.189-1.476-6.133-6.438-6.133c-3.766,0-7.557,2.494-7.557,8.066c0,4.555,3.333,6.82,6.972,6.82 c1.068,0,2.62-0.178,4.427-0.967l2.265-0.993l-0.331-0.916c-0.585,0.229-1.4,0.382-2.596,0.382c-2.798,0-5.877-1.628-6.005-5.852 H99.67z M47.685,13.435v-0.407c0-2.188-1.476-6.132-6.438-6.132c-3.766,0-7.557,2.493-7.557,8.066c0,4.555,3.333,6.819,6.972,6.819 c1.069,0,2.621-0.178,4.427-0.967l2.265-0.992l-0.331-0.916c-0.585,0.229-1.399,0.382-2.595,0.382 
c-2.799,0-5.878-1.629-6.005-5.853H47.685z M6.438,25.598v15.725L3.97,42.265v0.992h10.865v-0.992l-2.468-0.942V25.598l2.468-0.942 v-0.992H3.97v0.992L6.438,25.598z M31.781,41.629V33.74c0-2.926-1.094-4.987-4.045-4.987c-1.222,0-2.138,0.382-3.334,0.916 l-2.163,0.942v-1.858h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h8.804v-0.865l-1.858-0.763v-9.924 c0.458-0.127,1.247-0.204,1.833-0.204c1.577,0,2.875,0.865,2.875,3.461v6.667l-1.858,0.763v0.865h8.804v-0.865L31.781,41.629z M106.642,28.753h-0.662l-6.285,1.272v0.967l2.112,0.637v15.598l-2.112,0.763v0.865h9.567V47.99l-2.62-0.763v-3.588h0.992 c5.954,0,9.287-3.614,9.287-8.626c0-4.301-3.002-6.26-5.241-6.26c-0.891,0-2.443,0.535-3.868,1.247l-0.967,0.483h-0.203V28.753z M31.527,19.771v-7.888c0-2.926-1.094-4.987-4.046-4.987c-1.221,0-2.137,0.381-3.333,0.916l-2.163,0.941V0h-0.662l-4.911,1.807H0 v4.911h2.316L3.257,3.69h3.181v15.776L3.97,20.407V21.4h10.61v-0.993l-2.213-0.941V3.69h4.783v16.081l-1.857,0.763V21.4h8.55 v-0.866l-1.858-0.763V9.873c0.458-0.127,1.247-0.229,1.832-0.229c1.578,0,2.875,0.865,2.875,3.46v6.667l-1.857,0.763V21.4h8.804 v-0.866L31.527,19.771z\" fill=\"#fff\" data-reactid=\"119\"></path></g></svg></a></div></div><div class=\"GridContainer\" data-reactid=\"120\"><div class=\"GridRow\" data-reactid=\"121\"><div class=\"ErrorPage-container\" data-reactid=\"122\"><h2 class=\"ErrorPage-pagetitle\" data-reactid=\"123\">Error 404</h2><h1 class=\"ErrorPage-title\" data-reactid=\"124\">Page not found</h1><p class=\"ErrorPage-text\" data-reactid=\"125\"><!-- react-text: 126 -->We couldn\u2019t find anything at this address. 
Please check the URL or go to the <!-- /react-text --><a href=\"/\" data-reactid=\"127\">homepage</a><!-- react-text: 128 -->.<!-- /react-text --></p></div></div></div></div><!-- react-empty: 129 --><div style=\"display:none;\" data-reactid=\"130\"><svg\n xmlns=\"http://www.w3.org/2000/svg\"\n xmlns:xlink=\"http://www.w3.org/1999/xlink\"\n height=\"500\"\n width=\"500\"\n viewBox=\"0 0 500 500\"\n aria-labelledby=\"title desc\"\n>\n <title id=\"title\">Filters SVG</title>\n <defs>\n <filter id=\"bleed\" filterUnits=\"objectBoundingBox\">\n <feColorMatrix\n type=\"matrix\"\n values=\"1 0 0 0 0, 0 0.15 0 0 0, 0 0 .20 0 0, 0 0 0 1 0\"\n />\n </filter>\n </defs>\n</svg></div><div class=\"ThirdPartySlot\" id=\"third-party--viewport-top\" data-reactid=\"131\"></div><div class=\"ThirdPartySlot\" id=\"third-party--viewport-takeover\" data-reactid=\"132\"></div><div class=\"ThirdPartySlot\" id=\"third-party--viewport-bottom\" data-reactid=\"133\"></div></div></div>\n <script>\n window.initialStoreTree = {\"bodyClasses\":[],\"categoryPostIDs\":{},\"commentsExpanded\":{},\"contentLanguage\":null,\"dispatcher\":{\"backend\":null,\"node\":null,\"type\":null},\"documentCloud\":{\"document\":{},\"embedUrl\":\"\",\"text\":{}},\"documentIDs\":[],\"error\":{\"message\":\"Page not 
found\",\"status\":404},\"featureIDs\":[],\"featuresLanguage\":{},\"googleAMPUrl\":\"\",\"hamburgerColor\":null,\"host\":\"theintercept.com\",\"inInitialRender\":false,\"languageLanding\":{},\"liveBlogsUpdatesIDs\":{},\"loading\":false,\"mediaPlayer\":null,\"newsletter\":{\"form\":{\"description\":\"\",\"status\":\"\"}},\"podcastPage\":{},\"podcastsHomepage\":{\"speakingIDs\":[]},\"postLanding\":{\"redirect\":null},\"postsMetaIDs\":{},\"resources\":{\"alerts\":{},\"annotationSets\":{},\"annotations\":{},\"categories\":{},\"comments\":{},\"documents\":{},\"liveBlogs\":{},\"platform\":{\"theintercept\":{\"Article\":{},\"Author\":{},\"Document\":{},\"GeoLocation\":{},\"HttpReturn\":{},\"Podcast\":{},\"PodcastEpisode\":{},\"Section\":{},\"bySpeakingID\":{},\"documentArchives\":{},\"documentReleases\":{},\"nodePromos\":{\"default\":{}}},\"theintercept-brasil\":{\"HttpReturn\":{},\"bySpeakingID\":{},\"latestPromos\":[],\"nodePromos\":{\"default\":{}}}},\"postCommentMeta\":{},\"posts\":{},\"promoBanners\":{},\"series\":{},\"seriesDocuments\":{},\"staff\":{},\"taxonomies\":{},\"timeline\":{}},\"reverseChronIDs\":{},\"route\":{\"names\":[],\"params\":{},\"path\":\"\",\"pathname\":\"\",\"query\":{}},\"routed\":false,\"scrollToPostComments\":null,\"searchResultIDs\":[],\"seriesHomepage\":{\"curatedItems\":{\"postIds\":[],\"seriesSlugs\":[]},\"recentItems\":{\"postIds\":[],\"seriesSlugs\":[]}},\"seriesPostIDs\":{},\"sidToday\":{\"search\":{\"lastCursor\":null,\"loading\":false,\"speakingIDs\":[],\"totalCount\":null}},\"sidTodayFilesUpdateReports\":{\"category\":{}},\"specialSeriesItems\":[],\"squirrelDocumentIDs\":[],\"squirrelIDs\":[],\"staffIDs\":[],\"surveillanceCatalogData\":null,\"surveillanceCatalogVendors\":null,\"tocChapters\":{},\"tracking\":{\"currentUrl\":\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\",\"previousUrl\":null}};\n window.config = 
{\"assets\":{\"host\":\"\",\"webpack\":false},\"aws_static\":\"https://static.theintercept.com\",\"coral_talk_api_origin\":\"https://talk.theintercept.com\",\"coral_talk_origin\":\"https://talk.theintercept.com\",\"coral_talk_permalink_cutover_date\":\"2019-06-24T14:00:00.000Z\",\"donation_base_url\":\"https://join.theintercept.com/donate/\",\"donation_base_url_brasil\":\"http://catarse.me/intercept/\",\"env\":{\"NODE_ENV\":\"production\"},\"facebook\":{\"tracking_pixel_id\":\"2151258874911575\"},\"google\":{\"id\":\"UA-79475609-15\"},\"graphql_realm_id\":\"UmVhbG06NTJiOWMwOGEtMjQwYS00NzMxLThlYTAtMjMyY2RiYTYwNzBh\",\"graphql_realm_id__brasil\":\"UmVhbG1Db250ZW50OjE3NTU4MzMyLWUwZTQtNDIyYy1iNDcyLWZkMDQzMmRiOGRhYw==\",\"graphql_url\":\"http://read.usq.flmcloud.local:3002/graphql\",\"hash\":\"42e762a729b53f810f04\",\"host\":\"theintercept.com\",\"imgix\":{\"additional_origins\":[\"https://firstlook.org\"],\"domain\":\"theintercept.imgix.net\"},\"logs\":{\"level\":\"info\"},\"onsite_origins\":[\"https://theintercept.com\"],\"origin\":\"https://theintercept.com\",\"override_private_wp_host\":\"theintercept.com\",\"parsely\":{\"endpoint\":\"https://c.prod.theintercept.com/a\",\"site_id\":\"theintercept.com\"},\"piano\":{\"application_id\":\"hsZyoAWmIE\",\"origin\":\"https://o.prod.theintercept.com\"},\"port\":8080,\"private_wp_origin\":\"https://wp.theintercept.com\",\"public_api_origin\":\"https://theintercept.com\",\"public_wp_origin\":\"https://theintercept.com\",\"request_timeout\":30000,\"set_headers\":{\"Access-Control-Allow-Origin\":\"*\",\"Referrer-Policy\":\"strict-origin-when-cross-origin\",\"Strict-Transport-Security\":\"max-age=63072000; includeSubDomains; preload\",\"Vary\":\"Accept-Encoding\",\"X-Content-Type-Options\":\"nosniff\",\"X-Frame-Options\":\"SAMEORIGIN\",\"X-Xss-Protection\":\"1; mode=block\"},\"site_prefix\":\"\"};\n window.__COUNTRY_CODE__ = \"US\";\n </script>\n \n \n <script type=\"application/ld+json\">\n 
{\"url\":\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"}\n </script>\n \n <div id=\"parsely-root\" style=\"display: none\">\n <div id=\"parsely-cfg\" data-parsely-site=\"theintercept.com\"></div>\n </div>\n \n <script src=\"/assets/app42e762a729b53f810f04.js\"></script>\n </body>\n</html>\n"
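You are loading the file outside the `with` block, where the file is already closed, so you just end up with an empty soup.
Also, because you are dealing with broken HTML in which the value of rel is literally \"canonical\" (backslashes and quotes included), you have to account for that: either specify that exact value explicitly, or use the * substring match in the selector.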
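from bs4 import BeautifulSoup

# Parse the file while it is still open, inside the with block.
with open('a.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

# rel*=canonical is a substring match, so it also matches the value \"canonical\".
for i in soup.select('link[rel*=canonical]'):
    print(i['href'])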
Output:
\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"/
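The value printed above still carries the escaped quotes from the file, plus the stray `/` left over from the `/>` that closes the tag. If a clean URL is needed, a small post-processing step along these lines may help. This is only a sketch: `clean_href` is a hypothetical helper name, not part of BeautifulSoup, and it assumes every file wraps attribute values in backslash-quote pairs exactly as in the sample.

```python
def clean_href(raw: str) -> str:
    """Remove the leftover backslash-quote wrappers around an extracted URL.

    Assumption: the value looks like \\"URL\\" or \\"URL\\"/ as in the sample file.
    """
    value = raw.strip()
    # The '/' of the self-closing '/>' gets glued onto the value; drop it
    # only when it directly follows the closing escaped quote.
    if value.endswith('\\"/'):
        value = value[:-1]
    # Remove the literal backslash-quote pairs that wrap the URL.
    return value.replace('\\"', '')


# The string below is the raw value shown in the output above.
raw = '\\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\\"/'
print(clean_href(raw))
# prints: https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/
```

In the loop above this would be used as `print(clean_href(i['href']))`. Cleaning the extracted value rather than the whole file keeps the change local, at the cost of repeating it for every attribute you read.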