如何将 http get 响应从 html 转换为 json 格式(来自 kaggle.com)

How to convert the http get response from html to json format (from kaggle.com)

我尝试使用以下代码从 kaggle.com 获取 http 响应。 Kaggle 响应采用 html 格式,我想将其转换为 json 格式以便于进一步处理。

import requests
import json

username = 'vgtgayan'
base_url = 'https://www.kaggle.com/'
url = base_url+str(username)

r = requests.get(url)
print(r.status_code)
print(r.r.headers["content-type"])

输出:

200
text/html; charset=utf-8

以上代码适用于以下所有方法。

尝试 1:

r.json()

错误:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

尝试 2:

import xmltodict
import xmltojson

json_ = xmltojson.parse(r.text)

错误:

ExpatError: not well-formed (invalid token): line 22, column 70

尝试 3:

import xml.etree.ElementTree
dict = xmltodict.parse(ElementTree.tostring(ElementTree.parse(path).getroot()))

错误:

ParseError: not well-formed (invalid token):

参考:

以上每一次尝试都以错误告终。 请帮我完成这个任务。

下面是我收到的 html 回复,

<!DOCTYPE html>

<html lang="en">



<head>

  <title>V.G.T. Gayan | Novice | Kaggle</title>

  <meta charset="utf-8" />

  <meta name="robots" content="index, follow" />

  <meta name="description" content="Kaggle profile for V.G.T. Gayan" />

  <meta name="turbolinks-cache-control" content="no-cache" />

    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, minimum-scale=1.0">

  <meta name="theme-color" content="#008ABC" />

  <script nonce="VmwzDvxUO596o90cb7qq6Q==" type="text/javascript">

    window["pageRequestStartTime"] = 1641792835393;

    window["pageRequestEndTime"] = 1641792835435;

    window["initialPageLoadStartTime"] = new Date().getTime();

  </script>

  <link rel="preconnect" href="https://www.google-analytics.com" crossorigin="anonymous" /><link rel="preconnect" href="https://stats.g.doubleclick.net" /><link rel="preconnect" href="https://storage.googleapis.com" /><link rel="preconnect" href="https://apis.google.com" />

  <link href="/static/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />

  <link rel="manifest" href="/static/json/manifest.json" crossorigin="use-credentials">





  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />

  <link href="https://fonts.googleapis.com/icon?family=Google+Material+Icons"

    rel="preload" as="style" />

  <link href="https://fonts.googleapis.com/css?family=Inter:400,400i,500,500i,600,600i,700,700i"

    rel="preload" as="style" />

  <link href="https://fonts.googleapis.com/icon?family=Google+Material+Icons"

    rel="stylesheet" media="print" id="async-google-font-1" />

  <link href="https://fonts.googleapis.com/css?family=Inter:400,400i,500,500i,600,600i,700,700i"

    rel="stylesheet" media="print" id="async-google-font-2" />

  <script nonce="VmwzDvxUO596o90cb7qq6Q==" type="text/javascript">

    const styleSheetIds = ["async-google-font-1", "async-google-font-2"];

    styleSheetIds.forEach(function (id) {

      document.getElementById(id).addEventListener("load", function() {

        this.media = "all";

      });

    });

  </script>



    <link rel="stylesheet" type="text/css" href="/static/assets/vendor.css?v=a39c9d14b7e6072d0f7a" />

    <link rel="stylesheet" type="text/css" href="/static/assets/app.css?v=453de5392c911bfbccf0" />

  

    

 

      <script nonce="VmwzDvxUO596o90cb7qq6Q==">

        try{(function(a,s,y,n,c,h,i,d,e){d=s.createElement("style");

        d.appendChild(s.createTextNode(""));s.head.appendChild(d);d=d.sheet;

        y=y.map(x => d.insertRule(x + "{ opacity: 0 !important }"));

        h.start=1*new Date;h.end=i=function(){y.forEach(x => x<d.cssRules.length ? d.deleteRule(x) : {})};

        (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;

        })(window,document,['.site-header-react__nav'],'dataLayer',2000,{'GTM-52LNT9S':true});}catch(ex){}

    </script>

    <script nonce="VmwzDvxUO596o90cb7qq6Q==">

        window.dataLayer = window.dataLayer || [];

        function gtag() { dataLayer.push(arguments); }

        gtag('js', new Date());

        gtag('config', 'UA-12629138-1', {

            'optimize_id': 'GTM-52LNT9S',

            'displayFeaturesTask': null,

            'send_page_view': false,

            'content_group1': 'Users'

        });

    </script>

    <script nonce="VmwzDvxUO596o90cb7qq6Q==" async src="https://www.googletagmanager.com/gtag/js?id=UA-12629138-1"></script>



  

    

    <meta property="og:url" content="https://www.kaggle.com/vgtgayan/home" />

    <meta property="og:title" content="V.G.T. Gayan | Novice" />

    <meta property="og:description" content="Kaggle profile for V.G.T. Gayan" />

    <meta property="og:type" content="profile" />

    <meta property="og:username" content="vgtgayan" />

    <meta name="twitter:card" content="summary" /> 

    <meta name="twitter:image" content="https://www.kaggle.com/static/images/tiers/Novice@192.png" />

    <meta name="twitter:image:alt" content="Novice" />





  <meta name="twitter:site" content="@Kaggle" /> 

  

    



  

    



  

    

    

<script nonce="VmwzDvxUO596o90cb7qq6Q==" type="text/javascript">

    var Kaggle = window.Kaggle || {};



    Kaggle.Current = {

        antiForgeryToken: 'CfDJ8LdUzqlsSWBPr4Ce3rb9VL9za2FFB-rFm9iuPAm8PKLq9TqJiAHT4UmlnyzfLtHBrGjL5o2brrzKiVOqcKYybNKzrGn4e1wSl09TIhcFoh_3ivZ3ndFbLCUM9zaxUWdhs8JzSLYm5l9RM1a-bbh55aw',

        isAnonymous: true,

        analyticsToken: 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NDE3OTM3MzUsIlVzZXJJZCI6MH0.vEUaJcUG88jwDqImiQoQtQU8-jui4Gdp6xihpLlls0U',

        analyticsTokenExpiry: 15,

        

        

        

        

        

        

        

        enableRapidash: true, 

    }

        Kaggle.Current.log = function(){};

        Kaggle.Current.warn = function(){};



    var decodeUserDisplayName = function () {

        var escapedUserDisplayName = Kaggle.Current.userDisplayNameEscaped || "";

        try {

            var textVersion = new DOMParser().parseFromString(escapedUserDisplayName, "text/html").documentElement.textContent;

            if (textVersion) {

                return textVersion;

            }

        } catch(ex) {}

        return escapedUserDisplayName;

    }

    Kaggle.Current.userDisplayName = decodeUserDisplayName();

</script>



    



<script nonce="VmwzDvxUO596o90cb7qq6Q==" type="text/javascript">

    var Kaggle = window.Kaggle || {};

    Kaggle.PageMessages = [];

</script>







    <script nonce="VmwzDvxUO596o90cb7qq6Q==">window['useKaggleAnalytics'] = true;</script>



  <script id="gapi-target" nonce="VmwzDvxUO596o90cb7qq6Q==" src="https://apis.google.com/js/api.js" defer

    async></script>

  <script nonce="VmwzDvxUO596o90cb7qq6Q==" src="/static/assets/runtime.js?v=f84cf2a36564689cb2b3" data-turbolinks-track="reload"></script>

  <script nonce="VmwzDvxUO596o90cb7qq6Q==" src="/static/assets/vendor.js?v=26a3487f6f4d622267d2" data-turbolinks-track="reload"></script>

  <script nonce="VmwzDvxUO596o90cb7qq6Q==" src="/static/assets/app.js?v=36d0e4e3f41dd27e5515" data-turbolinks-track="reload"></script>

    <script nonce="VmwzDvxUO596o90cb7qq6Q==" type="text/javascript">

      window.kaggleStackdriverConfig = {

        key: 'AIzaSyA4eNqUdRRskJsCZWVz-qL655Xa5JEMreE',

        projectId: 'kaggle-161607',

        service: 'web-fe',

        version: 'ci',

        userId: '0'

      }

    </script>

</head>



<body data-turbolinks="false">

  <main>

    













<div id="site-container"></div>

<div data-component-name="NavigationContainer" style="display: flex; flex-direction: column; flex: 1 0 auto;"></div><script class="kaggle-component" nonce="VmwzDvxUO596o90cb7qq6Q==">var Kaggle=window.Kaggle||{};Kaggle.State=Kaggle.State||[];Kaggle.State.push({"navigationType":"BOTH_NAV"});performance && performance.mark && performance.mark("NavigationContainer.componentCouldBootstrap");</script>

<div id="site-body" class="hide">

    

<div data-component-name="ProfileContainerReact" style="display: flex; flex-direction: column; flex: 1 0 auto;"></div><script class="kaggle-component" nonce="VmwzDvxUO596o90cb7qq6Q==">var Kaggle=window.Kaggle||{};Kaggle.State=Kaggle.State||[];Kaggle.State.push({"userId":8257495,"displayName":"V.G.T. Gayan","country":"Sri Lanka","region":"Western Province","city":"Colombo","gitHubUserName":null,"twitterUserName":null,"linkedInUrl":null,"websiteUrl":null,"occupation":"Research \u0026 Development Engineer","organization":"Synopsys Inc","bio":null,"userLastActive":"2022-01-10T03:58:03.07Z","userJoinDate":"2021-09-01T11:53:55.643Z","performanceTier":"novice","performanceTierCategory":"competitions","activePaneTier":"novice","activePaneCategory":"unspecified","userUrl":"/vgtgayan","userAvatarUrl":"https://storage.googleapis.com/kaggle-avatars/images/default-thumb.png","email":null,"canEdit":false,"canCreateDatasets":true,"userName":"vgtgayan","activePane":"home","totalDatasets":0,"totalOrganizations":0,"competitionsSummary":{"tier":"novice","totalResults":0,"rankPercentage":0.9691485,"rankOutOf":173152,"rankCurrent":null,"rankHighest":null,"totalGoldMedals":0,"totalSilverMedals":0,"totalBronzeMedals":0,"highlights":[],"summaryType":"competitions"},"scriptsSummary":{"tier":"novice","totalResults":1,"rankPercentage":0.1549871,"rankOutOf":203004,"rankCurrent":null,"rankHighest":null,"totalGoldMedals":0,"totalSilverMedals":0,"totalBronzeMedals":0,"highlights":[],"summaryType":"notebooks"},"discussionsSummary":{"tier":"novice","totalResults":0,"rankPercentage":0.022529345,"rankOutOf":267429,"rankCurrent":null,"rankHighest":null,"totalGoldMedals":0,"totalSilverMedals":0,"totalBronzeMedals":0,"highlights":[],"summaryType":"discussion"},"datasetsSummary":{"tier":"novice","totalResults":0,"rankPercentage":0.13677031,"rankOutOf":54098,"rankCurrent":null,"rankHighest":null,"totalGoldMedals":0,"totalSilverMedals":0,"totalBronzeMedals":0,"highlights":[],"summaryType":"datasets"},"pageMessages":null,"followers":{"type":"following","count":0,"list":[],"containsSelf":false,"maxCountReached":false},"following":{"type":"following","count":0,"list":[],"containsSelf":false,"maxCountReached":false},"canSeeFollowers":false,"canSeeCallToAction":true,"canSeeNotifications":true,"canSeeAtMentions":true,"totalScripts":1,"isAdmin":false,"isEditing":false,"userAllowsUserMessages":true,"@wf": "Users.Models.ProfileDtoWireFormat"});performance && performance.mark && performance.mark("ProfileContainerReact.componentCouldBootstrap");</script>

<script nonce="VmwzDvxUO596o90cb7qq6Q==" type="text/x-mathjax-config">

    MathJax.Hub.Config({

    "HTML-CSS": {

    preferredFont: "TeX",

    availableFonts: ["STIX", "TeX"],

    linebreaks: {

    automatic: true

    },

    EqnChunk: (MathJax.Hub.Browser.isMobile ? 10 : 50)

    },

    tex2jax: {

    inlineMath: [["\(", "\)"], ["\\(", "\\)"]],

    displayMath: [["$$", "$$"], ["\[", "\]"]],

    processEscapes: true,

    ignoreClass: "tex2jax_ignore|dno"

    },

    TeX: {

    noUndefined: {

    attributes: {

    mathcolor: "red",

    mathbackground: "#FFEEEE",

    mathsize: "90%"

    }

    }

    },

    Macros: {

    href: "{}"

    },

    skipStartupTypeset: true,

    messageStyle: "none",

    extensions: [],

    });

</script>

<script type="text/javascript" nonce="VmwzDvxUO596o90cb7qq6Q==">

  window.addEventListener("DOMContentLoaded", () => {

    const head = document.getElementsByTagName("head")[0];

    const useProdHosts = ["www.kaggle.com", "admin.kaggle.com"];

    const subdomain = useProdHosts.includes(window.location.hostname) ? "www" : "staging";



    const lib = document.createElement("script");

    lib.type = "text/javascript";

    lib.src = `https://${subdomain}.kaggleusercontent.com/static/mathjax/2.7.9/MathJax.js?config=TeX-AMS-MML_HTMLorMML`;

    head.appendChild(lib);

  });

</script>





</div>









  </main>

</body>



</html>

你可以使用html_to_json

import requests
import json
import html_to_json
username = 'vgtgayan'
base_url = 'https://www.kaggle.com/'
url = base_url+str(username)
r = requests.get(url)
html_response = requests.get(url)
html_string = html_response.text
output_json = html_to_json.convert(html_string)
print(output_json)

使用这个 html_to_json 包 pip install html-to-json

import requests
import json
from bs4 import BeautifulSoup
import html_to_json

username = 'vgtgayan'
base_url = 'https://www.kaggle.com/'
url = base_url+str(username)

r = requests.get(url)
print(r.status_code)

html_string = r.text
output_json = html_to_json.convert(html_string)
print(output_json)