How to construct a DataFrame from web scraping in Python
I can fetch data from a web page with web scraping in Python. The data ends up in a list, but I don't know how to convert that list into a DataFrame. Is there a way to scrape the data and load it directly into a df?
Here is my code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
from pandas import DataFrame
import lxml
# GET the response from the web page using requests library
res = requests.get("https://www.worldometers.info/coronavirus/")
# PARSE the content using BeautifulSoup from the bs4 library
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
# Here dumping the fetched data to have a look
print( tabulate(df[0], headers='keys', tablefmt='psql') )
print(df[0])
Well, read_html returns a list of DataFrames (per the documentation), so you have to take the first (and only) element of that list. I would add at the end (after your call to read_html):
df = df[0]
Then you can inspect it and get:
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 207 entries, 0 to 206
# Data columns (total 10 columns):
# Country,Other 207 non-null object
# TotalCases 207 non-null int64
# NewCases 59 non-null object
# TotalDeaths 144 non-null float64
# NewDeaths 31 non-null float64
# TotalRecovered 154 non-null float64
# ActiveCases 207 non-null int64
# Serious,Critical 112 non-null float64
# Tot Cases/1M pop 205 non-null float64
# Deaths/1M pop 142 non-null float64
# dtypes: float64(6), int64(2), object(2)
# memory usage: 16.3+ KB
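Since read_html returns one DataFrame per <table> element on the page, you can also narrow the list down with its match parameter instead of taking index 0 blindly. A minimal sketch, assuming the main table's text contains the word "Country" (that keyword is just a guess for this page):

import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.worldometers.info/coronavirus/")
soup = BeautifulSoup(res.content, 'lxml')

# read_html parses every <table>; match keeps only tables whose text
# matches the given regex, so the returned list is easier to index.
# "Country" is an assumed keyword for the main table on this page.
tables = pd.read_html(str(soup), match="Country")
df = tables[0]
print(len(tables), df.shape)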
import requests
import pandas as pd
r = requests.get("https://www.worldometers.info/coronavirus/")
df = pd.read_html(r.content)[0]
print(type(df))
# <class 'pandas.core.frame.DataFrame'>
df.to_csv("data.csv", index=False)
Output: (screenshot of the resulting table)
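As a follow-up, the df.info() output above shows that columns such as NewCases load as object dtype, because the cells are formatted strings (something like "+1,234"; the exact format is an assumption here). A small sketch of cleaning such a column with pd.to_numeric:

import pandas as pd
import requests

r = requests.get("https://www.worldometers.info/coronavirus/")
df = pd.read_html(r.content)[0]

# NewCases arrives as formatted strings, so pandas keeps it as object dtype.
# Strip the "+" and thousands separators, then coerce to numbers;
# anything unparseable (e.g. empty cells) becomes NaN.
df["NewCases"] = pd.to_numeric(
    df["NewCases"].astype(str).str.replace(r"[+,]", "", regex=True),
    errors="coerce",
)
print(df["NewCases"].dtype)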