Clean Dates from scraping
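I am scraping an HTML table, and the resulting DataFrame has a "Date" column that needs to be cleaned and formatted.
My goal is to convert this column into a date column.
All I want to do after this step is clean the Date column and convert it to a pandas datetime column.
Any help?
Here is how the table is generated: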
## web scraping
import requests
import lxml.html as lh
import pandas as pd
url='https://markets.ft.com/data/funds/tearsheet/historical?s=LU0841585341:GBP'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
i=0
#For each cell of the first row, store its text (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    # print('%d:"%s"'%(i,name))
    col.append((name,[]))
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    #If row is not of size 6, the //tr data is not from our table
    if len(T)!=6:
        break
    #i is the index of our column
    i=0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Skip the first column (Date); only the other columns may be numeric
        if i>0:
            #Convert integer-like values to int; anything else stays a string
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()
You can convert the string to a datetime like this:
from datetime import datetime
d='September 10, 2021Fri, Sep 10, 2021'
print(datetime.strptime(''.join(d.split(',')[-2:]), ' %b %d %Y'))
Output: 2021-09-10 00:00:00
The steps used above are:
- How to get last items of a list in Python?
- How to convert list to string
- Convert a string into datetime
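The three steps above can be spelled out one at a time. This is only a sketch: `parse_scraped_date` is a made-up helper name, the second sample string is invented in the same shape as the first, and `%b` assumes an English locale for the month abbreviation.

```python
from datetime import datetime

def parse_scraped_date(raw: str) -> datetime:
    """Parse the short date at the end of a fused 'long date + short date' string."""
    # Step 1: split on commas and keep the last two pieces, e.g. [' Sep 10', ' 2021']
    tail_parts = raw.split(',')[-2:]
    # Step 2: join the pieces back into one string, e.g. ' Sep 10 2021'
    tail = ''.join(tail_parts)
    # Step 3: strip the leading space and parse with an explicit format
    return datetime.strptime(tail.strip(), '%b %d %Y')

# First sample is the value from the answer; the second is a hypothetical row
samples = [
    'September 10, 2021Fri, Sep 10, 2021',
    'August 31, 2021Tue, Aug 31, 2021',
]
parsed = [parse_scraped_date(s) for s in samples]
print(parsed[0])
```

To apply it to the whole column, something like `df["Date"] = df["Date"].map(parse_scraped_date)` should work, provided every row has the same fused shape.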
You can do it like this:
df["Date"] = pd.to_datetime(
    df["Date"].str.replace(r"(\d+)([A-Z].*)", r"\1", regex=True)
)
print(df)
Prints:
         Date   Open   High    Low  Close  Volume
0  2021-09-10  27.28  27.28  27.28  27.28    ----
1  2021-09-09  27.35  27.35  27.35  27.35    ----
2  2021-09-08  27.42  27.42  27.42  27.42    ----
3  2021-09-07  27.54  27.54  27.54  27.54    ----
4  2021-09-03  27.44  27.44  27.44  27.44    ----
5  2021-09-02  27.48  27.48  27.48  27.48    ----
6  2021-09-01  27.26  27.26  27.26  27.26    ----
7  2021-08-31  27.31  27.31  27.31  27.31    ----
8  2021-08-30  27.46  27.46  27.46  27.46    ----
9  2021-08-27  27.32  27.32  27.32  27.32    ----
10 2021-08-26  27.23  27.23  27.23  27.23    ----
11 2021-08-25  27.27  27.27  27.27  27.27    ----
12 2021-08-24  27.22  27.22  27.22  27.22    ----
13 2021-08-23  27.05  27.05  27.05  27.05    ----
14 2021-08-20  26.92  26.92  26.92  26.92    ----
15 2021-08-19  26.58  26.58  26.58  26.58    ----
16 2021-08-18  26.62  26.62  26.62  26.62    ----
17 2021-08-17  26.63  26.63  26.63  26.63    ----
18 2021-08-16  26.56  26.56  26.56  26.56    ----
19 2021-08-13  26.77  26.77  26.77  26.77    ----
20 2021-08-12  26.67  26.67  26.67  26.67    ----
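To try the regex approach without hitting the website, here is a small self-contained sketch. The two-row frame is a stand-in for the scraped table: the first Date value is the sample string quoted earlier in this thread, the second is invented in the same shape. The idea is to keep the digits just before the fused weekday (group 1) and drop everything from the capital letter onward.

```python
import pandas as pd

# Stand-in for the scraped table (hypothetical rows in the same shape)
df = pd.DataFrame({
    "Date": [
        "September 10, 2021Fri, Sep 10, 2021",
        "September 09, 2021Thu, Sep 09, 2021",
    ],
    "Close": ["27.28", "27.35"],
    "Volume": ["----", "----"],
})

# Keep the year (group 1) and drop the fused short date (group 2)
cleaned = df["Date"].str.replace(r"(\d+)([A-Z].*)", r"\1", regex=True)
df["Date"] = pd.to_datetime(cleaned)

# Optional extra step, not part of the answer: the scraper leaves prices as
# strings, so convert them; placeholders such as "----" become NaN
df["Close"] = pd.to_numeric(df["Close"], errors="coerce")
df["Volume"] = pd.to_numeric(df["Volume"], errors="coerce")

print(df.dtypes)
```

After this, `df["Date"]` has a datetime64 dtype, so it can be used for sorting, resampling, or as the index.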