Extracting specific dates from strings
I am trying to extract some specific dates from text. The text looks like this:
'Shares of Luxury Goods Makers Slip on Russia Export Ban',
'By Investing.com\xa0-\xa0Mar 15, 2022 By Dhirendra Tripathi',
'Investing.com – Stocks of European retailers such as LVMH (PA:LVMH), Kering (PA:PRTP), H&M (ST:HMb), Moncler (MI:MONC) and Hermès (PA:HRMS) were all down around 4% Tuesday... ',
'',
'',
'',
' ',
'Europe Stocks Open Lower as Wider Sanctions, Covid Rebound Hit Mood',
'By Investing.com\xa0-\xa0Mar 15, 2022 By Geoffrey Smith\xa0',
'Investing.com -- European stock markets opened lower on Tuesday as a fresh round of EU sanctions, a rebound in Covid-19 cases and more signs of red-hot inflation all weighed on... ',
'',
'\xa0',
Obviously, in this small snippet I'd only want to extract: Mar 15, 2022 and Mar 15, 2022.
I have tried:
datefinder.find_dates(text)
dateutil.parser
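The first returns all the dates I want, plus a whole bunch of other dates that do not exist in the text.
The second raises "String does not contain a date:".
Can anyone think of the best way to do this?
You can use a regular expression: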
import re
line = r'By Investing.com\xa0-\xa0Mar 15, 2022 By Geoffrey Smith\xa0'
re_results = re.findall(r'[A-Z][a-z]{2} \d{1,2}, \d{4}', line)
for result in re_results:
    print(result)
Output:
Mar 15, 2022
You can test regular expressions here: https://regexr.com/
Using re:
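The pattern above accepts any capitalised three-letter word in front of the day, so a string such as "Ban 15, 2022" would also match. A minimal sketch of a stricter variant, assuming the dates always use English three-letter month abbreviations (the names `MONTHS`, `DATE_RE` and `find_dates` are made up for this illustration):

```python
import re

# Spell out the month abbreviations so that other capitalised
# three-letter words are not mistaken for a month.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# \b keeps the match from starting or ending in the middle of a word or number.
DATE_RE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")


def find_dates(text):
    """Return every 'Mon D, YYYY' substring found in text."""
    return DATE_RE.findall(text)


sample = 'By Investing.com\xa0-\xa0Mar 15, 2022 By Geoffrey Smith\xa0'
print(find_dates(sample))
```

The non-breaking space (`\xa0`) in front of "Mar" is not a word character, so the `\b` boundary still matches there.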
import re
x = ['Shares of Luxury Goods Makers Slip on Russia Export Ban',
'By Investing.com\xa0-\xa0Mar 15, 2022 By Dhirendra Tripathi',
'Investing.com – Stocks of European retailers such as LVMH (PA:LVMH), Kering (PA:PRTP), H&M (ST:HMb), Moncler (MI:MONC) and Hermès (PA:HRMS) were all down around 4% Tuesday... ',
'',
'',
'',
' ',
'Europe Stocks Open Lower as Wider Sanctions, Covid Rebound Hit Mood',
'By Investing.com\xa0-\xa0Mar 15, 2022 By Geoffrey Smith\xa0',
'Investing.com -- European stock markets opened lower on Tuesday as a fresh round of EU sanctions, a rebound in Covid-19 cases and more signs of red-hot inflation all weighed on... ',
'',
'\xa0', ]
for line in x:
    # search() returns the first match in the line, or None if there is none
    m = re.search(r'\w{3} \d{1,2}, \d{4}', line)
    if m:
        print(m.group())
Output:
Mar 15, 2022
Mar 15, 2022
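Note that this only matches dates of the form [3 letters] [1-2 digits], [4 digits]. Strictly speaking, \w also matches digits and underscores, so the first group is any three word characters rather than three letters.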
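If you need actual date objects rather than strings, one possible follow-up is to pass each regex match to `datetime.strptime` and discard anything that does not parse. This is only a sketch: the helper name `extract_dates` is hypothetical, and `%b` depends on the current locale, so it assumes English month abbreviations are in effect.

```python
import re
from datetime import datetime

# Candidate pattern: three letters, a 1-2 digit day, a comma, a 4-digit year.
DATE_RE = re.compile(r"[A-Za-z]{3} \d{1,2}, \d{4}")


def extract_dates(lines):
    """Collect every candidate that strptime accepts as a real date."""
    dates = []
    for line in lines:
        for m in DATE_RE.finditer(line):
            try:
                # Assumes an English locale for the %b month abbreviation.
                dates.append(datetime.strptime(m.group(), "%b %d, %Y").date())
            except ValueError:
                # Not a real month or day (e.g. "Foo 15, 2022"), so skip it.
                continue
    return dates


lines = [
    'By Investing.com\xa0-\xa0Mar 15, 2022 By Dhirendra Tripathi',
    'Europe Stocks Open Lower as Wider Sanctions, Covid Rebound Hit Mood',
    'By Investing.com\xa0-\xa0Mar 15, 2022 By Geoffrey Smith\xa0',
]
for d in extract_dates(lines):
    print(d.isoformat())
```

Letting `strptime` do the validation keeps the regex simple while still rejecting three-letter words that are not months and impossible days such as "Feb 30".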