How do I extract the date string "Mar 11, 2019 • 3:26AM" from a paragraph and convert it to a datetime format (dd/mm/yy) in Python?
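I have a paragraph containing details such as dates and comments, and I need to extract the date into a separate column. The paragraph sits in the column I am extracting the date from and looks like this: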
'Story\nFAQ\nUpdates 2\nComments 35\nby Antaio Inc\nMar 11, 2019 • 3:26AM\n2 years ago\nThank you all for an amazing start!\nHi all,\nWe just want to thank you all for an awesome start! This is our first ever Indiegogo campaign and we are very grateful for your support that helped us achieve a successful campaign.\nIn the next little while, we will be dedicating our effort on production and shipping of the awesome A-Buds and A-Buds SE. We plan to ship them to you as promised in the coming month.\nWe will send out more updates as we are approaching the key production dates.\nStay tuned!\nBest regards,\nAntaio Team\nby Antaio Inc\nJan 31, 2019 • 5:15AM\nover 2 years ago\nPre-Production Update\nDear all,\nWe want to take this opportunity to thank all of you for being our early backers. You guys rock! :)\nAs you may have noticed, the A-Buds are already in production stage, which means we have already completed all development and testing, and are now working on pre-production. Not only will you receive fully tested and certified awesome A-Buds after the campaign, we are also giving you the promise to deliver them on time! We are truly excited to have these awesome true Bluetooth 5.0 earbuds in your hands. We are sure you will love them!\nSo here is a quick sneak peek:\nMore to come. Stay tuned! :)\nFrom: Antaio Team\nRead More'
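A paragraph like this appears in every row of the dataset, in the column 'Project_Updates_Description'. I am trying to extract the first date in each entry.
The code I am currently using is: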
for i in df['Project_Updates_Description']:
    if type(i) == str:
        print(count)
        word = i.split('\n',7)
        count+=1
        if len(word) > 5:
            print(word[5])
            df['Date'] = word[5]
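The problem I am facing is that when I extract the date from the paragraph I get it as a string, and I need it in dd/mm/yyyy format. I tried methods like strptime, but that did not work because the value is still appended as a string. Also, when I try to append it to a new 'Date' column, I keep getting the same date for all entries. Can someone tell me where I am going wrong?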
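Assuming you have a dataframe with a column titled 'Project_Updates_Description' that contains the sample text, and you want to extract the first date and generate a datetime stamp from it, you can do the following: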
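import re

import pandas as pd
import numpy as np

def findDate(txin):
    # Matches a line that starts with e.g. "Mar 11, 2019"
    schptrn = r'^\w+ \d{1,2}, \d{4}'
    # Skip missing or non-text cells
    if not isinstance(txin, str):
        return np.nan
    lines = txin.split('\n')
    for line in lines:
        # re.search returns None when the line has no date,
        # so lines without a match are skipped instead of raising
        data = re.search(schptrn, line)
        if data:
            # Convert the first date found to a pandas Timestamp
            return pd.to_datetime(data.group(0))
    return np.nan

df['date'] = df.apply(lambda row: findDate(row['Project_Updates_Description']), axis = 1)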
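As for the repeated date in the question's loop: `df['Date'] = word[5]` assigns a single string to the entire column on every pass, so after the loop every row holds whatever was extracted last. A possible alternative, sketched below, builds the whole column in one step with `Series.str.extract` and then formats it as dd/mm/yyyy with `dt.strftime`. The three-row frame and the `date_str` column name are made up for illustration, and the pattern assumes the month is always written as a three-letter English abbreviation, as in the sample text.

```python
import re

import pandas as pd

# Hypothetical stand-in for the real dataset (shortened sample text).
df = pd.DataFrame({
    'Project_Updates_Description': [
        'Story\nFAQ\nUpdates 2\nComments 35\nby Antaio Inc\n'
        'Mar 11, 2019 • 3:26AM\n2 years ago\nThank you all',
        'Story\nFAQ\nby Antaio Inc\nJan 31, 2019 • 5:15AM\nover 2 years ago',
        None,  # a row with no text at all
    ]
})

# With re.MULTILINE, ^ matches at the start of every line in the cell,
# so the first line that begins with "Mon D, YYYY" is captured.
pattern = r'^([A-Za-z]{3} \d{1,2}, \d{4})'
first_date = df['Project_Updates_Description'].str.extract(
    pattern, flags=re.MULTILINE, expand=False
)

# Real datetime column; rows without a date become NaT.
df['date'] = pd.to_datetime(first_date, format='%b %d, %Y', errors='coerce')

# dd/mm/yyyy is only a display format, so it is stored as a string column.
df['date_str'] = df['date'].dt.strftime('%d/%m/%Y')

print(df[['date', 'date_str']])
```

Keeping `date` as a datetime column and formatting only for display means sorting and date arithmetic still work. The formatted `date_str` values are plain strings again.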