Python DateDiff in DataFrame

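I am trying to compare a given set of datetimes against a predefined threshold. The end objective is to get the rows of a column whose value exceeds the threshold.

Here is the code I have tried so far: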

#!/usr/bin/python

from datetime import datetime
import sys
import logging
import operator
import pymysql
import pandas as pd

db_endpoint = "awsendpoint"
db_username="user"
db_password="password"
db_name="database_name"
port = 3306

logger = logging.getLogger()
logger.setLevel(logging.INFO)

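try:
    # Keyword arguments avoid relying on the positional order of connect().
    conn = pymysql.connect(host=db_endpoint, port=port, user=db_username,
                           password=db_password, database=db_name, connect_timeout=5)
except pymysql.MySQLError:
    logger.error("ERROR: Unexpected error: Could not connect to MySql instance.")
    sys.exit()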

logger.info("SUCCESS: Connection to RDS mysql instance succeeded")

cur=conn.cursor()
cur.execute("select talendjobname, taskstartdate from taskexecutionhistory where basicstatus = 'RUNNING'")

# The query returns rows such as:
# [('Prod_Adobe_Master_Process_v2', datetime.datetime(2018, 12, 17, 3, 30)),
#  ('Prod_Sales_n_DG_Master_Process_v2', datetime.datetime(2018, 12, 17, 4, 0)),
#  ('SDG_download_mail_attachments', datetime.datetime(2018, 12, 23, 3, 0, 1))]

aws = []
for row in cur:
    aws.append(row)
# Everything works up to this point.
aws = pd.DataFrame(aws)

aws_time = aws.iloc[:,1]

# Compute how long each job has been running relative to the current time.
def days_between(d1):
    # d1 = datetime.strptime(d1, "%Y-%m-%d")
    return abs(datetime.now() - d1)

# Here is the problem.
# The output is a list of values such as:
#   3 days 11 hours 30 mins,
#   2 days 10 hours 12 mins,
#   and so on

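My threshold is 8 hours, and I cannot compare it against this result. I want to get a list of only the jobs that exceed this threshold.

I also tried a few additional things: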

time_passed = []
for i in range(0,len(aws_time.index)):
    x = days_between(aws_time[i])

    time_passed.append(x)

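Let me know what I am missing, or whether there is a different approach. TimeDelta is the main class I am struggling with. I tried string manipulation, but I could not convert the output to a string either.

datetime.timedelta objects in Python have a method called .total_seconds(), which you can use to find the number of hours between two times.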

from datetime import datetime, timedelta
t1 = datetime.now()
t2 = datetime.now() - timedelta(hours=10)
type(t1 - t2)

datetime.timedelta

# Find total hours between times
(t1 - t2).total_seconds() / 3600

9.99999
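
Converting to hours is not strictly required for the comparison itself, because timedelta objects can be compared with each other directly. A minimal sketch, assuming the 8-hour threshold from the question; the helper name exceeds_threshold and the fixed now value are hypothetical and chosen only to make the example reproducible:

```python
from datetime import datetime, timedelta

# Assumed threshold, taken from the question text.
THRESHOLD = timedelta(hours=8)

def exceeds_threshold(start, now, threshold=THRESHOLD):
    """Return True if more than `threshold` has elapsed between start and now."""
    return (now - start) > threshold

# Hypothetical fixed reference time; in practice this would be datetime.now().
now = datetime(2018, 12, 23, 12, 0, 0)

long_running = exceeds_threshold(datetime(2018, 12, 17, 3, 30), now)  # started days ago
recent = exceeds_threshold(datetime(2018, 12, 23, 10, 0), now)        # started 2 hours ago
print(long_running, recent)
```

Passing now as an argument rather than calling datetime.now() inside the helper keeps the check deterministic and easier to test.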

You can vectorize the operation to find all the time differences in hours at once (use the .dt accessor to work with the time differences):

# Find the time differences and convert to hours
aws['time_diff'] = aws.iloc[:, 1].apply(days_between) 
aws['time_diff_hours'] = aws['time_diff'].dt.total_seconds() / 3600

Then subset to the rows where the number of hours is greater than 8:

greater_than_8_hours = aws[aws['time_diff_hours'] > 8]
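
The steps above can be combined into one self-contained sketch. The sample rows are the ones shown in the question; the column names, the fixed now timestamp, and the use of pd.Timedelta for the threshold are assumptions made for illustration and are not part of the original code:

```python
from datetime import datetime

import pandas as pd

# Sample rows copied from the query output shown in the question.
rows = [
    ("Prod_Adobe_Master_Process_v2", datetime(2018, 12, 17, 3, 30)),
    ("Prod_Sales_n_DG_Master_Process_v2", datetime(2018, 12, 17, 4, 0)),
    ("SDG_download_mail_attachments", datetime(2018, 12, 23, 3, 0, 1)),
]
# Column names are assumed here; the original DataFrame used default integer labels.
aws = pd.DataFrame(rows, columns=["talendjobname", "taskstartdate"])

# Hypothetical fixed reference time; in practice use pd.Timestamp.now().
now = pd.Timestamp(2018, 12, 23, 9, 0, 0)

# Subtracting the whole column at once yields a column of time differences.
aws["time_diff"] = now - aws["taskstartdate"]
aws["time_diff_hours"] = aws["time_diff"].dt.total_seconds() / 3600

# Compare against the threshold directly as a Timedelta.
threshold = pd.Timedelta(hours=8)
long_running = aws[aws["time_diff"] > threshold]
print(long_running["talendjobname"].tolist())
```

Subtracting the column from a single timestamp avoids the per-row apply call, and comparing against a pd.Timedelta keeps the threshold in its natural unit.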