Extract data and sort it by date
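I am working through an exercise on string manipulation and sorting.
The exercise asks me to extract the words that carry a time reference (e.g., hours, days) from a text and to sort the rows in ascending order by the extracted time.
An example of the data: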
Customer Text
1 12 hours ago — the customer applied for a discount
2 6 hours ago — the customer contacted the customer service
3 1 day ago — the customer reported an issue
4 1 day ago — no answer
4 2 days ago — Open issue
5
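I can identify a few difficulties in this task:
- the time reference can be expressed in hours, days, or weeks
- there are null values, or rows with no time reference
- the extracted time needs a suitable, more general format, e.g., one based on the current datetime
Regarding the first point, I noticed that the time reference, when present, generally comes before the "—", so it should be easy to extract.
Regarding the second point, an if statement can avoid the errors caused by incomplete or missing fields.
However, I do not know how to address the third point.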
My expected result is:
Customer Text Sort by
1 12 hours ago — the customer applied for a discount 1
2 6 hours ago — the customer contacted the customer service 2
3 1 day ago — the customer reported an issue 2
4 1 day ago — no answer 2
4 2 days ago — Open issue 3
5
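Given the DataFrame example, I will assume that, for this exercise, the first two words of the text are what you are looking for. It is not clear to me how the sorting is supposed to work, but for the third point a more suitable time would be the current time minus the timedelta taken from the Text column.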
You can apply an if-else lambda function to Text that takes the first two words of each row and converts them to a pandas Timedelta object. For example, pd.Timedelta("1 day") returns a Timedelta object.
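As a quick illustration of that conversion, here is a minimal sketch that parses a few "number unit" strings with pd.Timedelta. The sample strings are made up for the example and stand in for the first two words of each Text entry.

```python
import pandas as pd

# Illustrative strings only: the first two words of each Text entry.
samples = ["12 hours", "6 hours", "1 day", "2 days"]

# pd.Timedelta accepts a "<number> <unit>" string and returns a Timedelta.
deltas = [pd.Timedelta(s) for s in samples]

for s, d in zip(samples, deltas):
    print(f"{s!r} -> {d}")

# Timedelta objects compare by duration, so they can be sorted directly.
shortest_first = sorted(deltas)
print(shortest_first)
```

Because the parsed values compare by duration, "6 hours" sorts before "12 hours" even though it would not as plain text.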
You can then subtract the Timedelta column from the current time, which you can get with pd.Timestamp.now():
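import pandas as pd
# Parse the first two words of each non-null Text entry (e.g. "12 hours") as a Timedelta
df["Timedelta"] = df.Text.apply(lambda x: pd.Timedelta(' '.join(x.split(" ")[:2])) if pd.notnull(x) else x)
df["Time"] = pd.Timestamp.now() - df["Timedelta"]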
Output:
>>> df
Customer Text Timedelta Time
0 1 12 hours ago — the customer applied for a disc... 0 days 12:00:00 2021-11-23 09:22:40.691768
1 2 6 hours ago — the customer contacted the custo... 0 days 06:00:00 2021-11-23 15:22:40.691768
2 3 1 day ago — the customer reported an issue 1 days 00:00:00 2021-11-22 21:22:40.691768
3 4 1 day ago — no answer 1 days 00:00:00 2021-11-22 21:22:40.691768
4 4 2 days ago — Open issue 2 days 00:00:00 2021-11-21 21:22:40.691768
5 5 NaN NaT NaT
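The approach above stops before the sorting step and assumes every non-null text starts with a time reference. The following is a sketch of one way to cover the remaining points, not a definitive solution: it uses a regular expression so that rows without a time reference become NaT instead of raising an error, and then sorts by the extracted duration. The sample DataFrame, the extract_timedelta helper, and the list of accepted units (minute, hour, day, week) are assumptions made for this example.

```python
import re
import pandas as pd

# Sample rows mirroring the question's data; the last row has no text.
df = pd.DataFrame({
    "Customer": [1, 2, 3, 4, 4, 5],
    "Text": [
        "12 hours ago — the customer applied for a discount",
        "6 hours ago — the customer contacted the customer service",
        "1 day ago — the customer reported an issue",
        "1 day ago — no answer",
        "2 days ago — Open issue",
        None,
    ],
})

# Leading "<number> <unit> ago"; the list of units is an assumption.
PATTERN = re.compile(r"^\s*(\d+)\s+(minute|hour|day|week)s?\s+ago\b", re.IGNORECASE)

def extract_timedelta(text):
    """Return the leading time reference as a Timedelta, or NaT if absent."""
    if not isinstance(text, str):
        return pd.NaT  # null or non-string field
    match = PATTERN.match(text)
    if match is None:
        return pd.NaT  # text without a recognizable time reference
    amount, unit = match.groups()
    # Build e.g. pd.Timedelta(hours=12) or pd.Timedelta(weeks=1).
    return pd.Timedelta(**{unit.lower() + "s": int(amount)})

df["Timedelta"] = pd.to_timedelta(df["Text"].apply(extract_timedelta))

# Use a single "now" so every row is measured against the same instant.
now = pd.Timestamp.now()
df["Time"] = now - df["Timedelta"]

# Shortest time ago first; rows without a time reference go last.
result = df.sort_values("Timedelta", kind="stable", na_position="last")
print(result[["Customer", "Timedelta", "Time"]])
```

Sorting on Timedelta in ascending order puts the most recent events first; pass ascending=False to sort_values to list the oldest events first. The stable sort keeps rows with equal durations in their original order.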