Extract data and sort it by date

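I am trying to work through an exercise on string manipulation and sorting. The exercise asks you to extract the words that carry a time reference (e.g., hours, days) from the text and to sort the rows in ascending order by the extracted time. An example of the data is: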

Customer     Text
1            12 hours ago — the customer applied for a discount
2            6 hours ago — the customer contacted the customer service
3            1 day ago — the customer reported an issue
4            1 day ago — no answer
4            2 days ago — Open issue
5            

I can identify a few difficulties in this task:

- time reference can be expressed as hours/days/weeks
- there are null values or no reference to time
- getting a suitable, more general time format, e.g., one based on the current datetime

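Regarding the first point, I noticed that the time reference, when it is present, always comes before the — separator, so it is easy to extract. Regarding the second point, an if statement can avoid the errors caused by incomplete/missing fields. However, I do not know how to address the third point.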
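As a rough illustration of the first two points, the extraction could be sketched with the standard library alone. The helper name extract_age and the regular expression are assumptions made for this sketch and are not part of the exercise. It only recognises the pattern "<number> hour(s)/day(s)/week(s) ago" at the start of the text.

```python
import re
from datetime import timedelta

# Matches "<number> <unit> ago" at the start of the text, e.g. "12 hours ago".
# Only hours, days and weeks are recognised (an assumption for this sketch).
TIME_REF = re.compile(r"^\s*(\d+)\s+(hour|day|week)s?\s+ago\b", re.IGNORECASE)


def extract_age(text):
    """Return how long ago the entry was made, or None if no time reference is found."""
    if not isinstance(text, str):  # covers None and NaN
        return None
    match = TIME_REF.match(text)
    if match is None:  # empty text or text without a time reference
        return None
    amount = int(match.group(1))
    unit = match.group(2).lower()
    # timedelta accepts hours=, days= and weeks= keyword arguments
    return timedelta(**{unit + "s": amount})
```

Rows for which extract_age returns None can then be placed at the end when sorting.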

My expected result is:

Customer     Text                                                        Sort by
1            12 hours ago — the customer applied for a discount             1
2            6 hours ago — the customer contacted the customer service      2
3            1 day ago — the customer reported an issue                     2
4            1 day ago — no answer                                          2
4            2 days ago — Open issue                                        3
5            

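Given the example DataFrame, I will assume that, for this exercise, the first two words of the text are what you are looking for. It is not clear to me how the sorting is supposed to work, but for the third point a more suitable time would be the current time minus the timedelta taken from the Text column.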

You can apply an if-else lambda function to the first two words of each row of Text and convert them to a pandas Timedelta object. For example, pd.Timedelta("1 day") returns a Timedelta object.

Then you can subtract the Timedelta column from the current time, which you can get with pd.Timestamp.now():

df["Timedelta"] = df.Text.apply(lambda x: pd.Timedelta(' '.join(x.split(" ")[:2])) if pd.notnull(x) else x)
df["Time"] = pd.Timestamp.now() - df["Timedelta"]

Output:

>>> df
   Customer                                               Text       Timedelta                       Time
0         1  12 hours ago — the customer applied for a disc... 0 days 12:00:00 2021-11-23 09:22:40.691768
1         2  6 hours ago — the customer contacted the custo... 0 days 06:00:00 2021-11-23 15:22:40.691768
2         3         1 day ago — the customer reported an issue 1 days 00:00:00 2021-11-22 21:22:40.691768
3         4                              1 day ago — no answer 1 days 00:00:00 2021-11-22 21:22:40.691768
4         4                            2 days ago — Open issue 2 days 00:00:00 2021-11-21 21:22:40.691768
5         5                                                NaN             NaT                        NaT
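
The snippet above stops before the actual sort. A minimal sketch of that last step follows, assuming that "ascending" means most recent first (smallest Timedelta first) and that rows without a time reference should go last. The DataFrame is rebuilt from the question's sample data, and the name by_age is only illustrative.

```python
import pandas as pd

# Sample data from the question; customer 5 has no text.
df = pd.DataFrame({
    "Customer": [1, 2, 3, 4, 4, 5],
    "Text": [
        "12 hours ago — the customer applied for a discount",
        "6 hours ago — the customer contacted the customer service",
        "1 day ago — the customer reported an issue",
        "1 day ago — no answer",
        "2 days ago — Open issue",
        None,
    ],
})

# First two words ("12 hours", "1 day", ...) parsed as a Timedelta; missing text becomes NaT.
df["Timedelta"] = pd.to_timedelta(
    df["Text"].apply(
        lambda x: pd.Timedelta(" ".join(x.split(" ")[:2])) if pd.notnull(x) else pd.NaT
    )
)
df["Time"] = pd.Timestamp.now() - df["Timedelta"]

# Stable sort keeps the original order of ties (the two "1 day ago" rows); NaT rows go last.
by_age = df.sort_values("Timedelta", na_position="last", kind="mergesort")
print(by_age[["Customer", "Timedelta"]])
```

Sorting by Time in descending order would give the same row order, since Time is just the current time minus Timedelta.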