Extract data and sort them by date

Question

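I am trying to work out an exercise on string manipulation and sorting. The exercise asks me to extract words with a time reference (e.g., hours, days) from a text and to sort the rows in ascending order by the extracted time. An example of the data is: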

Customer     Text
1            12 hours ago — the customer applied for a discount
2            6 hours ago — the customer contacted the customer service
3            1 day ago — the customer reported an issue
4            1 day ago — no answer
4            2 days ago — Open issue
5

I can identify several difficulties in this task:

- the time reference can be expressed in hours, days, or weeks
- there are null values, or rows with no reference to time
- obtaining a suitable, more general time format, e.g., one based on the current datetime

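Regarding the first point, I noticed that the time reference, when present, usually comes before the —, so it is easy to extract. Regarding the second point, an if statement can avoid the errors caused by incomplete or missing fields. However, I do not know how to address the third point.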

My expected result is:

Customer     Text                                                        Sort by
1            12 hours ago — the customer applied for a discount             1
2            6 hours ago — the customer contacted the customer service      2
3            1 day ago — the customer reported an issue                     2
4            1 day ago — no answer                                          2
4            2 days ago — Open issue                                        3
5

Answer 1

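Given the DataFrame example, I will assume that for this exercise the first two words of the text are what you are looking for. I am not clear on how the sorting is meant to work, but for the third point, a more suitable time would be the current time minus the timedelta taken from the Text column.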

You can apply an if-else lambda function to the first two words of each row of Text and convert them to a pandas Timedelta object. For example, pd.Timedelta("1 day") returns a Timedelta object.
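
As a quick check of that conversion, here is a minimal sketch that parses the two-word phrases from the sample data. The variable names (phrases, parsed) are illustrative only and are not part of the original answer.

```python
import pandas as pd

# The two-word prefixes that appear in the sample Text column.
phrases = ["12 hours", "6 hours", "1 day", "2 days"]

# pd.Timedelta accepts strings of the form "<number> <unit>".
parsed = {phrase: pd.Timedelta(phrase) for phrase in phrases}

for phrase, delta in parsed.items():
    print(phrase, "->", delta)
```

Because the results are real Timedelta objects, they compare and sort by duration rather than alphabetically.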

You can then subtract the Timedelta column from the current time, which you can get with pd.Timestamp.now():

df["Timedelta"] = df.Text.apply(lambda x: pd.Timedelta(' '.join(x.split(" ")[:2])) if pd.notnull(x) else x)
df["Time"] = pd.Timestamp.now() - df["Timedelta"]

Output:

>>> df
   Customer                                               Text       Timedelta                       Time
0         1  12 hours ago — the customer applied for a disc... 0 days 12:00:00 2021-11-23 09:22:40.691768
1         2  6 hours ago — the customer contacted the custo... 0 days 06:00:00 2021-11-23 15:22:40.691768
2         3         1 day ago — the customer reported an issue 1 days 00:00:00 2021-11-22 21:22:40.691768
3         4                              1 day ago — no answer 1 days 00:00:00 2021-11-22 21:22:40.691768
4         4                            2 days ago — Open issue 2 days 00:00:00 2021-11-21 21:22:40.691768
5         5                                                NaN             NaT                        NaT
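
The question also asks for the rows to be sorted, which the snippet above does not do. The following is a sketch of one way to finish the job, not a definitive implementation. It assumes the time reference always has the shape "N hour(s)/day(s)/week(s) ago" at the start of the text. The helper to_timedelta, the UNITS mapping, and the regular expression are hypothetical names introduced for illustration, and the regex is swapped in for the "first two words" split so that rows with no time reference become NaT instead of raising an error.

```python
import re

import pandas as pd

# Sample data mirroring the question; customer 5 has no text.
df = pd.DataFrame({
    "Customer": [1, 2, 3, 4, 4, 5],
    "Text": [
        "12 hours ago — the customer applied for a discount",
        "6 hours ago — the customer contacted the customer service",
        "1 day ago — the customer reported an issue",
        "1 day ago — no answer",
        "2 days ago — Open issue",
        None,
    ],
})

# Hypothetical mapping from the unit word to a pd.Timedelta keyword argument.
UNITS = {"hour": "hours", "day": "days", "week": "weeks"}
# Assumed shape of the time reference: "<number> <unit>[s] ago" at the start.
PATTERN = re.compile(r"^\s*(\d+)\s+(hour|day|week)s?\s+ago\b", re.IGNORECASE)


def to_timedelta(text):
    """Return the leading 'N unit(s) ago' as a Timedelta, or NaT if absent."""
    if not isinstance(text, str):
        return pd.NaT  # null value
    match = PATTERN.match(text)
    if match is None:
        return pd.NaT  # text without a time reference
    amount, unit = match.groups()
    return pd.Timedelta(**{UNITS[unit.lower()]: int(amount)})


# pd.to_timedelta makes sure the column ends up with a timedelta dtype.
df["Timedelta"] = pd.to_timedelta(df["Text"].apply(to_timedelta))
df["Time"] = pd.Timestamp.now() - df["Timedelta"]

# Smallest elapsed time first; mergesort is stable, so ties keep their
# original order, and rows without a time reference go last.
result = df.sort_values(
    "Timedelta", kind="mergesort", na_position="last", ignore_index=True
)
print(result[["Customer", "Text", "Timedelta"]])
```

Sorting on Timedelta in ascending order puts the most recent event first. To put the oldest event first instead, sort on the Time column or pass ascending=False.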
