Python string to datetime - Yelp data

The Yelp dataset provides check-in information as strings:

Business_id    date
A              2010-04-22 05:31:33, 2010-05-09 18:24:50, ...
B              2010-03-07 02:04:38, 2010-04-11 01:45:57, 2014-05-02 18:40:35, 2014-05-06 17:59:33, ...

I want to count the number of check-ins per business per day.

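If the check-in string for a particular business is stored in rawCheckInsString, you can parse it into a list of datetime objects (checkIns) and count the ones that fall on today's date: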

from datetime import datetime
import re

rawCheckInsString = "2010-03-07 02:04:38, 2010-04-11 01:45:57,2014-05-02 18:40:35, 2014-05-06 17:59:33"

checkIns = [datetime.strptime(datetimeString, '%Y-%m-%d %H:%M:%S') for datetimeString in re.split(r",\s?", rawCheckInsString)]

todayDate = datetime.today().date()
result = 0
for dt in checkIns:
    if dt.date() == todayDate:
        result += 1

print(result)

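Note that this does not handle time zones: the timestamps are parsed as naive datetimes and compared against the local date.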
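The snippet above counts only today's check-ins. To get a count for every day, one option is to tally the calendar dates with collections.Counter. The following is a minimal sketch under the same assumptions (naive timestamps in '%Y-%m-%d %H:%M:%S' format); the sample string and the helper name count_per_day are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample: two of the four check-ins fall on 2014-05-02
raw_check_ins = "2010-03-07 02:04:38, 2010-04-11 01:45:57,2014-05-02 18:40:35, 2014-05-02 17:59:33"


def count_per_day(raw):
    """Return a Counter mapping each calendar date to its number of check-ins."""
    # Split on commas and drop surrounding whitespace and empty fragments
    stamps = [s.strip() for s in raw.split(",") if s.strip()]
    return Counter(
        datetime.strptime(s, "%Y-%m-%d %H:%M:%S").date() for s in stamps
    )


per_day = count_per_day(raw_check_ins)
for day, n in sorted(per_day.items()):
    print(day, n)
```

Like the original snippet, this treats every timestamp as naive, so the day boundaries are whatever the data's own clock implies.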

Assuming your data is in a text or CSV file:

Sample data

A,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2021-07-06 08:02:15,2021-07-06 10:01:18"
B,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-07 18:02:33,2021-07-06 08:05:15,2021-07-06 10:01:20"
C,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-08 16:05:20,2014-05-08 17:06:10,2021-07-06 10:01:19,2021-07-06 08:02:30,2021-07-06 10:01:20,2021-07-06 10:01:28"

You can read the data into a DataFrame and try the following:

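import pandas as pd
from datetime import datetime

# Each row holds a business id and a quoted, comma-separated list of timestamps
df = pd.read_csv(r"/dir/filepath/filename.txt", header=None, delimiter=',')
df.columns = ["B_id", "Date"]

# Split each string into a list; explode converts the list into separate rows
df = df.assign(Date=df.Date.str.split(',')).explode("Date")
df["Date"] = pd.to_datetime(df["Date"])
print(df)

# Keep only today's check-ins and count them per business
today = datetime.today().date()
today_df = df[df["Date"].dt.date == today]
grouped_df = today_df.groupby("B_id")["Date"].count()
print(grouped_df.head())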

Output after .explode():

    B_id    Date
0   A   2010-03-07 02:04:38
0   A   2010-04-11 01:45:57
0   A   2014-05-02 18:40:35
0   A   2014-05-06 17:59:33
0   A   2021-07-06 08:02:15
0   A   2021-07-06 10:01:18
1   B   2010-03-07 02:04:38
1   B   2010-04-11 01:45:57
1   B   2014-05-02 18:40:35
1   B   2014-05-06 17:59:33
1   B   2014-05-07 18:02:33
1   B   2021-07-06 08:05:15
1   B   2021-07-06 10:01:20
2   C   2010-03-07 02:04:38
2   C   2010-04-11 01:45:57
2   C   2014-05-02 18:40:35
2   C   2014-05-06 17:59:33
2   C   2014-05-08 16:05:20
2   C   2014-05-08 17:06:10

Final output (when run on 2021-07-06, the most recent date in the sample data):

B_id  
A     2
B     2
C     4
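The result above covers a single day. Since the question asks for counts per business per day, the same exploded DataFrame can be grouped on both the business id and the calendar date. This is a sketch rather than a definitive solution: the inline sample is a shortened, made-up subset of the data, and the column names Day and check_ins are introduced here for illustration.

```python
import io

import pandas as pd

# Hypothetical inline sample in the same layout as the file above
sample = io.StringIO(
    'A,"2010-03-07 02:04:38,2021-07-06 08:02:15,2021-07-06 10:01:18"\n'
    'B,"2014-05-07 18:02:33,2021-07-06 08:05:15"\n'
)

df = pd.read_csv(sample, header=None, names=["B_id", "Date"])

# One row per check-in, parsed into naive datetimes
df = df.assign(Date=df["Date"].str.split(",")).explode("Date")
df["Date"] = pd.to_datetime(df["Date"].str.strip(), format="%Y-%m-%d %H:%M:%S")

# Group on business id and calendar date to get per-business, per-day counts
df["Day"] = df["Date"].dt.date
daily_counts = df.groupby(["B_id", "Day"]).size().reset_index(name="check_ins")
print(daily_counts)
```

Filtering daily_counts on a particular Day then gives the per-business counts for that date without recomputing anything.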