Python 字符串到日期时间 - Yelp 数据
Python string to datetime - Yelp data
Yelp 数据集以字符串形式提供签到信息:
Business_id
日期
一个
2010-04-22 05:31:33, 2010-05-09 18:24:50,...
B
2010-03-07 02:04:38, 2010-04-11 01:45:57, 2014-05-02 18:40:35, 2014-05-06 17:59:33,. ..
我想计算每个商家每天签到的人数
如果特定商家的签到字符串存储在 checkIns
中,那么您可以:
from datetime import datetime
import re
rawCheckInsString = "2010-03-07 02:04:38, 2010-04-11 01:45:57,2014-05-02 18:40:35, 2014-05-06 17:59:33"
checkIns = [datetime.strptime(datetimeString, '%Y-%m-%d %H:%M:%S') for datetimeString in re.split(r",\s?", rawCheckInsString)]
todayDate = datetime.today().date()
result = 0
for dt in checkIns:
if dt.date() == todayDate:
result += 1
print(result)
这显然不处理时区。
假设您的数据在文本文件或 CSV 文件中:
示例数据
A,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2021-07-06 08:02:15,2021-07-06 10:01:18"
B,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-07 18:02:33,2021-07-06 08:05:15,2021-07-06 10:01:20"
C,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-08 16:05:20,2014-05-08 17:06:10,2021-07-06 10:01:19,2021-07-06 08:02:30,2021-07-06 10:01:20,2021-07-06 10:01:28"
您可以将数据读入 Dataframe 并尝试以下操作:
df = pd.read_csv(r"/dir/filepath/filename.txt", header=None, delimiter=',')
df.columns = ["B_id", "Date"]
# explode converts the list into separate rows
df = df.assign(Date= df.Date.str.split(',')).explode("Date")
df["Date"] = pd.to_datetime(df["Date"])
print(df)
today = datetime.today().date()
today_df = df[df["Date"].dt.date == today]
grouped_df = today_df.groupby("B_id")["Date"].count()
grouped_df.head()
.explode() 后的输出:
B_id Date
0 A 2010-03-07 02:04:38
0 A 2010-04-11 01:45:57
0 A 2014-05-02 18:40:35
0 A 2014-05-06 17:59:33
0 A 2021-07-06 08:02:15
0 A 2021-07-06 10:01:18
1 B 2010-03-07 02:04:38
1 B 2010-04-11 01:45:57
1 B 2014-05-02 18:40:35
1 B 2014-05-06 17:59:33
1 B 2014-05-07 18:02:33
1 B 2021-07-06 08:05:15
1 B 2021-07-06 10:01:20
2 C 2010-03-07 02:04:38
2 C 2010-04-11 01:45:57
2 C 2014-05-02 18:40:35
2 C 2014-05-06 17:59:33
2 C 2014-05-08 16:05:20
2 C 2014-05-08 17:06:10
最终输出:
B_id
A 2
B 2
C 4
Yelp 数据集以字符串形式提供签到信息:
Business_id | 日期 |
---|---|
一个 | 2010-04-22 05:31:33, 2010-05-09 18:24:50,... |
B | 2010-03-07 02:04:38, 2010-04-11 01:45:57, 2014-05-02 18:40:35, 2014-05-06 17:59:33,. .. |
我想计算每个商家每天签到的人数
如果特定商家的签到字符串存储在 checkIns
中,那么您可以:
from datetime import datetime
import re
rawCheckInsString = "2010-03-07 02:04:38, 2010-04-11 01:45:57,2014-05-02 18:40:35, 2014-05-06 17:59:33"
checkIns = [datetime.strptime(datetimeString, '%Y-%m-%d %H:%M:%S') for datetimeString in re.split(r",\s?", rawCheckInsString)]
todayDate = datetime.today().date()
result = 0
for dt in checkIns:
if dt.date() == todayDate:
result += 1
print(result)
这显然不处理时区。
假设您的数据在文本文件或 CSV 文件中:
示例数据
A,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2021-07-06 08:02:15,2021-07-06 10:01:18"
B,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-07 18:02:33,2021-07-06 08:05:15,2021-07-06 10:01:20"
C,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-08 16:05:20,2014-05-08 17:06:10,2021-07-06 10:01:19,2021-07-06 08:02:30,2021-07-06 10:01:20,2021-07-06 10:01:28"
您可以将数据读入 Dataframe 并尝试以下操作:
df = pd.read_csv(r"/dir/filepath/filename.txt", header=None, delimiter=',')
df.columns = ["B_id", "Date"]
# explode converts the list into separate rows
df = df.assign(Date= df.Date.str.split(',')).explode("Date")
df["Date"] = pd.to_datetime(df["Date"])
print(df)
today = datetime.today().date()
today_df = df[df["Date"].dt.date == today]
grouped_df = today_df.groupby("B_id")["Date"].count()
grouped_df.head()
.explode() 后的输出:
B_id Date
0 A 2010-03-07 02:04:38
0 A 2010-04-11 01:45:57
0 A 2014-05-02 18:40:35
0 A 2014-05-06 17:59:33
0 A 2021-07-06 08:02:15
0 A 2021-07-06 10:01:18
1 B 2010-03-07 02:04:38
1 B 2010-04-11 01:45:57
1 B 2014-05-02 18:40:35
1 B 2014-05-06 17:59:33
1 B 2014-05-07 18:02:33
1 B 2021-07-06 08:05:15
1 B 2021-07-06 10:01:20
2 C 2010-03-07 02:04:38
2 C 2010-04-11 01:45:57
2 C 2014-05-02 18:40:35
2 C 2014-05-06 17:59:33
2 C 2014-05-08 16:05:20
2 C 2014-05-08 17:06:10
最终输出:
B_id
A 2
B 2
C 4