pandas combine_first resulting in more rows than the input
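In the data below, I need to change the date in the 'DATE' column to the previous day (DATE - 1 day) for rows where the CLOCKDATETIME time is earlier than 04:00. I have gotten as far as selecting the rows whose time is earlier than 04:00, shifting their date, and combining the result with the input. However, for 29 input rows the final result has 41 rows, whereas the row count should stay the same. How can I combine the data frames and get the desired result, with the same number of rows as the input?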
SAMPLE DATA IN CSV FORMAT:
DATE,CARD,CLOCKDATETIME
2015-05-01,100672,2015-05-01 00:03:00
2015-05-01,350132,2015-05-01 00:03:00
2015-05-01,100327,2015-05-01 00:07:00
2015-05-01,350075,2015-05-01 00:07:00
2015-05-01,300148,2015-05-01 00:07:00
2015-05-01,300344,2015-05-01 00:09:00
2015-05-01,100799,2015-05-01 00:11:00
2015-05-01,100771,2015-05-01 00:12:00
2015-05-01,100650,2015-05-01 00:14:00
2015-05-01,100771,2015-05-01 00:15:00
2015-05-01,100186,2015-05-01 00:16:00
2015-05-01,300279,2015-05-01 00:17:00
2015-05-01,300344,2015-05-01 00:17:00
2015-05-01,300148,2015-05-01 00:22:00
2015-05-01,100650,2015-05-01 00:22:00
2015-05-01,100799,2015-05-01 00:23:00
2015-05-01,100582,2015-05-01 00:26:00
2015-05-01,100887,2015-05-01 00:27:00
2015-05-01,100887,2015-05-01 00:30:00
2015-05-01,100746,2015-05-01 08:31:00
2015-05-01,100684,2015-05-01 08:33:00
2015-05-01,100073,2015-05-01 08:33:00
2015-05-01,100771,2015-05-01 08:47:00
2015-05-01,200011,2015-05-01 08:59:00
2015-05-01,100259,2015-05-01 09:07:00
2015-05-01,100631,2015-05-01 09:07:00
2015-05-01,100746,2015-05-01 09:07:00
2015-05-01,200032,2015-05-01 09:08:00
2015-05-01,100684,2015-05-01 09:09:00
Here is my current code:
import pandas as pd
from pandas.tseries.offsets import Day
bi = pd.read_csv('bi2.csv', parse_dates=[0,2])
bic = bi.sort_values(by=bi.columns[2])
bic.set_index(['CLOCKDATETIME'], inplace=True)
bid = bic.between_time('00:00','04:00')
bid.DATE = bid.DATE - Day()
bie = bid.combine_first(bic)
excess_rows = len(bie) - len(bi)
print excess_rows
Try this:
from __future__ import print_function
import pandas as pd
df = pd.read_csv('data.csv', parse_dates=['DATE','CLOCKDATETIME'])
# Strictly less than 4, so clock times from 04:00 onward keep their DATE
df.loc[(df['CLOCKDATETIME'].dt.hour < 4), 'DATE'] -= pd.Timedelta('1 days')
print(df)
Output:
DATE CARD CLOCKDATETIME
0 2015-04-30 100672 2015-05-01 00:03:00
1 2015-04-30 350132 2015-05-01 00:03:00
2 2015-04-30 100327 2015-05-01 00:07:00
3 2015-04-30 350075 2015-05-01 00:07:00
4 2015-04-30 300148 2015-05-01 00:07:00
5 2015-04-30 300344 2015-05-01 00:09:00
6 2015-04-30 100799 2015-05-01 00:11:00
7 2015-04-30 100771 2015-05-01 00:12:00
8 2015-04-30 100650 2015-05-01 00:14:00
9 2015-04-30 100771 2015-05-01 00:15:00
10 2015-04-30 100186 2015-05-01 00:16:00
11 2015-04-30 300279 2015-05-01 00:17:00
12 2015-04-30 300344 2015-05-01 00:17:00
13 2015-04-30 300148 2015-05-01 00:22:00
14 2015-04-30 100650 2015-05-01 00:22:00
15 2015-04-30 100799 2015-05-01 00:23:00
16 2015-04-30 100582 2015-05-01 00:26:00
17 2015-04-30 100887 2015-05-01 00:27:00
18 2015-04-30 100887 2015-05-01 00:30:00
19 2015-05-01 100746 2015-05-01 08:31:00
20 2015-05-01 100684 2015-05-01 08:33:00
21 2015-05-01 100073 2015-05-01 08:33:00
22 2015-05-01 100771 2015-05-01 08:47:00
23 2015-05-01 200011 2015-05-01 08:59:00
24 2015-05-01 100259 2015-05-01 09:07:00
25 2015-05-01 100631 2015-05-01 09:07:00
26 2015-05-01 100746 2015-05-01 09:07:00
27 2015-05-01 200032 2015-05-01 09:08:00
28 2015-05-01 100684 2015-05-01 09:09:00
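To check that a boolean-mask update leaves the row count alone, here is a minimal self-contained sketch. It assumes a five-row subset of the sample data read from an in-memory string rather than from a file; the names `csv_text`, `early`, `rows_before` and `rows_after` are illustrative only.

```python
import io

import pandas as pd

# A few rows copied from the sample data; two cards share the 00:03 timestamp.
csv_text = """DATE,CARD,CLOCKDATETIME
2015-05-01,100672,2015-05-01 00:03:00
2015-05-01,350132,2015-05-01 00:03:00
2015-05-01,100887,2015-05-01 00:30:00
2015-05-01,100746,2015-05-01 08:31:00
2015-05-01,100684,2015-05-01 09:09:00
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=['DATE', 'CLOCKDATETIME'])
rows_before = len(df)

# Boolean mask: True where the clock time is earlier than 04:00.
early = df['CLOCKDATETIME'].dt.hour < 4

# Update DATE in place for the masked rows only; nothing is joined or re-combined.
df.loc[early, 'DATE'] -= pd.Timedelta(days=1)

rows_after = len(df)
print(rows_before, rows_after)
print(df)
```

Because the assignment only writes into existing rows, repeated timestamps cannot add rows here.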
In your case, .loc will do the job:
bi.loc[bi.CLOCKDATETIME - bi.DATE < '04:00:00', 'DATE'] = bi.DATE - Day()
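As for why the original approach grew the frame, a likely explanation is that CLOCKDATETIME is not unique, so using it as the index gives repeated labels. When two frames with repeated labels are aligned, each repeated label on one side is paired with every matching label on the other. In the sample data the early timestamps 00:03, 00:07, 00:17 and 00:22 occur 2, 3, 2 and 2 times, which would give 4 + 9 + 4 + 4 = 21 rows in place of 9, that is 12 extra rows, and 29 + 12 = 41. The sketch below reproduces the effect on a three-row subset; it uses `.copy()` and `pd.Timedelta` instead of the question's exact statements, and the name `n_repeated` is illustrative.

```python
import io

import pandas as pd

# Three rows from the sample data; the first two share the same timestamp.
csv_text = """DATE,CARD,CLOCKDATETIME
2015-05-01,100672,2015-05-01 00:03:00
2015-05-01,350132,2015-05-01 00:03:00
2015-05-01,100746,2015-05-01 08:31:00
"""

bi = pd.read_csv(io.StringIO(csv_text), parse_dates=['DATE', 'CLOCKDATETIME'])
bic = bi.set_index('CLOCKDATETIME')

# Count index labels that occur more than once (both 00:03 rows).
n_repeated = int(bic.index.duplicated(keep=False).sum())

# Same steps as in the question: select early rows, shift DATE, combine back.
bid = bic.between_time('00:00', '04:00').copy()
bid['DATE'] = bid['DATE'] - pd.Timedelta(days=1)
bie = bid.combine_first(bic)

# Repeated labels are paired up during alignment, so bie is longer than bi.
print(n_repeated, len(bi), len(bie))
```

Updating the column through a mask, as in the answers above, sidesteps the alignment step entirely.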