Duplicated Nested list from Python Pandas Dataframe

Question

Here is my dataframe, for example:

                 requesttime  checkinperiod
0   2016-10-16T14:53:58.000Z              8
1   2016-10-16T22:53:22.000Z              8
2   2016-10-18T14:52:22.000Z              8
3   2016-10-18T06:53:08.000Z              8
4   2016-10-16T06:53:37.000Z              8
5   2016-10-15T22:53:14.000Z              8
6   2016-10-19T22:51:51.000Z              8
7   2016-10-22T10:16:57.000Z             12
8   2016-10-20T10:54:37.000Z             12
9   2016-10-20T06:51:42.000Z             12
10  2016-10-10T22:44:17.000Z             24
11  2016-10-13T22:47:26.000Z              8
12  2016-10-14T14:53:27.000Z              8
13  2016-10-14T22:53:58.000Z              8
14  2016-10-15T06:53:28.000Z              8
15  2016-10-14T06:53:58.000Z              8
16  2016-10-10T16:38:28.000Z             24
17  2016-10-17T06:53:50.000Z              8
18  2016-10-17T14:53:12.000Z              8
19  2016-10-19T14:51:53.000Z              8
20  2016-10-17T22:53:44.000Z              8
21  2016-10-15T14:53:50.000Z              8
22  2016-10-18T22:52:39.000Z              8
23  2016-10-12T22:27:51.000Z             24
24  2016-10-11T23:05:57.000Z             24
25  2016-10-19T06:52:53.000Z              8
26  2016-10-21T10:09:09.000Z             12
27  2016-10-21T22:17:15.000Z             12
28  2016-10-22T22:16:53.000Z             12
29  2016-10-20T23:02:13.000Z             12

Expected output:

{

8 : [
        [2016-10-16T14:53:58.000Z, 2016-10-16T22:53:22.000Z, 2016-10-18T14:52:22.000Z, 2016-10-16T06:53:37.000Z, 2016-10-15T22:53:14.000Z, 2016-10-19T22:51:51.000Z],
        [2016-10-13T22:47:26.000Z, 2016-10-13T22:47:26.000Z, 2016-10-14T22:53:58.000Z, 2016-10-15T06:53:28.000Z, 2016-10-14T06:53:58.000Z],
        [2016-10-17T06:53:50.000Z, 2016-10-17T14:53:12.000Z, 2016-10-19T14:51:53.000Z, 2016-10-17T22:53:44.000Z, 2016-10-15T14:53:50.000Z, 2016-10-18T22:52:39.000Z],
        [2016-10-19T06:52:53.000Z]
],
12: [
        [2016-10-22T10:16:57.000Z, 2016-10-20T10:54:37.000Z, 2016-10-20T06:51:42.000Z],
        [2016-10-21T10:09:09.000Z, 2016-10-21T22:17:15.000Z, 2016-10-22T22:16:53.000Z, 2016-10-20T23:02:13.000Z]
],
24: [
        [2016-10-10T22:44:17.000Z],
        [2016-10-10T16:38:28.000Z],
        [2016-10-12T22:27:51.000Z, 2016-10-11T23:05:57.000Z]
]
}

Thanks

Answer 1

Use a regular expression to filter the data and set the dictionary keys. Try text 2 regex.

Answer 2

import pandas as pd

# make sample data
col = 'checkinperiod'
df = pd.DataFrame([['a', 8], ['b', 8], ['c', 8], ['c', 12], ['d', 8], ['e', 12], ['f', 12]],
                  columns=['requesttime', col])
print(df)

  requesttime  checkinperiod
0           a              8
1           b              8
2           c              8
3           c             12
4           d              8
5           e             12
6           f             12

# flag every row whose value differs from the previous row, then take the
# cumulative sum so that each run of consecutive equal values gets its own id
df['group'] = (df[col].shift(1) != df[col]).astype(int).cumsum()
print(df)

  requesttime  checkinperiod  group
0           a              8      1
1           b              8      1
2           c              8      1
3           c             12      2
4           d              8      3
5           e             12      4
6           f             12      4

# collect the request times of each (checkinperiod, group) run into a list
df_grouped = pd.DataFrame(df.groupby([col, 'group']).apply(
    lambda g: list(g['requesttime'])))
df_grouped = df_grouped.reset_index().drop('group', axis=1)
print(df_grouped)

   checkinperiod          0
0              8  [a, b, c]
1              8        [d]
2             12        [c]
3             12     [e, f]

# gather the per-run lists under their checkinperiod key
result = df_grouped.groupby(col).apply(lambda g: list(g[0])).to_dict()
print(result)

{8: [['a', 'b', 'c'], ['d']], 12: [['c'], ['e', 'f']]}

Inspired by [1]
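For reuse, the same shift-and-cumsum idea can be wrapped in one function. This is only a sketch, not part of the original answer: the name `nested_runs` and its `key`/`value` parameters are made up for illustration, and it assumes the rows are already in the order in which the runs should be detected.

```python
import pandas as pd


def nested_runs(df, key="checkinperiod", value="requesttime"):
    """Return {key: [[values of 1st run], [values of 2nd run], ...]}.

    Hypothetical helper: a "run" is a block of consecutive rows
    that share the same value in the `key` column.
    """
    # A new run starts wherever the key differs from the previous row;
    # the cumulative sum turns those start flags into run ids.
    run_id = (df[key] != df[key].shift()).cumsum().rename("run")

    # One row per run: the key of the run and the list of its values.
    # sort=False keeps the runs in order of first appearance.
    runs = df.groupby(run_id, sort=False).agg({key: "first", value: list})

    result = {}
    for k, values in zip(runs[key], runs[value]):
        result.setdefault(k, []).append(values)
    return result


# Same toy data as in the answer above.
sample = pd.DataFrame(
    [["a", 8], ["b", 8], ["c", 8], ["c", 12], ["d", 8], ["e", 12], ["f", 12]],
    columns=["requesttime", "checkinperiod"],
)
result = nested_runs(sample)
```

On the toy data this should give the same dictionary as the step-by-step version. On the dataframe from the question, the call would be `nested_runs(df)` with the default column names.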

Tags: python, grouping, nested-lists, pandas