User-based encoding/conversion of interactions in pandas

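I have a dataframe like this:

user_id : identifies the user

question_id : identifies the question

user_answer : which option the user chose from (A, B, C, D)

correct_answer : the correct answer to that particular question

correct : 1.0 means the user answered correctly

elapsed_time : the time the user spent answering the question (in minutes)

timestamp : the UNIX timestamp of each interaction

real_date : a column I added by converting the timestamp to a human-readable date and time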

user_iD question_id user_answer correct_answer correct elapsed_time solving_id bundle_id timestamp real_date
1 1 A A 1.0 5.00 1 b1 1547794902000 Friday, January 18, 2019 7:01:42 AM
1 2 D D 1.0 3.00 2 b2 1547795130000 Friday, January 18, 2019 7:05:30 AM
1 5 C C 1.0 7.00 5 b5 1547795370000 Friday, January 18, 2019 7:09:30 AM
2 10 C C 1.0 5.00 10 b10 1547806170000 Friday, January 18, 2019 10:09:30 AM
2 1 B B 1.0 15.0 1 b1 1547802150000 Friday, January 18, 2019 9:02:30 AM
2 15 A A 1.0 2.00 15 b15 1547803230000 Friday, January 18, 2019 9:20:30 AM
2 7 C C 1.0 5.00 7 b7 1547802730000 Friday, January 18, 2019 9:12:10 AM
3 12 A A 1.0 1.00 25 b12 1547771110000 Friday, January 18, 2019 12:25:10 AM
3 10 C C 1.0 2.00 10 b10 1547770810000 Friday, January 18, 2019 12:20:10 AM
3 3 D D 1.0 5.00 3 b3 1547770390000 Friday, January 18, 2019 12:13:10 AM
104 6 C C 1.0 6.00 6 b6 1553040610000 Wednesday, March 20, 2019 12:10:10 AM
104 4 A A 1.0 5.00 4 b4 1553040547000 Wednesday, March 20, 2019 12:09:07 AM
104 1 A A 1.0 2.00 1 b1 1553040285000 Wednesday, March 20, 2019 12:04:45 AM
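For reference, a column like real_date can be derived from timestamp roughly as follows. This is a minimal sketch that assumes the values are milliseconds since the Unix epoch and that the displayed times are UTC, which is consistent with the sample rows above; the two-row frame is illustrative only.

```python
import pandas as pd

# Two timestamps taken from the sample rows above
df = pd.DataFrame({"timestamp": [1547794902000, 1553040285000]})

# unit="ms" because the values are milliseconds since the Unix epoch;
# the result is a timezone-naive datetime column representing UTC
df["real_date"] = pd.to_datetime(df["timestamp"], unit="ms")

# Day names, for comparison with the real_date text in the table
day_names = df["real_date"].dt.day_name().tolist()
print(df)
print(day_names)
```

Sorting by either timestamp or real_date gives the same order, so only one of the two columns is needed for ordering the interactions.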

I need to do some kind of encoding, but I don't know which kind I should use or how to do it.

I need the resulting dataframe to look like this:

user_id b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 3
3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0

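As you can see from timestamp and real_date, the question_id values of each user are not sorted. The new dataframe should show which bundles each user interacted with, numbered in chronological order.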

I think you are looking for LabelEncoder. First import the library:

#Common Model Helpers
from sklearn.preprocessing import LabelEncoder

Then you should be able to convert the object columns to categories:

    # CONVERT: convert objects to category

    # encode categorical data
    label = LabelEncoder()
    dataset['question_id'] = label.fit_transform(dataset['question_id'])
    dataset['user_answer'] = label.fit_transform(dataset['user_answer'])
    dataset['correct_answer'] = label.fit_transform(dataset['correct_answer'])

Or simply use the following:

dataset.apply(LabelEncoder().fit_transform)
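To make the effect of the encoding concrete, here is a minimal sketch on a tiny made-up frame. The dict name encoders is hypothetical; keeping one fitted LabelEncoder per column is just one possible choice, made here because refitting a single encoder overwrites its previous classes_ and the earlier columns can then no longer be decoded.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative frame (not the full dataset from the question)
dataset = pd.DataFrame({
    "user_answer": ["A", "D", "C", "B"],
    "correct_answer": ["A", "D", "C", "B"],
})

# One encoder per column (hypothetical helper dict), so every mapping stays invertible
encoders = {}
for col in ["user_answer", "correct_answer"]:
    encoders[col] = LabelEncoder()
    dataset[col] = encoders[col].fit_transform(dataset[col])

# LabelEncoder sorts the distinct values and numbers them from 0
codes = dataset["user_answer"].tolist()
classes = list(encoders["user_answer"].classes_)

# The stored encoder can map the integer codes back to the original labels
decoded = list(encoders["user_answer"].inverse_transform(dataset["user_answer"]))
print(codes, classes, decoded)
```

The codes are plain integers per column; they carry no information about users or time order.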

First use groupby followed by cumcount to create the final value for each bundle element, then pivot the dataframe. Finally, reindex it to get all the columns:

bundle = [f'b{i}' for i in range(1, 16)]

values = df.sort_values('timestamp').groupby('user_iD').cumcount().add(1)

out = (
  df.assign(value=values).pivot_table('value', 'user_iD', 'bundle_id', fill_value=0)
    .reindex(bundle, axis=1, fill_value=0)
)

Output:

>>> out
bundle_id  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
user_iD                                                                    
1           1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
2           1   0   0   0   0   0   2   0   0    4    0    0    0    0    3
3           0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
104         1   0   0   2   0   3   0   0   0    0    0    0    0    0    0

>>> out.reset_index().rename_axis(columns=None)
   user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
0        1   1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
1        2   1   0   0   0   0   0   2   0   0    4    0    0    0    0    3
2        3   0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
3      104   1   0   0   2   0   3   0   0   0    0    0    0    0    0    0
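The intermediate step may be easier to follow in isolation. Below is a minimal sketch using only the four rows of user 2 from the question; the variable names order and ranked are illustrative and not part of the code above.

```python
import pandas as pd

# The four interactions of user 2, in the (unsorted) order they appear in the question
df = pd.DataFrame({
    "user_iD": [2, 2, 2, 2],
    "bundle_id": ["b10", "b1", "b15", "b7"],
    "timestamp": [1547806170000, 1547802150000, 1547803230000, 1547802730000],
})

# Sort by time, then number the rows within each user: 1 = earliest interaction
order = df.sort_values("timestamp").groupby("user_iD").cumcount().add(1)

# cumcount keeps the original row labels, so assign() aligns the values
# back to the unsorted frame by index
ranked = df.assign(value=order)
rank_by_bundle = ranked.set_index("bundle_id")["value"].to_dict()
print(rank_by_bundle)
```

These per-row values are what the pivot step spreads across the bundle columns, which is why b10 ends up as 4 for user 2.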

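Lacking more Pythonish experience, I propose the following (partially commented) code snippet, which is not optimized in any way and is based only on the basic pandas.DataFrame API reference.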

import pandas as pd
import io
import sys

data_string = '''
user_iD;question_id;user_answer;correct_answer;correct;elapsed_time;solving_id;bundle_id;timestamp
1;1;A;A;1.0;5.00;1;b1;1547794902000
1;2;D;D;1.0;3.00;2;b2;1547795130000
1;5;C;C;1.0;7.00;5;b5;1547795370000
2;10;C;C;1.0;5.00;10;b10;1547806170000
2;1;B;B;1.0;15.0;1;b1;1547802150000
2;15;A;A;1.0;2.00;15;b15;1547803230000
2;7;C;C;1.0;5.00;7;b7;1547802730000
3;12;A;A;1.0;1.00;25;b12;1547771110000
3;10;C;C;1.0;2.00;10;b10;1547770810000
3;3;D;D;1.0;5.00;3;b3;1547770390000
104;6;C;C;1.0;6.00;6;b6;1553040610000
104;4;A;A;1.0;5.00;4;b4;1553040547000
104;1;A;A;1.0;2.00;1;b1;1553040285000
'''

df = pd.read_csv( io.StringIO(data_string), sep=";", encoding='utf-8')
# get only necessary columns ordered by timestamp
df_aux = df[['user_iD','bundle_id','correct', 'timestamp']].sort_values(by=['timestamp']) 

# hard coded new headers (possible to build from real 'bundle_id's)
df_new_headers = ['b{}'.format(x+1) for x in range(15)]
df_new_headers.insert(0, 'user_iD')

dict_answered = {}
# create a new dataframe (I'm sure that there is a more Pythonish solution)
df_new_data = []
user_ids = sorted(set( [x for label, x in df_aux.user_iD.items()]))
for user_id in user_ids:
    dict_answered[user_id] = 0
    if len( sys.argv) > 1 and sys.argv[1]:
        # supplied arg in the next line for better result readability
        df_new_values = [sys.argv[1].strip('"').strip("'")
            for x in range(len(df_new_headers)-1)]
    else:
        # zeroes (original assignment)
        df_new_values = [0 for x in range(len(df_new_headers)-1)]
    
    df_new_values.insert(0, user_id)
    df_new_data.append(df_new_values)

df_new = pd.DataFrame(data=df_new_data, columns=df_new_headers)

# fill the new dataframe using values from the original one
for aux in df_aux.itertuples(index=True, name=None):
    if aux[3] == 1.0:
        # add 1 to number of already answered questions for current user 
        dict_answered[aux[1]] += 1
        df_new.loc[ df_new["user_iD"] == aux[1], aux[2]] = dict_answered[aux[1]]    

print( df_new)

Sample output

Example: .\SO751715.py

   user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
0        1   1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
1        2   1   0   0   0   0   0   2   0   0    4    0    0    0    0    3
2        3   0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
3      104   1   0   0   2   0   3   0   0   0    0    0    0    0    0    0

Example: .\SO751715.py .

   user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0        1  1  2  .  .  3  .  .  .  .   .   .   .   .   .   .
1        2  1  .  .  .  .  .  2  .  .   4   .   .   .   .   3
2        3  .  .  1  .  .  .  .  .  .   2   .   3   .   .   .
3      104  1  .  .  2  .  3  .  .  .   .   .   .   .   .   .

Example: .\SO751715.py ''

   user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0        1  1  2        3
1        2  1                 2         4                   3
2        3        1                     2       3
3      104  1        2     3
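As the comment in the script notes, the headers could be built from the real bundle_id values instead of being hard coded. A minimal sketch of that idea, assuming every id has the form 'b' followed by a number (an assumption based on the sample data), with an illustrative series standing in for df['bundle_id']:

```python
import pandas as pd

# Illustrative stand-in for df['bundle_id']
bundle_ids = pd.Series(["b10", "b1", "b15", "b7", "b2", "b1"])

# Sort by the numeric suffix; a plain string sort would put 'b10' before 'b2'
headers = sorted(bundle_ids.unique(), key=lambda b: int(b[1:]))

# Same layout as df_new_headers in the script: user column first
all_headers = ["user_iD"] + headers
print(all_headers)
```

Note that this yields only the bundles that actually occur in the data, so columns such as b8 or b9 would be missing unless they are added explicitly.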