User-based encoding/conversion with its interaction in pandas
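I have a dataframe like this:
user_id : identifies the user
question_id : the question number
user_answer : which option the user chose from (A, B, C, D)
correct_answer : the correct answer for that particular question
correct : 1.0 means the user answered correctly
elapsed_time : the time the user spent answering the question, in minutes
timestamp : the UNIX timestamp of each interaction
real_date : I added this column by converting the timestamp to a human-readable date and time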
user_iD | question_id | user_answer | correct_answer | correct | elapsed_time | solving_id | bundle_id | timestamp | real_date |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | A | A | 1.0 | 5.00 | 1 | b1 | 1547794902000 | Friday, January 18, 2019 7:01:42 AM |
1 | 2 | D | D | 1.0 | 3.00 | 2 | b2 | 1547795130000 | Friday, January 18, 2019 7:05:30 AM |
1 | 5 | C | C | 1.0 | 7.00 | 5 | b5 | 1547795370000 | Friday, January 18, 2019 7:09:30 AM |
2 | 10 | C | C | 1.0 | 5.00 | 10 | b10 | 1547806170000 | Friday, January 18, 2019 10:09:30 AM |
2 | 1 | B | B | 1.0 | 15.0 | 1 | b1 | 1547802150000 | Friday, January 18, 2019 9:02:30 AM |
2 | 15 | A | A | 1.0 | 2.00 | 15 | b15 | 1547803230000 | Friday, January 18, 2019 9:20:30 AM |
2 | 7 | C | C | 1.0 | 5.00 | 7 | b7 | 1547802730000 | Friday, January 18, 2019 9:12:10 AM |
3 | 12 | A | A | 1.0 | 1.00 | 25 | b12 | 1547771110000 | Friday, January 18, 2019 12:25:10 AM |
3 | 10 | C | C | 1.0 | 2.00 | 10 | b10 | 1547770810000 | Friday, January 18, 2019 12:20:10 AM |
3 | 3 | D | D | 1.0 | 5.00 | 3 | b3 | 1547770390000 | Friday, January 18, 2019 12:13:10 AM |
104 | 6 | C | C | 1.0 | 6.00 | 6 | b6 | 1553040610000 | Wednesday, March 20, 2019 12:10:10 AM |
104 | 4 | A | A | 1.0 | 5.00 | 4 | b4 | 1553040547000 | Wednesday, March 20, 2019 12:09:07 AM |
104 | 1 | A | A | 1.0 | 2.00 | 1 | b1 | 1553040285000 | Wednesday, March 20, 2019 12:04:45 AM |
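
For reference, a real_date column like the one above can be derived from timestamp with pd.to_datetime. This is only a minimal sketch: it assumes the timestamps are milliseconds since the UNIX epoch in UTC, which matches the sample values, and the small frame is a hypothetical excerpt of the table (the first listed row of each user).

```python
import pandas as pd

# Hypothetical excerpt of the table above: first listed row of each user.
df = pd.DataFrame({
    "user_iD": [1, 2, 3, 104],
    "timestamp": [1547794902000, 1547806170000, 1547771110000, 1553040610000],
})

# Assumption: timestamps are milliseconds since the UNIX epoch (UTC).
df["real_date"] = pd.to_datetime(df["timestamp"], unit="ms")

print(df)
```

The result is a datetime64 column, so it can be sorted or compared directly instead of relying on the formatted string.
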
I need to do some encoding, but I don't know which kind of encoding I should use or how to do it.
I need the next dataframe to look like this:
user_id | b1 | b2 | b3 | b4 | b5 | b6 | b7 | b8 | b9 | b10 | b11 | b12 | b13 | b14 | b15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 2 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 3 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 3 | 0 | 0 | 0 |
104 | 1 | 0 | 0 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
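As you can see from timestamp and real_date, the question_id values for each user are not sorted. The new dataframe should show which bundles each user interacted with, ordered by time.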
I think you are looking for LabelEncoder. First import the library:
#Common Model Helpers
from sklearn.preprocessing import LabelEncoder
Then you should be able to convert the object columns to categories:
# Convert object columns to integer category codes
label = LabelEncoder()
dataset['question_id'] = label.fit_transform(dataset['question_id'])
dataset['user_answer'] = label.fit_transform(dataset['user_answer'])
dataset['correct_answer'] = label.fit_transform(dataset['correct_answer'])
Or simply use the following:
dataset.apply(LabelEncoder().fit_transform)
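
If scikit-learn is not at hand, a similar integer encoding can be sketched with pandas alone. Note that this swaps LabelEncoder for pandas.factorize with sort=True, a different tool that yields the same sorted integer codes for columns like these. The three-row `dataset` below is hypothetical, and keeping the uniques per column is my own addition so the codes can be mapped back.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real dataset.
dataset = pd.DataFrame({
    "user_answer": ["A", "D", "C"],
    "correct_answer": ["A", "D", "C"],
})

encoded = dataset.copy()
uniques = {}
for col in ["user_answer", "correct_answer"]:
    # sort=True assigns codes in sorted label order: A -> 0, C -> 1, D -> 2.
    codes, uniques[col] = pd.factorize(dataset[col], sort=True)
    encoded[col] = codes

print(encoded)

# The stored uniques map every code back to its original label.
decoded = uniques["user_answer"].take(encoded["user_answer"].to_numpy())
print(list(decoded))
```

Either way, such an encoding only replaces labels with integers; it does not by itself produce the wide per-user table asked for in the question.
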
First use groupby and cumcount to create the final value for each bundle element, then pivot the dataframe. Finally, reindex it to get all the columns:
bundle = [f'b{i}' for i in range(1, 16)]
values = df.sort_values('timestamp').groupby('user_iD').cumcount().add(1)
out = (
    df.assign(value=values)
      .pivot_table('value', 'user_iD', 'bundle_id', fill_value=0)
      .reindex(bundle, axis=1, fill_value=0)
)
Output:
>>> out
bundle_id b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
user_iD
1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
>>> out.reset_index().rename_axis(columns=None)
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
2 3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
3 104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
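
To make the intermediate step visible, the sketch below rebuilds the relevant columns of the sample data and prints each interaction's position in its user's timeline before pivoting. It uses groupby(...).rank(method="first") in place of sort_values plus cumcount, and DataFrame.pivot in place of pivot_table. Both are substitutions on my part rather than the answer's method, and pivot assumes each (user_iD, bundle_id) pair occurs at most once.

```python
import pandas as pd

df = pd.DataFrame({
    "user_iD":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 104, 104, 104],
    "bundle_id": ["b1", "b2", "b5", "b10", "b1", "b15", "b7",
                  "b12", "b10", "b3", "b6", "b4", "b1"],
    "timestamp": [1547794902000, 1547795130000, 1547795370000,
                  1547806170000, 1547802150000, 1547803230000, 1547802730000,
                  1547771110000, 1547770810000, 1547770390000,
                  1553040610000, 1553040547000, 1553040285000],
})

# 1-based position of each interaction within its user's own timeline.
df["order"] = (
    df.groupby("user_iD")["timestamp"].rank(method="first").astype(int)
)
print(df[["user_iD", "bundle_id", "order"]])

# Spread the positions over one column per bundle; untouched bundles get 0.
bundles = [f"b{i}" for i in range(1, 16)]
wide = (
    df.pivot(index="user_iD", columns="bundle_id", values="order")
      .reindex(columns=bundles)
      .fillna(0)
      .astype(int)
)
print(wide)
```

If a user can visit the same bundle more than once, pivot raises an error, and an aggregating pivot_table (for example with aggfunc="first") would be needed instead.
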
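Lacking more Pythonic experience, I propose the following (partially commented) code snippet, which is not optimized in any way and relies only on the basic pandas.DataFrame API reference.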
import pandas as pd
import io
import sys
data_string = '''
user_iD;question_id;user_answer;correct_answer;correct;elapsed_time;solving_id;bundle_id;timestamp
1;1;A;A;1.0;5.00;1;b1;1547794902000
1;2;D;D;1.0;3.00;2;b2;1547795130000
1;5;C;C;1.0;7.00;5;b5;1547795370000
2;10;C;C;1.0;5.00;10;b10;1547806170000
2;1;B;B;1.0;15.0;1;b1;1547802150000
2;15;A;A;1.0;2.00;15;b15;1547803230000
2;7;C;C;1.0;5.00;7;b7;1547802730000
3;12;A;A;1.0;1.00;25;b12;1547771110000
3;10;C;C;1.0;2.00;10;b10;1547770810000
3;3;D;D;1.0;5.00;3;b3;1547770390000
104;6;C;C;1.0;6.00;6;b6;1553040610000
104;4;A;A;1.0;5.00;4;b4;1553040547000
104;1;A;A;1.0;2.00;1;b1;1553040285000
'''
df = pd.read_csv( io.StringIO(data_string), sep=";", encoding='utf-8')
# get only necessary columns ordered by timestamp
df_aux = df[['user_iD','bundle_id','correct', 'timestamp']].sort_values(by=['timestamp'])
# hard coded new headers (possible to build from real 'bundle_id's)
df_new_headers = ['b{}'.format(x+1) for x in range(15)]
df_new_headers.insert(0, 'user_iD')
dict_answered = {}
# create a new dataframe (I'm sure that there is a more Pythonish solution)
df_new_data = []
user_ids = sorted(set( [x for label, x in df_aux.user_iD.items()]))
for user_id in user_ids:
    dict_answered[user_id] = 0
    if len( sys.argv) > 1 and sys.argv[1]:
        # supplied arg in the next line for better result readability
        df_new_values = [sys.argv[1].strip('"').strip("'")
            for x in range(len(df_new_headers)-1)]
    else:
        # zeroes (original assignment)
        df_new_values = [0 for x in range(len(df_new_headers)-1)]
    df_new_values.insert(0, user_id)
    df_new_data.append(df_new_values)
df_new = pd.DataFrame(data=df_new_data, columns=df_new_headers)
# fill the new dataframe using values from the original one
# each aux tuple is (index, user_iD, bundle_id, correct, timestamp)
for aux in df_aux.itertuples(index=True, name=None):
    if aux[3] == 1.0:
        # add 1 to number of already answered questions for current user
        dict_answered[aux[1]] += 1
        df_new.loc[ df_new["user_iD"] == aux[1], aux[2]] = dict_answered[aux[1]]
print( df_new)
Example output
Example: .\SO751715.py
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
2 3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
3 104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
Example: .\SO751715.py .
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 . . 3 . . . . . . . . . .
1 2 1 . . . . . 2 . . 4 . . . . 3
2 3 . . 1 . . . . . . 2 . 3 . . .
3 104 1 . . 2 . 3 . . . . . . . . .
Example: .\SO751715.py ''
   user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
0        1   1   2           3
1        2   1                       2            4                        3
2        3           1                            2         3
3      104   1           2       3
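
The snippet above hard-codes the b1 to b15 headers and notes that they could be built from the real bundle_id values. A minimal sketch of that idea follows. It assumes every id has the form "b" followed by an integer, as in the sample data; `bundle_headers` is a hypothetical helper name, and the small frame is made up for illustration.

```python
import pandas as pd

def bundle_headers(bundle_ids):
    """Return the distinct bundle ids in numeric order (b2 before b10)."""
    # Assumption: each id looks like "b<integer>".
    return sorted(set(bundle_ids), key=lambda b: int(b[1:]))

# Made-up ids, including a repeat and an out-of-order entry.
df = pd.DataFrame({"bundle_id": ["b10", "b1", "b15", "b7", "b1", "b2"]})

headers = ["user_iD"] + bundle_headers(df["bundle_id"])
print(headers)
```

Sorting on the numeric part avoids the plain string order, which would place b10 before b2. Unlike the hard-coded list, this only yields bundles that actually occur in the data, so columns such as b8 would be missing if no user touched them.
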