De-mean the data and convert to numpy array
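I am trying to implement a basic matrix factorization movie recommender system on the Movielens 1M dataset, but I am stuck. What I need to do is de-mean the data (normalize by each user's mean) and convert it from a DataFrame to a numpy array.
Code snippet: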
import pandas as pd
import numpy as np
ratings_list = [i.strip().split("::") for i in open('S:/TIP/ml-1m/ratings.dat', 'r').readlines()]
#users_list = [i.strip().split("::") for i in open('/users/nickbecker/Downloads/ml-1m/users.dat', 'r').readlines()]
movies_list = [i.strip().split("::") for i in open('S:/TIP/ml-1m/movies.dat', 'r').readlines()]
ratings_df = pd.DataFrame(ratings_list, columns = ['UserID', 'MovieID', 'Rating', 'Timestamp'], dtype = int)
movies_df = pd.DataFrame(movies_list, columns = ['MovieID', 'Title', 'Genres'])
movies_df['MovieID'] = movies_df['MovieID'].apply(pd.to_numeric)
R_df = ratings_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)
R_df.head()
R = R_df.to_numpy()
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
Error:
Traceback (most recent call last):
File "S:\TIP\Code\MF_orig.py", line 17, in <module>
user_ratings_mean = np.mean(R, axis = 1)
File "<__array_function__ internals>", line 6, in mean
File "C:\Users\sarda\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\core\fromnumeric.py", line 3257, in mean
out=out, **kwargs)
File "C:\Users\sarda\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\core\_methods.py", line 151, in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: can only concatenate str (not "int") to str
Edit:
The value of R is:
[['5' 0 0 ... 0 0 0]
['5' 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
['4' 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
ratings_df:
UserID MovieID Rating Timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
... ... ... ... ...
1000204 6040 1091 1 956716541
1000205 6040 1094 5 956704887
1000206 6040 562 5 956704746
1000207 6040 1096 4 956715648
1000208 6040 1097 4 956715569
movies_df:
MovieID Title Genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
... ... ... ...
3878 3948 Meet the Parents (2000) Comedy
3879 3949 Requiem for a Dream (2000) Drama
3880 3950 Tigerland (2000) Drama
3881 3951 Two Family House (2000) Drama
3882 3952 Contender, The (2000) Drama|Thriller
[3883 rows x 3 columns]
Dataset link:
http://files.grouplens.org/datasets/movielens/ml-1m.zip
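The values are being handled as objects (strings); even passing the dtype argument to the pandas DataFrame constructor did not convert them to integers.
You have to convert them to int explicitly: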
ratings_list = [[int(j) for j in i.strip().split("::") if j] for i in open('ratings.txt', 'r').readlines()]
Then continue as before. I tried it and it works.
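For reference, here is a minimal end-to-end sketch of that fix. It is not run against the real ratings.dat: the three sample lines held in memory are a hypothetical stand-in with the same `UserID::MovieID::Rating::Timestamp` layout, and the `astype(int)` variant is an alternative I am assuming works equally well for this data, not something tested on the full dataset. Note that, as in the question's code, the per-user mean still counts the zeros filled in for unrated movies.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for ratings.dat (same "::"-separated layout).
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "2::1193::4::978300275\n"
)
lines = sample.readlines()
columns = ["UserID", "MovieID", "Rating", "Timestamp"]

# Convert every field to int while parsing, as suggested above.
ratings_list = [[int(j) for j in line.strip().split("::") if j] for line in lines]
ratings_df = pd.DataFrame(ratings_list, columns=columns)

# Assumed alternative: build the frame from strings, then cast explicitly.
raw_list = [line.strip().split("::") for line in lines]
ratings_df_alt = pd.DataFrame(raw_list, columns=columns).astype(int)

# Users as rows, movies as columns; unrated movies become 0.
R_df = ratings_df.pivot(index="UserID", columns="MovieID", values="Rating").fillna(0)
R = R_df.to_numpy()

# Subtract each user's mean rating from that user's row.
user_ratings_mean = np.mean(R, axis=1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)

print(R_demeaned)
```

With numeric values in the frame, `np.mean` no longer hits the string concatenation error, and every row of `R_demeaned` sums to zero.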