I'm working on a Python problem to optimize a script
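- Turn the row values of the option_labels column into column headers.
- If an option_labels value exists for a given user_id, the new column gets the corresponding option_values value; otherwise it is 0.
Sample data (data.csv):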
user_id country option_values option_labels
abc456 Germany 256gb SSD
abc123 Brazil i5 intel
xyz456 France 128gb SSD
xyz123 Turkey i7 intel
abc123 Brazil 2gb nvidia
abc456 Germany 32gb RAM
xyz123 Turkey 4gb nvidia
xyz456 France 16gb RAM
Expected output:
user_id country option_values option_labels intel nvidia SSD RAM
abc456 Germany 256gb SSD 0 0 256gb 0
abc123 Brazil i5 intel i5 0 0 0
xyz456 France 128gb SSD 0 0 128gb 0
xyz123 Turkey i7 intel i7 0 0 0
abc123 Brazil 2gb nvidia 0 2gb 0 0
abc456 Germany 32gb RAM 0 0 0 32gb
xyz123 Turkey 4gb nvidia 0 4gb 0 0
xyz456 France 16gb RAM 0 0 0 16gb
I have already done this with the sample code below:
import pandas as pd
import numpy as np

data_sample = pd.read_csv("data.csv")
# Unique values of each column (names match the data.csv header)
feature_list = data_sample["option_labels"].unique().tolist()
user_list = data_sample["user_id"].unique().tolist()
country_list = data_sample["country"].unique().tolist()
opt_val_list = data_sample["option_values"].unique().tolist()

def filterd_id(check_id):
    # All rows that belong to one user
    single_id_data = data_sample[data_sample['user_id'] == check_id]
    return single_id_data

def finding_features(single_id_data):
    # Labels present for this user
    user_features = single_id_data["option_labels"].unique().tolist()
    return user_features

def check_feature(feature_list, user_features):
    # One entry per known label: opt_val_list if the user has the label, else 0.
    # Note: opt_val_list holds every option value, not just this user's value.
    feature_prs_not = []
    for i in feature_list:
        if(i in user_features):
            result = opt_val_list
        else:
            result = 0
        feature_prs_not.append(result)
    return feature_prs_not

user_id = []
country = []
df_feature = pd.DataFrame()  # accumulates one feature row per user
for i in user_list:
    check_id = i
    user_id.append(i)
    single_id_data = filterd_id(check_id)
    c = single_id_data["country"].unique().tolist()
    country.append(c)
    user_features = finding_features(single_id_data)
    feature_prst_not = check_feature(feature_list,user_features)
    df = pd.DataFrame([feature_prst_not], columns = feature_list)
    # Add this user's row (DataFrame.append was removed in pandas 2.0)
    df_feature = pd.concat([df_feature, df])
df_user_id = pd.DataFrame(user_id, columns=['all_user_id'])
df_country = pd.DataFrame(country, columns=['country_name'])
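On my real data, with nearly 100,000 IDs, the script takes a very long time to run (around 8-9 hours).
I'm still learning Python, and I'm now trying to optimize the script to reduce its run time.

If you want it to be faster, you need to vectorize. I believe this code produces the same output as yours: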
import numpy as np
# One new column per distinct label: the row's option_values where the label matches, else 0
for val in df['option_labels'].unique():
    df[val] = np.where(df['option_labels']==val, df['option_values'], 0)
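The question's bullet describes the values per user_id, and its loop appears to build one row per user, while the sample output keeps one row per input row. If one row per user is what is actually wanted (an assumption about intent, not something the sample output shows), a pivot is one possible sketch. The name per_user is hypothetical, and the sketch assumes each user has at most one value per label, because pivot raises an error on duplicates.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "user_id": ["abc456", "abc123", "xyz456", "xyz123",
                "abc123", "abc456", "xyz123", "xyz456"],
    "country": ["Germany", "Brazil", "France", "Turkey",
                "Brazil", "Germany", "Turkey", "France"],
    "option_values": ["256gb", "i5", "128gb", "i7",
                      "2gb", "32gb", "4gb", "16gb"],
    "option_labels": ["SSD", "intel", "SSD", "intel",
                      "nvidia", "RAM", "nvidia", "RAM"],
})

# One row per (user_id, country); each distinct label becomes a column.
# astype(object) keeps the cells free to hold both strings and the integer 0.
per_user = (
    df.pivot(index=["user_id", "country"],
             columns="option_labels",
             values="option_values")
      .astype(object)
      .fillna(0)
      .reset_index()
)
per_user.columns.name = None  # drop the leftover "option_labels" axis label

print(per_user.to_string(index=False))
```

Passing a list as index to pivot needs pandas 1.1 or later. If a user can have several values for the same label, pivot_table with an explicit aggfunc would be needed instead.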
This is how I reproduced your data:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO('''
"user_id","country","option_values","option_labels"
"abc456","Germany","256gb","SSD"
"abc123","Brazil","i5","intel"
"xyz456","France","128gb","SSD"
"xyz123","Turkey","i7","intel"
"abc123","Brazil","2gb","nvidia"
"abc456","Germany","32gb","RAM"
"xyz123","Turkey","4gb","nvidia"
"xyz456","France","16gb","RAM"'''))