Parallel programming approach to solve pandas problems
I have a dataframe in the following format.

df

A B Target
5 4 3
1 3 4

I am using pd.DataFrame(df.corr().iloc[:-1,-1]) to find the correlation of each column (except Target) with the Target column.

The problem is that my actual dataframe has shape (216, 72391), and it takes at least 30 minutes to process on my system. Is there any way to parallelize this, for example on a GPU? I need to compute similar values many times, so I cannot wait 30 minutes for each run.
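For reference, a minimal runnable version of the approach described above, using the toy frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 1], 'B': [4, 3], 'Target': [3, 4]})
# Correlation of every column (except Target itself) with Target:
# df.corr() gives the full correlation matrix; the last column is
# each column's correlation with Target, and iloc[:-1, -1] drops
# Target's correlation with itself.
corrs = pd.DataFrame(df.corr().iloc[:-1, -1])
print(corrs)
```

With 72391 columns this builds the entire 72391 x 72391 correlation matrix only to keep one column of it, which is where most of the time goes.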
You should look at dask. It should be able to do what you want and more.
It parallelizes most DataFrame functions.
Here I have tried to implement your operation using numba:
import numpy as np
import pandas as pd
from numba import jit, prange
#
# ------------ You can ignore the code starting from here ---------
#
# Create a random df with cols_size = 72391 and row_size = 300
df_dict = {}
for i in range(0, 72391):
    df_dict[i] = np.random.randint(100, size=300)
df_dict['target'] = np.random.randint(100, size=300)
df = pd.DataFrame(df_dict)
# ---------- Ignore code till here. This is just to generate dummy data -------

# Assume df is your original DataFrame
target_array = df['target'].values.astype(np.float64)
# You can choose to restore this column later,
# but for now we remove it, since we will call
# df.values and find the correlation of each
# column with the target
df.drop(['target'], inplace=True, axis=1)

# This function takes a NumPy 2D array and a target array as input.
# The 2D array holds the data of all the columns, and we find the
# correlation of each column with the target array.
# The 2D array is passed in transposed, i.e. its shape is (72391, 300),
# so that each row of it lines up with the target array of shape (300,).
def do_stuff(df_values, target_arr):
    # Allocate an array to store the results
    # (df_values.shape[0] = 72391, the number of columns in df)
    result = np.empty(df_values.shape[0])
    # Iterate over each column; prange lets numba parallelize the loop
    for i in prange(df_values.shape[0]):
        # Correlation of one column with the target column
        result[i] = np.corrcoef(df_values[i], target_arr)[0][1]
    return result

# Compile do_stuff with numba (parallel=True enables the prange loop)
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)

# df.T.values has shape (72391, 300): one row per original column.
# result_array contains all the correlations.
result_array = do_stuff_numba(df.T.values.astype(np.float64), target_array)
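As a further sketch, the same column-vs-target correlations can also be computed in a single vectorized NumPy pass, with no JIT and no per-column np.corrcoef call at all; on wide frames this avoids both the full correlation matrix and the Python-level loop (sizes below are shrunk for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Small stand-in for the real wide frame
df = pd.DataFrame(rng.integers(0, 100, size=(216, 500)).astype(float))
target = rng.integers(0, 100, size=216).astype(float)

X = df.values
Xc = X - X.mean(axis=0)      # center every column at once
tc = target - target.mean()  # center the target

# Pearson r for every column in one shot:
# covariance term / (column norms * target norm)
corrs = (Xc * tc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((tc ** 2).sum())
)
```

corrs[j] equals np.corrcoef(X[:, j], target)[0, 1] for every column j, up to floating-point rounding.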
Link to the colab notebook.