Computing pairwise similarity/dissimilarity efficiently with ray and numpy
I want to load a huge matrix from a parquet file and distribute the distance computation over several nodes, both to save memory and to speed up the computation.
So the input data has 42,000 rows (features) and 300,000 columns (samples):
X | sample1 | sample2 | sample3 |
---|---|---|---|
feature1 | 0 | 1 | 1 |
feature2 | 1 | 0 | 1 |
feature3 | 0 | 0 | 1 |

(The header row and first column are only there to describe the input data.)
I also have the list of samples [sample1, sample2, sample3, …], which can help to build the pairs (with itertools.combinations, as sketched below, or something else).
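For instance, generating all pairs from that list could look like this (a minimal sketch; the sample names are just placeholders):

```python
import itertools as it

samples = ["sample1", "sample2", "sample3"]  # the sample list mentioned above
pairs = list(it.combinations(samples, 2))
# -> [('sample1', 'sample2'), ('sample1', 'sample3'), ('sample2', 'sample3')]
```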
I want to apply a commutative function to each pair of samples.
With pandas I do it like this:
similarity = df[df[sample1] == df[sample2]][sample1].sum()
dissimilarity = df[df[sample1] != df[sample2]][sample1].sum()
score = similarity - dissimilarity
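For comparison, the same score computed with plain numpy on two sample columns (a minimal, self-contained sketch using a tiny simulated data frame in the layout of the table above):

```python
import numpy as np
import pandas as pd

# tiny simulated input: rows are features, columns are samples, values are 0/1
df = pd.DataFrame({"sample1": [0, 1, 0], "sample2": [1, 0, 0], "sample3": [1, 1, 1]})

a = df["sample1"].to_numpy()
b = df["sample2"].to_numpy()

matches = a == b                   # features where the two samples agree
similarity = a[matches].sum()      # df[df["sample1"] == df["sample2"]]["sample1"].sum()
dissimilarity = a[~matches].sum()  # df[df["sample1"] != df["sample2"]]["sample1"].sum()
score = similarity - dissimilarity
```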
So, is it possible to use ray together with numpy broadcasting to speed up this computation?
@Jaime's answer is very close to what I need.
Maybe I could process the samples in n batches like this:
batch1 = [sample1, sample2, …]
data = pandas.read_parquet(somewhere, columns=batch1).to_numpy()
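A rough sketch of what I have in mind with ray (the parquet path, the batch contents and the task function are placeholders; note that this only scores pairs inside each batch, so cross-batch pairs would need tasks that load two batches of columns):

```python
import itertools as it

import numpy as np
import pandas as pd
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster


@ray.remote
def score_batch(parquet_path, batch):
    """Load one batch of sample columns and score every pair inside it."""
    data = pd.read_parquet(parquet_path, columns=batch).to_numpy()
    n_features = data.shape[0]  # rows are features in this layout
    scores = {}
    for i, j in it.combinations(range(len(batch)), 2):
        matches = np.sum(data[:, i] == data[:, j])
        scores[(batch[i], batch[j])] = int(2 * matches - n_features)  # matches - mismatches
    return scores


# hypothetical batches of sample (column) names
batches = [["sample1", "sample2", "sample3"], ["sample4", "sample5", "sample6"]]
futures = [score_batch.remote("somewhere.parquet", batch) for batch in batches]
results = ray.get(futures)  # one dict of pair scores per batch
```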
Thanks for your help.
Note 1: input data with 10 samples can be simulated like this:
import random
import numpy as np
foo = np.array([[random.randint(0,1) for _ in range(0,10)] for _ in range(0,30000)])
Note 2: I tried scipy's spatial distance functions on a single node, but I did not have enough memory; that is why I want to split the computation across several nodes.
Just some thoughts here, outlining the difficulty and a possible (best?) way of computing the similarities:
import itertools as it
import numpy as np
n_samples, n_features = 42_000, 300_000
# note: sizes swapped relative to the question, which has 300,000 samples and 42,000 features
data = np.random.randint(0, 2, size=(n_samples, n_features), dtype=np.uint8)
# 42,000 * 300,000 = 12,600,000,000 bytes; already ~12.6 GB of RAM just to hold the data as uint8
# your similarity score can at best be n_features
# each sample has perfect similarity to itself
# storing each pairwise similarity in a matrix needs at least
# 42,000² * 4 bytes (np.int32) ≈ 7 GB here, and with the question's
# 300,000 samples it would be 300,000² * 4 bytes = 360 GB of RAM
# np.int16 (-32768..32767) is not enough, since scores go up to n_features = 300,000
sim_mat = np.eye(n_samples, dtype=np.int32) * n_features
# fastest way of computing similarity I could come up with
# sim = (np.sum(data[i] == data[j]) - n_features/2) * 2
# same as np.sum(data[i] == data[j]) - np.sum(data[i] != data[j])
baseline = n_features/2
for i, j in it.combinations(range(n_samples), 2):
sim_mat[i, j] = sim_mat[j, i] = (np.sum(data[i] == data[j]) - baseline) * 2
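The double loop above visits roughly 880 million pairs and is mainly there to show the cost. One way to avoid the Python-level loop (a suggestion on top of the answer above, not part of it): for 0/1 data, matches minus mismatches equals the dot product of the two rows recoded to ±1, so a whole block of the similarity matrix can be computed with a single matrix product:

```python
import numpy as np

n_samples, n_features = 1_000, 5_000  # small sizes so the sketch runs quickly
data = np.random.randint(0, 2, size=(n_samples, n_features), dtype=np.uint8)


def similarity_block(data, rows, cols):
    """Similarity scores between data[rows] and data[cols] via one matmul.

    2*data - 1 recodes 0/1 to -1/+1, so the dot product of two recoded rows
    is exactly (#matches - #mismatches).
    """
    a = 2 * data[rows].astype(np.int32) - 1  # int32 so the sums cannot overflow
    b = 2 * data[cols].astype(np.int32) - 1
    return a @ b.T  # shape (len(rows), len(cols))


block = similarity_block(data, np.arange(0, 100), np.arange(100, 200))

# spot-check one entry against the loop formula used above
i, j = 0, 100
assert block[0, 0] == (np.sum(data[i] == data[j]) - n_features / 2) * 2
```

Each block only needs the two recoded slices in memory, so block sizes can be chosen to fit the RAM available on a node.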
Some helper functions that might be useful:
def similarity_from_to(data: np.ndarray, from_i: int, to_i: int) -> int:
    """
    Computes the similarity from sample `data[from_i]` to sample `data[to_i]`.

    Parameters
    ----------
    data : np.ndarray
        2D data matrix of shape (N_samples, N_features)
    from_i : int
        index of the first sample, in [0, N_samples)
    to_i : int
        index of the second sample, in [0, N_samples)

    Returns
    -------
    similarity : int
        similarity score in [-N_features, N_features]
    """
    return int((np.sum(data[from_i] == data[to_i]) - data.shape[1] / 2) * 2)

def similarities_from(data: np.ndarray, from_i: int) -> np.ndarray:
    """
    Computes the similarities from sample `data[from_i]` to all samples.

    Parameters
    ----------
    data : np.ndarray
        2D data matrix of shape (N_samples, N_features)
    from_i : int
        index of the target sample, in [0, N_samples)

    Returns
    -------
    similarities : np.ndarray
        similarity scores of all samples to data[`from_i`], with shape (N_samples,)
    """
    baseline = data.shape[1] / 2  # use the data itself instead of the global n_features
    return np.asarray(
        [(np.sum(data[from_i] == data[to_i]) - baseline) * 2 for to_i in range(len(data))],
        dtype=np.int32,
    )
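Finally, a hedged sketch of how this kind of per-row work could be distributed with ray: `ray.put` places the data in the shared object store once (so workers on the same node do not each copy the ~12.6 GB array), and each task computes a chunk of rows of the similarity matrix. The chunked helper below just wraps the same formula as `similarities_from`; the names and sizes are made up for illustration, and for the real 300,000 samples each finished block would have to be written to disk rather than assembling the full 360 GB matrix in memory.

```python
import numpy as np
import ray

ray.init()

n_samples, n_features = 2_000, 10_000  # toy sizes for the sketch
data = np.random.randint(0, 2, size=(n_samples, n_features), dtype=np.uint8)
data_ref = ray.put(data)  # stored once in ray's object store, shared by tasks on a node


@ray.remote
def similarities_for_rows(data, row_indices):
    """Compute the similarity rows for a chunk of samples (same score as above)."""
    baseline = data.shape[1] / 2
    return np.asarray(
        [[(np.sum(data[i] == data[j]) - baseline) * 2 for j in range(len(data))]
         for i in row_indices],
        dtype=np.int32,
    )


chunks = np.array_split(np.arange(n_samples), 8)  # e.g. one task per chunk of rows
futures = [similarities_for_rows.remote(data_ref, chunk) for chunk in chunks]
sim_mat = np.vstack(ray.get(futures))  # shape (n_samples, n_samples)
```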