Constructing a triangular distance matrix dataframe in pyspark?
I want to build a distance matrix from the values of a dataframe in pyspark. What I currently have is:
+----+-------------+
| id | list |
+----+-------------+
| 1 | [a, b, ...] |
+----+-------------+
| 2 | [c, d, ...] |
+----+-------------+
| 3 | [e, f, ...] |
+----+-------------+
I want to do something like the following, using my own distance function:
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        dist = calculate_distance(features[i], features[j])
        add_row_to_distance_df([ids[i], ids[j], dist])
Edit: the expected output is
+-----+-----+-----------------------------+
| id1 | id2 | dist |
+-----+-----+-----------------------------+
| 1 | 2 | d([a, b, ...], [c, d, ...]) |
+-----+-----+-----------------------------+
| 1 | 3 | d([a, b, ...], [e, f, ...]) |
+-----+-----+-----------------------------+
| 2 | 3 | d([c, d, ...], [e, f, ...]) |
+-----+-----+-----------------------------+
How can I do this?
You can use cartesian() to form every pair of rows and filter() to keep only the triangle you need, for example:
In []:
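# Placeholder distance function; swap in your own metric.
def calculate_distance(a, b):
    return f'd({a}, {b})'  # f-strings require Python 3.6+

# `sc` is assumed to be an existing SparkContext.
rdd = sc.parallelize([(1, ['a', 'b', 'c']), (2, ['c', 'd', 'e']), (3, ['e', 'f', 'g'])])
(rdd.cartesian(rdd)
    .filter(lambda x: x[0][0] < x[1][0])  # keep only pairs with id1 < id2
    .map(lambda x: (x[0][0], x[1][0], calculate_distance(x[0][1], x[1][1])))
    .collect())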
Out[]:
[(1, 2, "d(['a', 'b', 'c'], ['c', 'd', 'e'])"),
(1, 3, "d(['a', 'b', 'c'], ['e', 'f', 'g'])"),
(2, 3, "d(['c', 'd', 'e'], ['e', 'f', 'g'])")]
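
To see why the `id1 < id2` filter gives exactly the triangle from the question's nested loop, here is a minimal plain-Python sketch with no Spark involved. It uses `itertools.product` as a stand-in for `cartesian()` and `itertools.combinations` as a stand-in for the nested loop. The names `rows`, `triangle` and `loop_version` are made up for illustration, and the sketch assumes the ids are unique and that `rows` is sorted by id.

```python
from itertools import combinations, product


def calculate_distance(a, b):
    # Placeholder distance, as in the answer above: it only formats its inputs.
    return f'd({a}, {b})'


# Hypothetical sample data matching the RDD in the answer.
rows = [(1, ['a', 'b', 'c']), (2, ['c', 'd', 'e']), (3, ['e', 'f', 'g'])]

# All n * n pairs, then keep the strict upper triangle (id1 < id2).
# This mirrors rdd.cartesian(rdd).filter(...).map(...).
triangle = [
    (left[0], right[0], calculate_distance(left[1], right[1]))
    for left, right in product(rows, rows)
    if left[0] < right[0]
]

# The nested loop from the question: each unordered pair exactly once.
# Matches `triangle` only because `rows` is sorted by id.
loop_version = [
    (id1, id2, calculate_distance(list1, list2))
    for (id1, list1), (id2, list2) in combinations(rows, 2)
]

print(triangle == loop_version)  # both contain n * (n - 1) / 2 rows
```

Note that the filter only discards pairs after the full n * n product has been generated, so the cartesian step still dominates the cost for large inputs.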