How can I determine which curve is closest to a given set of points?
I have several dataframes, each containing two columns of x and y values, so that each row represents a point on a curve. The different dataframes then represent contour lines on a map. I have another series of data points (fewer of them), and I would like to see which contour they are closest to, on average.

I would like to determine the distance from each data point to each point on the curve, using sqrt(x^2+y^2) - sqrt(x_1^2 + y_1^2), summing over every point on the curve. The trouble is that the curve has several thousand points while there are only a few dozen data points to evaluate, so I can't simply put them in columns next to each other.

I think I need to loop over the data points and check the squared distance between each of them and every point in the curve. I don't know whether there is a simple function or module that can do this. Thanks in advance!
Edit: Thanks for the comments. @Alexander: I have tried the vectorize function with a sample dataset, as shown below. The contours I actually use contain several thousand data points, and there are more than 100 datasets to compare, so I would like to automate as much as possible. I am currently able to create the distance measurements from my contour to the first data point, but ideally I would like to loop over j as well. When I try, I get an error:
import numpy as np
from numpy import vectorize
import pandas as pd
from pandas import DataFrame
df1 = {'X1':['1', '2', '2', '3'], 'Y1':['2', '5', '7', '9']}
df1 = DataFrame(df1, columns=['X1', 'Y1'])
df2 = {'X2':['3', '5', '6'], 'Y2':['10', '15', '16']}
df2 = DataFrame(df2, columns=['X2', 'Y2'])
df1=df1.astype(float)
df2=df2.astype(float)
Distance=pd.DataFrame()
i = range(0, len(df1))
j = range(0, len(df2))
def myfunc(x1, y1, x2, y2):
    return np.sqrt((x2-x1)**2+np.sqrt(y2-y1)**2)
vfunc=np.vectorize(myfunc)
Distance['Distance of Datapoint j to Contour']=vfunc(df1.iloc[i]['X1'], df1.iloc[i]['Y1'], df2.iloc[0]['X2'], df2.iloc[0]['Y2'])
Distance['Distance of Datapoint j to Contour']=vfunc(df1.iloc[i]['X1'], df1.iloc[i]['Y1'], df2.iloc[1]['X2'], df2.iloc[1]['Y2'])
Distance
For the distance, you need to change the formula to
from math import sqrt

def getDistance(x, y, x_i, y_i):
    return sqrt((x_i - x)**2 + (y_i - y)**2)
where (x, y) is your data point and (x_i, y_i) is a point on the curve.
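A quick sanity check (my own example, not part of the original answer):

print(getDistance(0.0, 0.0, 3.0, 4.0))  # classic 3-4-5 triangle, prints 5.0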
Consider vectorizing with NumPy. Depending on your use case, explicitly looping over your data points may well be less efficient, but it might be fast enough. (If you need to run this regularly, I think vectorization will easily beat the explicit approach.) It could look something like this:
import numpy as np  # Universal abbreviation for the module

datapoints = np.random.rand(3, 2)   # Array with randomized entries of size 3x2 (imagine it as 3 sets of x- and y-values)
contour1 = np.random.rand(1000, 2)  # Other than the size (which is 1000x2) no different from datapoints
contour2 = np.random.rand(1000, 2)
contour3 = np.random.rand(1000, 2)

def squareDistanceUnvectorized(datapoint, contour):
    retVal = 0.
    print("Using datapoint with values x:{}, y:{}".format(datapoint[0], datapoint[1]))
    lengthOfContour = np.size(contour, 0)  # This gets you the number of rows, i.e. the number of contour points
    for pointID in range(lengthOfContour):
        squaredXDiff = np.square(contour[pointID, 0] - datapoint[0])
        squaredYDiff = np.square(contour[pointID, 1] - datapoint[1])
        retVal += np.sqrt(squaredXDiff + squaredYDiff)
    retVal = retVal / lengthOfContour  # As we want the average, we divide the sum by the element count
    return retVal
if __name__ == "__main__":
    noOfDatapoints = np.size(datapoints, 0)
    contID = 0
    for currentDPID in range(noOfDatapoints):
        dist1 = squareDistanceUnvectorized(datapoints[currentDPID, :], contour1)
        dist2 = squareDistanceUnvectorized(datapoints[currentDPID, :], contour2)
        dist3 = squareDistanceUnvectorized(datapoints[currentDPID, :], contour3)
        # The contour with the smallest mean distance is the closest one
        if dist1 < dist2 and dist1 < dist3:
            contID = 1
        elif dist2 < dist1 and dist2 < dist3:
            contID = 2
        elif dist3 < dist1 and dist3 < dist2:
            contID = 3
        else:
            contID = 0
        if contID == 0:
            print("Datapoint {} is in between two contours".format(currentDPID))
        else:
            print("Datapoint {} is closest to contour {}".format(currentDPID, contID))
Okay, now on to vector-land.

I took the liberty of adapting this part to what I believe your dataset looks like. Give it a try and let me know if it works.
import numpy as np
import pandas as pd

# Generate 1000 points (2-dim vectors) with random values between 0 and 1. Make them strings afterwards.
# This is the first contour
random2Ddata1 = np.random.rand(1000, 2)
listOfX1 = [str(x) for x in random2Ddata1[:, 0]]
listOfY1 = [str(y) for y in random2Ddata1[:, 1]]

# Do the same for a second contour, except that we de-center this 255 units into the first dimension
random2Ddata2 = np.random.rand(1000, 2) + [255, 0]
listOfX2 = [str(x) for x in random2Ddata2[:, 0]]
listOfY2 = [str(y) for y in random2Ddata2[:, 1]]

# After this step, our 'contours' are basically two blobs of datapoints whose centers are approx. 255 units apart.

# Generate a set of 4 datapoints and make them a Pandas-DataFrame
datapoints = {'X': ['0.5', '0', '255.5', '0'], 'Y': ['0.5', '0', '0.5', '-254.5']}
datapoints = pd.DataFrame(datapoints, columns=['X', 'Y'])

# Do the same for the two contours
contour1 = {'Xf': listOfX1, 'Yf': listOfY1}
contour1 = pd.DataFrame(contour1, columns=['Xf', 'Yf'])
contour2 = {'Xf': listOfX2, 'Yf': listOfY2}
contour2 = pd.DataFrame(contour2, columns=['Xf', 'Yf'])

# We now have 4 datapoints.
# - The first datapoint is basically where we expect the mean of the first contour to be.
#   Contour 1 consists of 1000 points with x-, y-values between 0 and 1
# - The second datapoint is at the origin. Its distances should be similar to the ones of the first datapoint
# - The third datapoint would be the result of shifting the first datapoint 255 units into the positive first dimension
# - The fourth datapoint would be the result of shifting the first datapoint 255 units into the negative second dimension

# Transformation into numpy arrays
# First the x and y values of the data points
dpArray = ((datapoints.values).T).astype(float)
c1Array = ((contour1.values).T).astype(float)
c2Array = ((contour2.values).T).astype(float)

# This did the following:
# - Transform the datapoints and contours into numpy arrays
# - Transpose them afterwards so that if we want all x values, we can write var[0,:] instead of var[:,0].
#   A personal preference, maybe
# - Convert all the values into floats.

# Now, we iterate through the contours. If you have a lot of them, putting them into a list beforehand would do the job
for contourid, contour in enumerate([c1Array, c2Array]):
    # Now for the datapoints
    for _index, _value in enumerate(dpArray[0, :]):
        # The next two lines do vectorization magic.
        # First, we square the difference between one dpArray entry and the contour x values.
        # You might notice that contour[0,:] returns a 1x1000 vector while dpArray[0,_index] is a 1x1 float value.
        # This works because dpArray[0,_index] is broadcast to fit the size of contour[0,:].
        dx = np.square(dpArray[0, _index] - contour[0, :])
        # The same happens for dpArray[1,_index] and contour[1,:]
        dy = np.square(dpArray[1, _index] - contour[1, :])
        # Now, we take (for one datapoint and one contour) the mean value and print it.
        # You could write it into an array or do basically anything with it that you can imagine
        distance = np.mean(np.sqrt(dx + dy))
        print("Mean distance between contour {} and datapoint {}: {}".format(contourid + 1, _index + 1, distance))
# But you want to be able to call this... so here we go, generating a function out of it!
def getDistanceFromDatapointsToListOfContoursFindBetterName(datapoints, listOfContourDataFrames):
    """Takes a DataFrame with points and a list of different contours to return the average distance for each combination"""
    dpArray = ((datapoints.values).T).astype(float)
    listOfContours = []
    for item in listOfContourDataFrames:
        listOfContours.append(((item.values).T).astype(float))
    retVal = np.zeros((np.size(dpArray, 1), len(listOfContours)))
    for contourid, contour in enumerate(listOfContours):
        for _index, _value in enumerate(dpArray[0, :]):
            dx = np.square(dpArray[0, _index] - contour[0, :])
            dy = np.square(dpArray[1, _index] - contour[1, :])
            distance = np.mean(np.sqrt(dx + dy))
            print("Mean distance between contour {} and datapoint {}: {}".format(contourid + 1, _index + 1, distance))
            retVal[_index, contourid] = distance
    return retVal
# And just to see that it is, indeed, returning the same results, run it once
getDistanceFromDatapointsToListOfContoursFindBetterName(datapoints, [contour1, contour2])
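If you later want to drop the remaining Python loops as well, here is a minimal sketch of a fully vectorized variant built on scipy.spatial.distance.cdist (assuming SciPy is available; the function name is mine, and note it takes points as rows, i.e. without the transpose used above):

import numpy as np
from scipy.spatial import distance

def meanDistancesToContours(datapoints, listOfContours):
    """For each datapoint, the mean euclidean distance to each contour.

    datapoints: (n, 2) array; listOfContours: list of (m, 2) arrays.
    Returns an (n, number_of_contours) array, same layout as retVal above.
    """
    return np.column_stack([
        distance.cdist(datapoints, contour).mean(axis=1)  # (n, m) -> (n,)
        for contour in listOfContours
    ])

# e.g. meanDistancesToContours(random2Ddata1[:4], [random2Ddata1, random2Ddata2])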
General idea
"curve"实际上是一个有很多点的多边形。肯定有一些库可以计算多边形和点之间的距离。但通常它会是这样的:
- 计算 "approximate distance" 到整个多边形,例如到多边形的边界框(从点到 4 条线段),或到边界框的中心
- 计算到多边形线的距离。如果你有太多的点,那么多边形的额外步骤 "resolution" 可能会减少。
- 找到的最小距离是点到多边形的距离。
- 对每个点和每个多边形重复
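A sketch of that recipe with shapely (a toy square as the "contour"; assuming the contour points are ordered so a valid polygon can be built):

from shapely.geometry import Point, Polygon

contour = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])  # toy contour with ordered points
p = Point(1, 2)

# shapely's distance() already returns the minimum over all segments,
# so the "smallest distance found" steps collapse into a single call
print(p.distance(contour.exterior))  # distance to the outline itself -> 1.0
print(p.distance(contour))           # 0.0, because the point lies inside the polygon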
Existing solutions
Some libraries can already do this:
- shapely question, shapely Geo-Python docs
- scipy.spatial.distance: scipy can be used to calculate the distance between arbitrary points
- numpy.linalg.norm(point1-point2): some answers propose different ways of calculating distances with numpy; some even show performance benchmarks
- sklearn.neighbors: not really about curves and distances to them, but can be used if you want to check "to which area a point is most likely related"
- And you can always calculate the distance yourself, using D(x1, y1, x2, y2) = sqrt((x₂-x₁)² + (y₂-y₁)²), and search for the combination of points that gives the minimum distance
Example:
# get distance from points of 1 dataset to all the points of another dataset
from scipy.spatial import distance
d = distance.cdist(df1.to_numpy(), df2.to_numpy(), 'euclidean')
print(d)
# Results will be a matrix of all possible distances:
# [[ D(Point_df1_0, Point_df2_0), D(Point_df1_0, Point_df2_1), D(Point_df1_0, Point_df2_2)]
# [ D(Point_df1_1, Point_df2_0), D(Point_df1_1, Point_df2_1), D(Point_df1_1, Point_df2_2)]
# [ D(Point_df1_2, Point_df2_0), D(Point_df1_2, Point_df2_1), D(Point_df1_2, Point_df2_2)]
# [ D(Point_df1_3, Point_df2_0), D(Point_df1_3, Point_df2_1), D(Point_df1_3, Point_df2_2)]]
[[ 8.24621125 13.60147051 14.86606875]
[ 5.09901951 10.44030651 11.70469991]
[ 3.16227766 8.54400375 9.8488578 ]
[ 1. 6.32455532 7.61577311]]
What to do next is up to you. For example, as a metric of the "general distance between curves" you could:
- choose the minimum value in each row and each column (if you skip some columns/rows, you may end up with candidates that only "match a part of the contour"), and take their median:
np.median(np.hstack([np.amin(d, axis) for axis in range(len(d.shape))]))
Or you could take the median:
- of all distances:
np.median(d)
- of the "smallest 2/3 of distances":
np.median(d[d < np.percentile(d, 66, interpolation='higher')])
- of the "smallest distances that cover at least each row and each column":
for min_value in np.sort(d, None):
    chosen_indices = d <= min_value
    if np.all(np.hstack([np.amax(chosen_indices, axis) for axis in range(len(chosen_indices.shape))])):
        break
similarity = np.median(d[chosen_indices])
Or you could use a different kind of distance from the start (for example, the "correlation distance" looks promising for your task).

Maybe "Procrustes analysis, a similarity test for two data sets" could be used together with distances.

Maybe you could use the minkowski distance as a similarity metric.
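As a sketch of those alternatives: scipy's cdist accepts other metrics directly, so the example above only needs a different metric name (whether these metrics actually suit your data is an assumption you would have to verify):

from scipy.spatial import distance

d_corr = distance.cdist(df1.to_numpy(), df2.to_numpy(), 'correlation')     # correlation distance
d_mink = distance.cdist(df1.to_numpy(), df2.to_numpy(), 'minkowski', p=3)  # minkowski with p=3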
Alternative approach
Another approach would be to use some "geometry" library to compare the areas of concave hulls:

Build concave hulls for the contours and for the "candidate datapoints" (not easy, but possible: using shapely, using concaveman). But if you are sure your contours are already ordered and have no overlapping segments, you can build polygons directly from those points, without needing concave hulls.
使用"intersection area"减去"non-common area"作为相似度的度量(shapely
can):
- Non-common 区域是:
union - intersection
或只是 "symmetric difference"
- 最终指标:
intersection.area - symmetric_difference.area
(intersection, area)
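A minimal shapely sketch of that metric (two overlapping toy squares, just to show the calls):

from shapely.geometry import Polygon

a = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
b = Polygon([(1, 0), (3, 0), (3, 2), (1, 2)])

intersection = a.intersection(b)
symmetric_difference = a.symmetric_difference(b)  # union minus intersection
score = intersection.area - symmetric_difference.area
print(score)  # 2.0 - 4.0 = -2.0 here; higher means more similar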
This approach may be better than working with distances in some situations, for example when:
- you want "fewer points covering the whole area" to win over a "huge amount of very close points that cover only half of the area"
- comparing candidates with different scores should be more straightforward
But it also has its drawbacks (just draw some examples on paper and find them experimentally).
Other ideas:

You don't have to use polygons or concave hulls:
- Build a linear ring from your points and then use contour.buffer(some_distance). This way you ignore the "internal area" of the contour and only compare the contour itself (with a tolerance of some_distance). The distance between the centroids (or twice that) could be used as the value of some_distance (see the sketch after this list)
- You could use ops.polygonize to build polygons/lines from segments
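A sketch of the buffered-ring idea (the value of some_distance here is an arbitrary assumption):

from shapely.geometry import LinearRing, Point

ring = LinearRing([(0, 0), (4, 0), (4, 4), (0, 4)])  # ordered contour points
some_distance = 0.5
band = ring.buffer(some_distance)  # a band around the outline; the interior is ignored

print(band.contains(Point(4.2, 2)))  # True: within 0.5 of the outline
print(band.contains(Point(2, 2)))    # False: deep inside the contour, far from the outline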
Instead of using intersection.area - symmetric_difference.area you could:
- Snap one object to the other, and then compare the snapped object to the original
Before comparing the real objects, you can compare "simpler" versions of the objects to filter out obvious mismatches (see the sketch below):
- for example, you can check whether the boundaries of the objects intersect
- or you can simplify the geometries before comparing them
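And a sketch of that pre-filtering step, reusing the toy polygons a and b from the area-metric sketch above (the simplify tolerance is an arbitrary assumption):

# cheap checks first, expensive comparison only if they pass
if a.boundary.intersects(b.boundary):  # do the outlines touch at all?
    rough_a = a.simplify(0.1)          # fewer vertices, roughly the same shape
    rough_b = b.simplify(0.1)
    score = rough_a.intersection(rough_b).area - rough_a.symmetric_difference(rough_b).area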