Pandas

Question

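I have a bunch of geographical data, shown below. I would like to group the data into bins of 0.2 degrees in longitude and 0.2 degrees in latitude.

While this is trivial to do for either latitude or longitude alone, what is the most appropriate way of doing it for both variables?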

|User_ID  |Latitude  |Longitude|Datetime           |u    |v    |
|---------|----------|---------|-------------------|-----|-----|
|222583401|41.4020375|2.1478710|2014-07-06 20:49:20|0.3  | 0.2 |
|287280509|41.3671346|2.0793115|2013-01-30 09:25:47|0.2  | 0.7 |
|329757763|41.5453577|2.1175164|2012-09-25 08:40:59|0.5  | 0.8 |
|189757330|41.5844998|2.5621569|2013-10-01 11:55:20|0.4  | 0.4 |
|624921653|41.5931846|2.3030671|2013-07-09 20:12:20|1.2  | 1.4 |
|414673119|41.5550136|2.0965829|2014-02-24 20:15:30|2.3  | 0.6 |
|414673119|41.5550136|2.0975829|2014-02-24 20:16:30|4.3  | 0.7 |
|414673119|41.5550136|2.0985829|2014-02-24 20:17:30|0.6  | 0.9 |

What I have done so far is create two linear spaces:

lonbins = np.linspace(df.Longitude.min(), df.Longitude.max(), 10) 
latbins = np.linspace(df.Latitude.min(), df.Latitude.max(), 10)

Then I can group using groupby:

groups = df.groupby(pd.cut(df.Longitude, lonbins))

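I could then obviously iterate over the groups to create a second level. However, since my goal is to do statistical analysis on each group and possibly display the groups on a map, this does not look very handy.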

bucket = {}
for name, group in groups:
    print(name)
    bucket[name] = group.groupby(pd.cut(group.Latitude, latbins))

For example, I would like to make a heatmap showing the number of rows per lat/lon box, show the distribution of speed in each lat/lon box, ...

Answer 1

How about this?

step = 0.2
to_bin = lambda x: np.floor(x / step) * step
df["latBin"] = to_bin(df.Latitude)
df["lonBin"] = to_bin(df.Longitude)
groups = df.groupby(["latBin", "lonBin"])
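As a minimal sketch of how the grouped result could feed the statistics mentioned in the question, the snippet below applies the same floor-based binning to a few of the sample rows, then counts rows and averages u per cell. The small DataFrame and the names counts, mean_u and grid are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# A few rows from the question's sample data (illustrative subset).
df = pd.DataFrame({
    "Latitude":  [41.4020375, 41.3671346, 41.5453577, 41.5844998, 41.5550136],
    "Longitude": [2.1478710, 2.0793115, 2.1175164, 2.5621569, 2.0965829],
    "u":         [0.3, 0.2, 0.5, 0.4, 2.3],
})

step = 0.2
# Snap every coordinate down to the lower edge of its 0.2-degree cell.
df["latBin"] = np.floor(df["Latitude"] / step) * step
df["lonBin"] = np.floor(df["Longitude"] / step) * step

groups = df.groupby(["latBin", "lonBin"])
counts = groups.size()        # number of rows per lat/lon cell
mean_u = groups["u"].mean()   # mean of u per lat/lon cell

# Pivot the counts into a 2D table (rows: latitude cells, columns: longitude
# cells); empty cells are filled with 0. This shape suits a heatmap.
grid = counts.unstack(fill_value=0)
print(grid)
```

Only cells that actually contain data appear in counts, so the result stays small even when the covered area is large.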

Answer 2

Here is a noticeably faster solution:

Option 1:

Use this option if you want the bin edges to fall exactly every 0.2 degrees, on floats such as [66.6, 66.8, 67.0, ...]:

import numpy as np
step = 0.2
boundingBox = {"lat":
               {"min": np.floor(df.Latitude.min()/step)*step,
                "max": np.ceil(df.Latitude.max()/step)*step},
               "lon":
               {"min": np.floor(df.Longitude.min()/step)*step,
                "max": np.ceil(df.Longitude.max()/step)*step}
               }
# n bins need n + 1 edges; round() guards against floating-point error
# in the division before the conversion to int.
noOfLatEdges = int(round(
    (boundingBox["lat"]["max"] - boundingBox["lat"]["min"]) / step)) + 1
noOfLonEdges = int(round(
    (boundingBox["lon"]["max"] - boundingBox["lon"]["min"]) / step)) + 1
latBins = np.linspace(boundingBox["lat"]["min"],
                      boundingBox["lat"]["max"], noOfLatEdges)
lonBins = np.linspace(boundingBox["lon"]["min"],
                      boundingBox["lon"]["max"], noOfLonEdges)
H, _, _ = np.histogram2d(df.Latitude, df.Longitude, bins=[latBins, lonBins])
binnedData = H.T # Important as otherwise the axes are wrong way around (I missed this for ages, see "Notes" of Numpy docs for histogram2d())
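To make the edge construction and the axis convention concrete, here is a small self-contained sketch of the same idea on made-up coordinates. The arrays and variable names are assumptions for illustration only. It builds edges that are exact multiples of the step and checks which index of H corresponds to which variable.

```python
import numpy as np

step = 0.2
# Made-up coordinates, chosen to avoid values that sit exactly on a bin edge.
lat = np.array([41.42, 41.37, 41.55, 41.58, 41.59])
lon = np.array([2.15, 2.08, 2.12, 2.56, 2.30])

# Snap the bounding box outward to multiples of step.
lat_min = np.floor(lat.min() / step) * step
lat_max = np.ceil(lat.max() / step) * step
lon_min = np.floor(lon.min() / step) * step
lon_max = np.ceil(lon.max() / step) * step

# n bins need n + 1 edges.
n_lat_bins = int(round((lat_max - lat_min) / step))
n_lon_bins = int(round((lon_max - lon_min) / step))
lat_edges = np.linspace(lat_min, lat_max, n_lat_bins + 1)
lon_edges = np.linspace(lon_min, lon_max, n_lon_bins + 1)

# The first argument is binned along axis 0 of H, the second along axis 1,
# so H[i, j] counts points in latitude bin i and longitude bin j.
H, _, _ = np.histogram2d(lat, lon, bins=[lat_edges, lon_edges])
binnedData = H.T  # after transposing: rows are longitude bins, columns latitude bins
print(H.shape, binnedData.shape)
```

Whether the transpose is wanted depends on how the plotting call interprets rows and columns, so it is worth checking the shape against the expected number of latitude and longitude bins.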

Option 2:

Use this option if you want the bin edges to fall almost exactly every 0.2 degrees, on floats such as [66.653112, 66.853112, 67.053112, ...]:

import numpy as np
step = 0.2
boundingBox = {"lat":
               {"min": df.Latitude.min(), "max": df.Latitude.max()},
               "lon":
               {"min": df.Longitude.min(), "max": df.Longitude.max()}
               }
noOfLatEdges = int(
    (boundingBox["lat"]["max"] - boundingBox["lat"]["min"]) / step)
noOfLonEdges = int(
    (boundingBox["lon"]["max"] - boundingBox["lon"]["min"]) / step)
H, xedges, yedges = np.histogram2d(df.Latitude, df.Longitude, bins=[
                                   noOfLatEdges, noOfLonEdges])
binnedData = H.T # Important as otherwise the axes are wrong way around (I missed this for ages, see "Notes" of Numpy docs for histogram2d())
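The following sketch, on made-up coordinates, illustrates why the bins in this option are only approximately one step wide. When np.histogram2d receives bin counts instead of edges, it spreads that many equal-width bins between the minimum and maximum of the data. The arrays, the variable names and the max(1, ...) guard are assumptions added for illustration.

```python
import numpy as np

step = 0.2
# Made-up coordinates for illustration.
lat = np.array([41.42, 41.37, 41.55, 41.58, 41.59, 41.95])
lon = np.array([2.15, 2.08, 2.12, 2.56, 2.30, 2.71])

# Truncating the division gives the number of whole steps that fit in the
# data range; max(1, ...) avoids asking for zero bins on a very narrow range.
n_lat_bins = max(1, int((lat.max() - lat.min()) / step))
n_lon_bins = max(1, int((lon.max() - lon.min()) / step))

H, lat_edges, lon_edges = np.histogram2d(lat, lon, bins=[n_lat_bins, n_lon_bins])

# The edges span exactly [min, max], so each bin is (max - min) / n wide,
# which is at least step and usually slightly more.
lat_width = np.diff(lat_edges)[0]
lon_width = np.diff(lon_edges)[0]
print(lat_width, lon_width)
```

The wider the data range relative to the step, the closer the actual bin width gets to the requested step.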

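Runtime comparison between Options 1 and 2 and Martin Valgur's accepted solution:

This answer may seem a lot longer to many, but I recently worked on a project where this was a time-critical component that ran very often, so this approach helped me massively (percentage-wise) in reducing our API's computation time.

Runtimes were measured on a 16k-row DataFrame binned into 964 x 1381 buckets.

Martin Valgur's currently accepted answer has a runtime of: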

~~30.1 ms ± 799 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)~~

Edit: Probably closer to 1.5 ms when the bin calculation is done in a vectorized fashion.
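The edit does not say which variant was originally timed. As one plausible illustration of the difference, the sketch below computes the same bins element by element with Series.apply and in a single vectorized expression over the whole column. The random data and variable names are assumptions, and no timings are claimed here.

```python
import numpy as np
import pandas as pd

step = 0.2
rng = np.random.default_rng(0)
df = pd.DataFrame({"Latitude": rng.uniform(41.0, 42.0, 1000)})

# Element-wise: the Python lambda is called once per row.
lat_bin_apply = df["Latitude"].apply(lambda x: np.floor(x / step) * step)

# Vectorized: one NumPy operation over the whole column.
lat_bin_vec = np.floor(df["Latitude"] / step) * step

# Both approaches yield the same bins; only the execution path differs.
same = np.allclose(lat_bin_apply, lat_bin_vec)
print(same)
```

Running each variant under %timeit on the real data would show how much of the gap comes from the per-row function calls.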

Option 1 has a runtime of:

6.78 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Option 2 has a runtime of:

8.17 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

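So Option 1 is almost 5 times faster than the accepted solution as originally timed (30.1 ms versus 6.78 ms). However, if the bucket grid is too large to fit in memory, Martin's Pandas solution might be more efficient.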
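A rough back-of-the-envelope sketch of that memory trade-off, using the grid size and row count quoted above: np.histogram2d allocates one float64 per cell whether or not the cell is occupied, while a groupby result holds at most one entry per input row. The variable names are illustrative, and n_rows is the approximate "16k" figure rather than an exact count.

```python
import numpy as np

n_lat_bins, n_lon_bins = 964, 1381  # bucket grid from the benchmark
n_rows = 16_000                     # approximate DataFrame size ("16k rows")

# Dense histogram: one float64 count per cell, empty or not.
dense_cells = n_lat_bins * n_lon_bins
dense_bytes = dense_cells * np.dtype(np.float64).itemsize

# Grouped result: at most one non-empty cell per input row.
max_nonempty_cells = min(n_rows, dense_cells)
fraction_nonempty = max_nonempty_cells / dense_cells

print(dense_cells, dense_bytes, round(fraction_nonempty, 4))
```

At this size the dense grid is only about 10 MB, but it grows with the product of the two bin counts, whereas the grouped result grows with the number of occupied cells.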

Pandas - Group/bins of data per longitude/latitude

python

binning
