MongoDB querying for aggregation with count of multiple values

I am using Mongoid to talk to MongoDB in one of my Rails applications.

class Tracking
  include Mongoid::Document
  include Mongoid::Timestamps

  field :article_id,      type: String
  field :action,          type: String # like | comment
  field :actor_gender,    type: String # male | female | unknown

  field :city,            type: String
  field :state,           type: String
  field :country,         type: String
end

Here I want to fetch the records in this tabular format:

article_id | state | male_like_count | female_like_count | unknown_gender_like_count | date

juhkwu2367 | California | 21 | 7  | 1 | 11-20-2015
juhkwu2367 | New York   | 62 | 23 | 3 | 11-20-2015
juhkwu2367 | Vermont    | 48 | 27 | 3 | 11-20-2015
juhkwu2367 | California | 21 | 7  | 1 | 11-21-2015
juhkwu2367 | New York   | 62 | 23 | 3 | 11-21-2015
juhkwu2367 | Vermont    | 48 | 27 | 3 | 11-21-2015

The query inputs here are:

article_id 
country
date range (from and to)
action (is `like` in this scenario)
sort_by [ date | state | male_like_count | female_like_count ]

This is what I am trying, based on the example in https://docs.mongodb.org/v3.0/reference/operator/aggregation/group/

db.trackings.aggregate(
   [
      {
        $group : {
           _id : { month: { $month: "$created_at" }, day: { $dayOfMonth: "$created_at" }, year: { $year: "$created_at" }, article_id:  "$article_id", state: "$state", country: "$country"},
           article_id: "$article_id",
           country: ??,
           state: "$state",
           male_like_count: { $sum:  ?? },
           female_like_count: { $sum:  ?? },
           unknown_gender_like_count: { $sum:  ?? },
           date: ??
        }
      }
   ]
)

So what should I put in place of the ?? to compare the gender counts, and how do I add a clause for the sorting option?

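You are mainly looking for the $cond operator, which evaluates a condition and returns whether a particular counter should be incremented or not, but you are also missing some other aggregation concepts here: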

db.trackings.aggregate([
    { "$match": {
        "created_at": { "$gte": startDate, "$lt": endDate },
        "country": "US",
        "action": "like"
    }},
    { "$group": {
        "_id": { 
            "date": {
                "month": { "$month": "$created_at" }, 
                "day": { "$dayOfMonth": "$created_at" },
                "year": { "$year": "$created_at" }
            },
            "article_id":  "$article_id", 
            "state": "$state"
        },
        "male_like_count": { 
            "$sum": {
                "$cond": [
                    { "$eq": [ "$actor_gender", "male" ] },
                    1,
                    0
                ]
            }
        },
        "female_like_count": { 
            "$sum": {
                "$cond": [
                    { "$eq": [ "$actor_gender", "female" ] },
                    1,
                    0
                ]
            }
        },
        "unknown_like_count": { 
            "$sum": {
                "$cond": [
                    { "$eq": [ "$actor_gender", "unknown" ] },
                    1,
                    0
                ]
            }
        }
      }},
      { "$sort": {
        "_id.date.year": 1,
        "_id.date.month": 1,
        "_id.date.day": 1,
        "_id.article_id": 1,
        "_id.state": 1,
        "male_like_count": 1,
        "female_like_count": 1
      }}
   ]
)

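First, you basically want $match, which is how you supply "query" conditions to an aggregation pipeline. It can appear at any pipeline stage, but when used first it filters the input that the following operations consider. In this case it takes the date range and the country, and it removes anything that is not a "like", since you are not interested in those counts.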
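The question also lists article_id as a query input, which the pipeline above does not filter on. A minimal sketch of building the $match stage from all of the inputs might look like this. The helper name buildMatchStage and its parameter names are hypothetical, not part of any driver API:

```javascript
// Hypothetical helper: builds the $match stage from the query inputs.
// The parameter names (articleId, country, from, to, action) are assumptions
// made for illustration; the field names come from the Tracking model.
function buildMatchStage({ articleId, country, from, to, action = "like" }) {
  return {
    $match: {
      article_id: articleId,
      country: country,
      action: action,
      // Half-open range: includes `from`, excludes `to`.
      created_at: { $gte: from, $lt: to }
    }
  };
}

const matchStage = buildMatchStage({
  articleId: "juhkwu2367",
  country: "US",
  from: new Date("2015-11-20T00:00:00Z"),
  to: new Date("2015-11-22T00:00:00Z")
});
```

The resulting matchStage object can be used as the first element of the array passed to db.trackings.aggregate().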

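Then all items are grouped by the respective "key" in _id. This can be, and here is, used as a compound field, mainly because all of these field values are considered part of the grouping key, and also for a bit of organization.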
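The three counters in the $group stage differ only in the value they compare against, so they could be generated rather than written out. This is only a sketch: countWhen is a hypothetical helper, and it assumes the gender is stored in the actor_gender field as declared in the Tracking model:

```javascript
// Hypothetical helper: returns a $group accumulator that adds 1 for every
// document whose `field` equals `value`, and 0 otherwise.
function countWhen(field, value) {
  return { $sum: { $cond: [{ $eq: ["$" + field, value] }, 1, 0] } };
}

const groupStage = {
  $group: {
    // Compound grouping key: calendar day, article and state.
    _id: {
      date: {
        month: { $month: "$created_at" },
        day: { $dayOfMonth: "$created_at" },
        year: { $year: "$created_at" }
      },
      article_id: "$article_id",
      state: "$state"
    },
    male_like_count: countWhen("actor_gender", "male"),
    female_like_count: countWhen("actor_gender", "female"),
    unknown_like_count: countWhen("actor_gender", "unknown")
  }
};
```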

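You also seem to be asking for "distinct fields" in the output outside of _id itself. Do not do that. The data is already there, so there is no need to copy it. You can produce the same thing outside of _id with $first as an aggregation operator, or you could even use a $project stage at the end of the pipeline to rename the fields. But it is really best to lose the habit of thinking you need this, as it only costs time and/or space in getting the response.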
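If the flat shape is still wanted, the $project option mentioned above could look roughly like the following. This is a sketch that assumes the output field names of the $group stage shown earlier (including unknown_like_count):

```javascript
// Sketch of a final $project stage that lifts the grouping-key parts out of
// _id. Assumes the counter names produced by the $group stage shown earlier.
const projectStage = {
  $project: {
    _id: 0,                          // drop the compound key itself
    article_id: "$_id.article_id",   // copy key parts to top-level fields
    state: "$_id.state",
    date: "$_id.date",
    male_like_count: 1,              // keep the counters unchanged
    female_like_count: 1,
    unknown_like_count: 1
  }
};
```

If this stage is added, it should come after $sort, or the $sort keys need to use the renamed top-level fields instead of the _id.* paths.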

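If anything, you seem to be after a "pretty date" more than anything else. I personally prefer working with "date math" for most manipulation, so an altered listing suited to Mongoid would be: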

Tracking.collection.aggregate([
    { "$match" => {
        "created_at" => { "$gte" => startDate, "$lt" => endDate },
        "country" => "US",
        "action" => "like"
    }},
    { "$group" => {
        "_id" => { 
            "date" => {
                "$add" => [
                    { "$subtract" => [
                        { "$subtract" => [ "$created_at", Time.at(0).utc.to_datetime ] },
                        { "$mod" => [
                            { "$subtract" => [ "$created_at", Time.at(0).utc.to_datetime ] },
                            1000 * 60 * 60 * 24
                        ]}
                    ]},
                    Time.at(0).utc.to_datetime
                ]
            },
            "article_id" =>  "$article_id", 
            "state" => "$state"
        },
        "male_like_count" => { 
            "$sum" => {
                "$cond" => [
                    { "$eq" => [ "$actor_gender", "male" ] },
                    1,
                    0
                ]
            }
        },
        "female_like_count" => { 
            "$sum" => {
                "$cond" => [
                    { "$eq" => [ "$actor_gender", "female" ] },
                    1,
                    0
                ]
            }
        },
        "unknown_like_count" => { 
            "$sum" => {
                "$cond" => [
                    { "$eq" => [ "$actor_gender", "unknown" ] },
                    1,
                    0
                ]
            }
        }
      }},
      { "$sort" => {
        "_id.date" => 1,
        "_id.article_id" => 1,
        "_id.state" => 1,
        "male_like_count" => 1,
        "female_like_count" => 1
      }}
])

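This really just comes down to obtaining a DateTime object, suitable for use as a driver argument, that corresponds to the epoch date, and then applying the various operations. Working with the numeric timestamp values, $subtract with one BSON Date and another produces a numeric value that can subsequently be rounded down to the current day using the applied math. Then of course, when using $add with a numeric timestamp value and a BSON Date (again representing epoch), the result is again a BSON Date object, this time with the adjusted and rounded value.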
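The same arithmetic can be followed step by step in plain JavaScript. This is only an illustration of what the pipeline expression computes on the server; roundToDay is a hypothetical name, and it assumes dates on or after the epoch and day boundaries in UTC:

```javascript
const MS_PER_DAY = 1000 * 60 * 60 * 24;

// Rounds a date down to the start of its UTC day, mirroring the
// $subtract / $mod / $add steps of the pipeline expression.
// Assumes dates on or after the epoch.
function roundToDay(date) {
  const epoch = new Date(0);
  const elapsed = date.getTime() - epoch.getTime();    // $subtract: date - epoch
  const startOfDay = elapsed - (elapsed % MS_PER_DAY); // $subtract with $mod
  return new Date(epoch.getTime() + startOfDay);       // $add: epoch + number
}

const rounded = roundToDay(new Date("2015-11-20T15:30:45Z"));
console.log(rounded.toISOString()); // 2015-11-20T00:00:00.000Z
```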

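Then it is all just a matter of applying $sort, again as an aggregation pipeline stage rather than as an external modifier. Much like the $match principle, an aggregation pipeline can sort anywhere, but at the end it always deals with the final result.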
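For the sort_by input from the question, one possible approach is to map each allowed option onto the field path it refers to in the grouped output and build a single-key $sort stage from it. The helper below is a sketch: buildSortStage and SORT_KEYS are hypothetical names, and the date path assumes the "date math" version of the pipeline, where _id.date is a single value:

```javascript
// Allowed sort_by values mapped to field paths in the grouped output.
// "_id.date" assumes the date-math pipeline, where the date is one value.
const SORT_KEYS = {
  date: "_id.date",
  state: "_id.state",
  male_like_count: "male_like_count",
  female_like_count: "female_like_count"
};

// Hypothetical helper: direction is 1 (ascending) or -1 (descending).
function buildSortStage(sortBy, direction = 1) {
  if (!Object.prototype.hasOwnProperty.call(SORT_KEYS, sortBy)) {
    throw new Error("Unsupported sort_by: " + sortBy);
  }
  return { $sort: { [SORT_KEYS[sortBy]]: direction } };
}

const sortStage = buildSortStage("male_like_count", -1);
```

Rejecting unknown options keeps user input from being passed straight into the pipeline as a field path.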