Categories
MongoDB

The MongoDB Aggregation Framework (M121)

Chapter 1: Introduction and Aggregation Concepts

  • The concept of Pipelines:
    • Composition of stages
    • Stages are configurable to produce desired transformations
    • Documents flow like parts in an assembly line
    • Stages can be arranged in multiple ways
  • Aggregation Structure and Syntax
    • Pipeline are always an array of one or more stages
    • Each stage is a JSON object of key-value pairs, composed of either operators or expressions
      • Operator: aggregation operators & query operators
      • Expression: like functions, given arguments and generate a computed output
    • Options may be passed in

Chapter 1: Basic Aggregation – $match & $project

  • $match
    • Filter documents
    • Uses standard MongoDB query operators
    • Limitations:
  • $project
    • Shape documents
    • Functions:
      • Selectively remove and retain fields
        • Once specify one field to retain, must specify all fields need to retain
        • _id field is by default retained
      • Reassign existing field values and derive entirely new fields
    • When selecting on subfields, must surround arguments with quotes
Lab
// $match -> Title is a string
  {
    $match: {
      title: {
        $type: "string"
      }
    }
  }

// $project -> split the title on spaces
{
  $project: {
    titleArray: { $split: ["$title", " "] },
    _id: 0
  }
}

// Size of the array 
  {
    $match: {
      titleArray: { $size: 1 }
    }
  }


// Array is not empty
{ $match: { writers: { $elemMatch: { $exists: true } } }

Chapter 2: Basic Aggregation – Utility Stages

  • $addFields
    • Similar to project
    • Functions:
      • New computed fields
      • Modify existing fields
  • $geoNear (?)
    • Perform geoqueries within the pipeline
    • Must be the first stage in a pipeline
    • $geoNear can be used in sharded collections while $near can’t
    • The collection can have one and only one 2dsphere index
    • If using 2dsphere, the distance is returned in meters
  • Cursor-like Stages
    • $sort can take advantage of indexes if used early within a pipeline
    • By default, $sort will only use up to 100 megabytes of RAM, Setting allowDiskUse: true will allow for larger sorts
  • $sample
    • Select a set of random documents in a collection
    • {$sample: {size: <N, how many documents>}}
      • Case 1:
        • N <= 5% of the number of documents AND source collection has >= 100 documents AND $sample is the first stage
        • Pseudo-random cursor
      • Case 2:
        • Other conditions
        • In-memory random sort and select specific number of documents
Lab
// the 25th
  {
    $skip: 24
  },
  {
    $limit: 1
  }
  • $group
    • Group documents according to criteria
      • { $group: { _id: <matching/grouping criteria> }}
    • Accumulator expressions will ignore documents with a value at the specified field that isn’t of the type the expression expects or if the value is missing, will return null
    • _id is where to specify what incoming documents should be grouped on
    • Can use all accumulator expressions within $group
    • $group can be used multiple times within a pipeline
    • It may be necessary to sanitise incoming data
{
  $group:
    {
      _id: <expression>, // Group By Expression
      <field1>: { <accumulator1> : <expression1> },
      ...
    }
 }
  • Accumulator Expressions with $project
    • The accumulator expressions within $project operate over an array in the current document, they do not carry values over all documents
    • Available Accumulator Expressions:
      • $sum, $avg, $max, $min, $stdDevPop, $stdDevSam
    • Expressions have no memory between documents
    • May still have to use $reducer or $map for more complex calculations
    • group all documents together -> _id: null
    • $match: { awards: /Won \d{1,2} Oscars?/ }
  • $unwind
    • Deconstructs an array field from the input documents to output a document for each element
    • $unwind only works on array values
    • There are two forms for unwind
    • Using unwind on large collections with big documents may lead to performance issues
  • $lookup
    • Collection in from field cannot be sharded and must exist within the same database
    • $lookup forms a strict equality comparison
    • Often after a lookup, we want to follow it with a match stage to filter documents out
    • Lookup retrieves the entire document that matched
    • as if already exists in the working document that field will be overwritten
{
   $lookup:
     {
       from: <collection to join>,
       localField: <field from the input documents>,
       foreignField: <field from the documents of the "from" collection>,
       as: <output array field>
     }
}
  • graphLookup (!)
    • MongoDB -> general purpose database to support operational and analytical use cases
    • Recursive common table expressions in SQL
    • Transitive closure
    • graphLookup allows looking up recursively a set of documents with a defined relationship to a starting document
    • Limitation:
      • Memory allocation -> $allowDiskUse
      • Indexes -> connectToField
      • No Sharding

Chapter 4: Core Aggregation – Combining Information

  • Facet navigation to enable creating an interface that characterizes query results across multiple dimensions or facets
  • Faceting is a popular analytics capability that allow users to explore data by applying multiple filters and characterizations
  • $sortByCount
  • Buckets
    • Group data by range of values
    • { $bucket: {groupBy: “$x”, boundaries: [0, 50, 100]}}
      • Range includes the lower bound, excludes the upper bound
      • Boundaries should be the same data type
      • Add default bucket to group documents that not fall into ranges
    • Auto Buckets
      • Instead of boundaries, specify the number of buckets
      • Granularity ensures the boundaries of all buckets adhere to a specified preferred number series (?)
{
  $bucketAuto: {
      groupBy: <expression>,
      buckets: <number>,
      output: {
         <output1>: { <$accumulator expression> },
         ...
      }
      granularity: <string>
  }
}
  • Multiple Facets
    • “Sub-pipeline”
{ $facet:
   {
      <outputField1>: [ <stage1>, <stage2>, ... ],
      <outputField2>: [ <stage1>, <stage2>, ... ],
      ...

   }
}

Chapter 5: Miscellaneous Aggregation

  • $redact
    • Restricts the contents of the documents based on information stored in the documents themselves
    • For implementing access control
    • $$KEEP and $$PRUNE automatically apply to all levels below the evaluated level
    • $$DESCEND retains the current level and evaluates the next level down
    • $redact is not for restricting access to a collection
  • $out
    • To persist a result of aggregation
    • Can’t be used within a facet
    • Must honor existing indexes if specifying an existing collection
{ $out: "<output-collection>" } // Output collection is in the same database
  • $merge
    • Prior to MongoDB 4.2 -> $out
      • must be unsharded
      • overwrites existing
    • $merge collection
      • can exist
      • same or different “db”
      • can be sharded
{ $merge: {
     into: <collection> -or- { db: <db>, coll: <collection> },
     on: <identifier field> -or- [ <identifier field1>, ...],  // Optional
     let: <variables>,                                         // Optional
     whenMatched: <replace|keepExisting|merge|fail|pipeline>,  // Optional
     whenNotMatched: <insert|discard|fail>                     // Optional
} }
  • View
    • Vertical slice
      • $project
      • The number of documents wont change
    • Horizontal slice
      • $match
      • Modify shape of individual document
    • How to create a View?
      • db.createView()
      • db.createCollection()
    • Views contain no data themselves, they are stored aggregations that run when queried
    • Read-only

Chapter 6: Aggregation Performance

  • “Realtime” Processing
    • Provide data for applications
    • Query performance is more important
  • Batch Processing
    • Provide data for analytics
    • Query performance is less important
  • Index Usage
    • Some stages can make use of index, put them as front as possible
  • Memory Constraints
    • Results are subject to 16MB document limit
      • Use $limit and $project
    • 100MB of RAM per stage
      • use indexes
      • allowDiskUse, last option, do not work with $graphLookup
  • Aggregation Pipeline on a sharded cluster
    • Primary shard will do the merge results:
      • $out
      • $facet
      • $lookup
      • $graphLookup
    • Optimization
      • $match before $sort
      • $limit before $skip
  • Pipeline Optimization
    • Avoid unnecessary stages, for example $project, the Aggregation Framework can project fields automatically if final shape of the output document can be determined from initial input
    • Use accumulator expressions, $map, $reduce, $ filter in project before an $unwind
    • Every high order array function can be implemented with $reduce if the provided expressions do not meet needs
// Exam
{
  $match: {
    src_airport: { $in: ["LHR", "JFK"] },
    dst_airport: { $in: ["LHR", "JFK"] }
  }
},

Leave a comment