The MongoDB Aggregation Framework (M121)

Chapter 1: Introduction and Aggregation Concepts

The concept of Pipelines:
- Composition of stages
- Stages are configurable to produce desired transformations
- Documents flow like parts in an assembly line
- Stages can be arranged in multiple ways
Aggregation Structure and Syntax
- Pipeline are always an array of one or more stages
- Each stage is a JSON object of key-value pairs, composed of either operators or expressions
  - Operator: aggregation operators & query operators
  - Expression: like functions, given arguments and generate a computed output
- Options may be passed in

Chapter 1: Basic Aggregation – $match & $project

$match
- Filter documents
- Uses standard MongoDB query operators
- Limitations:
  - Can’t use the $where operator
  - Must be the first stage if want to use a $test operator
  - $match does not allow for projection
  - The $match query syntax is identical to the read operation query syntax; i.e. $match does not accept raw aggregation expressions. (? )
$project
- Shape documents
- Functions:
  - Selectively remove and retain fields
    - Once specify one field to retain, must specify all fields need to retain
    - _id field is by default retained
  - Reassign existing field values and derive entirely new fields
- When selecting on subfields, must surround arguments with quotes

Lab
// $match -> Title is a string
  {
    $match: {
      title: {
        $type: "string"
      }
    }
  }

// $project -> split the title on spaces
{
  $project: {
    titleArray: { $split: ["$title", " "] },
    _id: 0
  }
}

// Size of the array 
  {
    $match: {
      titleArray: { $size: 1 }
    }
  }


// Array is not empty
{ $match: { writers: { $elemMatch: { $exists: true } } }

Chapter 2: Basic Aggregation – Utility Stages

$addFields
- Similar to project
- Functions:
  - New computed fields
  - Modify existing fields
$geoNear (?)
- Perform geoqueries within the pipeline
- Must be the first stage in a pipeline
- $geoNear can be used in sharded collections while $near can’t
- The collection can have one and only one 2dsphere index
- If using 2dsphere, the distance is returned in meters
Cursor-like Stages
- $sort can take advantage of indexes if used early within a pipeline
- By default, $sort will only use up to 100 megabytes of RAM, Setting allowDiskUse: true will allow for larger sorts

$sample
- Select a set of random documents in a collection
- {$sample: {size: <N, how many documents>}}
  - Case 1:
    - N <= 5% of the number of documents AND source collection has >= 100 documents AND $sample is the first stage
    - Pseudo-random cursor
  - Case 2:
    - Other conditions
    - In-memory random sort and select specific number of documents

Lab
// the 25th
  {
    $skip: 24
  },
  {
    $limit: 1
  }

$group
- Group documents according to criteria
  - { $group: { _id: <matching/grouping criteria> }}
- Accumulator expressions will ignore documents with a value at the specified field that isn’t of the type the expression expects or if the value is missing, will return null
- _id is where to specify what incoming documents should be grouped on
- Can use all accumulator expressions within $group
- $group can be used multiple times within a pipeline
- It may be necessary to sanitise incoming data

{
  $group:
    {
      _id: <expression>, // Group By Expression
      <field1>: { <accumulator1> : <expression1> },
      ...
    }
 }

Accumulator Expressions with $project
- The accumulator expressions within $project operate over an array in the current document, they do not carry values over all documents
- Available Accumulator Expressions:
  - $sum, $avg, $max, $min, $stdDevPop, $stdDevSam
- Expressions have no memory between documents
- May still have to use $reducer or $map for more complex calculations
- group all documents together -> _id: null
- $match: { awards: /Won \d{1,2} Oscars?/ }
$unwind
- Deconstructs an array field from the input documents to output a document for each element
- $unwind only works on array values
- There are two forms for unwind
- Using unwind on large collections with big documents may lead to performance issues
$lookup
- Collection in from field cannot be sharded and must exist within the same database
- $lookup forms a strict equality comparison
- Often after a lookup, we want to follow it with a match stage to filter documents out
- Lookup retrieves the entire document that matched
- as if already exists in the working document that field will be overwritten

{
   $lookup:
     {
       from: <collection to join>,
       localField: <field from the input documents>,
       foreignField: <field from the documents of the "from" collection>,
       as: <output array field>
     }
}

graphLookup (!)
- MongoDB -> general purpose database to support operational and analytical use cases
- Recursive common table expressions in SQL
- Transitive closure
- graphLookup allows looking up recursively a set of documents with a defined relationship to a starting document
- Limitation:
  - Memory allocation -> $allowDiskUse
  - Indexes -> connectToField
  - No Sharding

Chapter 4: Core Aggregation – Combining Information

Facet navigation to enable creating an interface that characterizes query results across multiple dimensions or facets
Faceting is a popular analytics capability that allow users to explore data by applying multiple filters and characterizations
$sortByCount
Buckets
- Group data by range of values
- { $bucket: {groupBy: “$x”, boundaries: [0, 50, 100]}}
  - Range includes the lower bound, excludes the upper bound
  - Boundaries should be the same data type
  - Add default bucket to group documents that not fall into ranges
- Auto Buckets
  - Instead of boundaries, specify the number of buckets
  - Granularity ensures the boundaries of all buckets adhere to a specified preferred number series (?)

{
  $bucketAuto: {
      groupBy: <expression>,
      buckets: <number>,
      output: {
         <output1>: { <$accumulator expression> },
         ...
      }
      granularity: <string>
  }
}

Multiple Facets
- “Sub-pipeline”

{ $facet:
   {
      <outputField1>: [ <stage1>, <stage2>, ... ],
      <outputField2>: [ <stage1>, <stage2>, ... ],
      ...

   }
}

Chapter 5: Miscellaneous Aggregation

$redact
- Restricts the contents of the documents based on information stored in the documents themselves
- For implementing access control
- $$KEEP and $$PRUNE automatically apply to all levels below the evaluated level
- $$DESCEND retains the current level and evaluates the next level down
- $redact is not for restricting access to a collection
$out
- To persist a result of aggregation
- Can’t be used within a facet
- Must honor existing indexes if specifying an existing collection

{ $out: "<output-collection>" } // Output collection is in the same database

$merge
- Prior to MongoDB 4.2 -> $out
  - must be unsharded
  - overwrites existing
- $merge collection
  - can exist
  - same or different “db”
  - can be sharded

{ $merge: {
     into: <collection> -or- { db: <db>, coll: <collection> },
     on: <identifier field> -or- [ <identifier field1>, ...],  // Optional
     let: <variables>,                                         // Optional
     whenMatched: <replace|keepExisting|merge|fail|pipeline>,  // Optional
     whenNotMatched: <insert|discard|fail>                     // Optional
} }

View
- Vertical slice
  - $project
  - The number of documents wont change
- Horizontal slice
  - $match
  - Modify shape of individual document
- How to create a View?
  - db.createView()
  - db.createCollection()
- Views contain no data themselves, they are stored aggregations that run when queried
- Read-only

Chapter 6: Aggregation Performance

“Realtime” Processing
- Provide data for applications
- Query performance is more important
Batch Processing
- Provide data for analytics
- Query performance is less important
Index Usage
- Some stages can make use of index, put them as front as possible
Memory Constraints
- Results are subject to 16MB document limit
  - Use $limit and $project
- 100MB of RAM per stage
  - use indexes
  - allowDiskUse, last option, do not work with $graphLookup

Aggregation Pipeline on a sharded cluster
- Primary shard will do the merge results:
  - $out
  - $facet
  - $lookup
  - $graphLookup
- Optimization
  - $match before $sort
  - $limit before $skip
Pipeline Optimization
- Avoid unnecessary stages, for example $project, the Aggregation Framework can project fields automatically if final shape of the output document can be determined from initial input
- Use accumulator expressions, $map, $reduce, $ filter in project before an $unwind
- Every high order array function can be implemented with $reduce if the provided expressions do not meet needs

// Exam
{
  $match: {
    src_airport: { $in: ["LHR", "JFK"] },
    dst_airport: { $in: ["LHR", "JFK"] }
  }
},

Chapter 1: Introduction and Aggregation Concepts

Chapter 1: Basic Aggregation – $match & $project

Chapter 2: Basic Aggregation – Utility Stages

Chapter 4: Core Aggregation – Combining Information

Chapter 5: Miscellaneous Aggregation

Chapter 6: Aggregation Performance

Share this:

Related

Leave a comment Cancel reply