Version: Next

Discover Granules

This task uses the Cumulus Message Adapter to interpret incoming messages and construct outgoing messages.

Links to the npm package, task input, output and configuration schema definitions, and more can be found on the auto-generated Cumulus Tasks page.

Summary

The purpose of this task is to facilitate ingest of data that does not conform to a PDR/SIPS discovery mechanism, a CNM workflow, or direct injection of workflow-triggering events into Cumulus core components.

The task uses a defined collection together with a defined provider to scan a location for files matching the collection configuration, assemble those files into groupings by granule, and pass the constructed granules object as output.
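The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the task's actual implementation: the `granuleIdExtraction` and `files[].regex` key names, the sample regular expressions, and the helper `group_files_by_granule` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical collection fragment. Key names and regexes are assumptions
# chosen for illustration; consult the collection schema for the real shape.
collection = {
    "granuleIdExtraction": r"^(MOD09GQ\.A\d+\.\w+)\..*$",
    "files": [
        {"regex": r"^MOD09GQ\.A\d+\.\w+\.hdf$", "type": "data"},
        {"regex": r"^MOD09GQ\.A\d+\.\w+\.hdf\.met$", "type": "metadata"},
    ],
}


def group_files_by_granule(filenames, collection):
    """Group discovered file names into granule objects (illustrative only)."""
    id_pattern = re.compile(collection["granuleIdExtraction"])
    grouped = defaultdict(list)
    for name in filenames:
        # Keep only files that match one of the collection's file regexes.
        file_config = next(
            (f for f in collection["files"] if re.search(f["regex"], name)),
            None,
        )
        if file_config is None:
            continue
        # The first capture group is treated as the granule ID.
        id_match = id_pattern.match(name)
        if id_match is None:
            continue
        grouped[id_match.group(1)].append(
            {"name": name, "type": file_config["type"]}
        )
    return [{"granuleId": gid, "files": files} for gid, files in grouped.items()]


discovered = [
    "MOD09GQ.A2017025.h21v00.hdf",
    "MOD09GQ.A2017025.h21v00.hdf.met",
    "MOD09GQ.A2017026.h21v00.hdf",
    "README.txt",
]
granules = group_files_by_granule(discovered, collection)
```

In this sketch, `README.txt` matches no file regex and is dropped, while the two files sharing an extracted ID end up in the same granule.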

The constructed granules object is defined by the collection passed in the configuration, and its shape affects other provided core Cumulus tasks.

Users of this task in a workflow are encouraged to carefully consider their configuration in context of downstream tasks and workflows.

Task Inputs

Each of the following sections is a high-level discussion of the intent of the various input, output, and config values.

For the most recent config.json schema, please see the Cumulus Tasks page entry for the schema.

Input

This task does not expect an incoming payload.

Cumulus Configuration

This task expects values to be set in the task_config CMA parameters for the workflows. A schema defines the requirements for the task.

For the most recent config.json schema, please see the Cumulus Tasks page entry for the schema.
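As a rough orientation, a task_config for this step might look like the sketch below, written as a Python dictionary and rendered to JSON. Only the key names discussed on this page are shown. The template values for `provider`, `buckets`, and `collection` are assumptions modeled on the `{$.meta.collection.duplicateHandling}` template that appears in this page, and the schema may require further keys not shown here.

```python
import json

# Illustrative task_config fragment. The "{$.meta...}" values for provider,
# buckets, and collection are assumed, not taken from the schema.
task_config = {
    "provider": "{$.meta.provider}",
    "buckets": "{$.meta.buckets}",
    "collection": "{$.meta.collection}",
    "duplicateGranuleHandling": "{$.meta.collection.duplicateHandling}",
    "ignoreFilesConfigForDiscovery": False,
    "concurrency": 3,
}

rendered = json.dumps({"task_config": task_config}, indent=2)
print(rendered)
```

Always check the fragment against the current config.json schema before using it in a workflow definition.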

Below are expanded descriptions of selected config keys:

Provider

A Cumulus provider object. Used to define connection information for a location to scan for granule discovery.

Buckets

A list of buckets with types that will be used to assign bucket targets based on the collection configuration.
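One way to picture the bucket assignment is the sketch below. It assumes that each file entry in the collection names a bucket key and that the buckets configuration maps each key to a bucket name and type. The data shapes, bucket names, and the helper `resolve_bucket` are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical buckets configuration: key -> bucket name and type.
buckets = {
    "protected": {"name": "example-protected-bucket", "type": "protected"},
    "private": {"name": "example-private-bucket", "type": "private"},
}

# Hypothetical collection file entries, each naming a bucket key.
file_configs = [
    {"regex": r"\.hdf$", "bucket": "protected"},
    {"regex": r"\.hdf\.met$", "bucket": "private"},
]


def resolve_bucket(filename, file_configs, buckets):
    """Return the target bucket name for a file, or None if nothing matches."""
    for file_config in file_configs:
        if re.search(file_config["regex"], filename):
            return buckets[file_config["bucket"]]["name"]
    return None


data_bucket = resolve_bucket("granule.hdf", file_configs, buckets)
metadata_bucket = resolve_bucket("granule.hdf.met", file_configs, buckets)
unmatched = resolve_bucket("notes.txt", file_configs, buckets)
```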

Collection

A Cumulus collection object. Used to define granule file groupings and granule metadata for discovered files. The collection object utilizes the collection type key to generate types in the output object on discovery.

DuplicateGranuleHandling

A string that configures how the step filters the discovered granules:

skip: Duplicates will be filtered from the granules object
error: Any duplicate encountered will result in the step throwing an error
replace, version: Duplicates will be included in the granules object

The possible values match those of collection.duplicateHandling. To make the task use the collection's setting, configure this value as: "duplicateGranuleHandling": "{$.meta.collection.duplicateHandling}".
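The effect of each value can be sketched as a small filter. This is an illustration under stated assumptions rather than the task's code: the function name `filter_duplicates`, the use of plain granule ID strings, and the `existing_ids` set standing in for whatever lookup the task performs are all hypothetical.

```python
def filter_duplicates(granule_ids, existing_ids, duplicate_handling):
    """Apply a duplicateGranuleHandling value to discovered granule IDs.

    existing_ids stands in for the set of granules already known to Cumulus.
    """
    if duplicate_handling in ("replace", "version"):
        # Duplicates stay in the output; later steps decide what to do.
        return list(granule_ids)
    if duplicate_handling == "skip":
        # Duplicates are silently dropped from the output.
        return [gid for gid in granule_ids if gid not in existing_ids]
    if duplicate_handling == "error":
        duplicates = [gid for gid in granule_ids if gid in existing_ids]
        if duplicates:
            raise ValueError(f"Duplicate granules found: {duplicates}")
        return list(granule_ids)
    raise ValueError(f"Unknown duplicate handling: {duplicate_handling}")


discovered_ids = ["granule-a", "granule-b", "granule-c"]
already_ingested = {"granule-b"}

skipped = filter_duplicates(discovered_ids, already_ingested, "skip")
replaced = filter_duplicates(discovered_ids, already_ingested, "replace")
```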

Ignore Files Configuration (`ignoreFilesConfigForDiscovery`)

The boolean property ignoreFilesConfigForDiscovery indicates whether or not to ignore the files configuration for a collection during granule discovery.

By default, this property is false: during discovery, a collection's files configuration is used to select which files to include in a granule's file list. Only files whose names match one of the regular expressions specified in the collection's files configuration are added to the granule's file list.

This property supports cases where such file filtering is not desired during the discovery phase. When the property is set to true, a collection's files configuration is ignored and all files for a granule are included in the granule's file list, with no filename-based filtering.

When set on the task configuration, the value applies to all collections during discovery. Otherwise, this property may be set on individual collections.
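A minimal sketch of this behavior follows. It assumes that a value present on the task configuration takes precedence over a value on the collection, which is one reading of the paragraph above. The helper names and data shapes are hypothetical, and grouping files by granule ID is left out.

```python
import re


def should_ignore_files_config(task_config, collection):
    """Resolve the flag: task configuration first, then the collection."""
    if "ignoreFilesConfigForDiscovery" in task_config:
        return bool(task_config["ignoreFilesConfigForDiscovery"])
    return bool(collection.get("ignoreFilesConfigForDiscovery", False))


def select_files(filenames, task_config, collection):
    """Return the file names that would be kept during discovery."""
    if should_ignore_files_config(task_config, collection):
        # No filename-based filtering: every discovered file is kept.
        return list(filenames)
    patterns = [re.compile(f["regex"]) for f in collection.get("files", [])]
    return [n for n in filenames if any(p.search(n) for p in patterns)]


example_collection = {"files": [{"regex": r"\.hdf$"}]}
example_files = ["granule.hdf", "granule.jpg"]

filtered = select_files(example_files, {}, example_collection)
unfiltered = select_files(
    example_files, {"ignoreFilesConfigForDiscovery": True}, example_collection
)
```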

Concurrency

A number property that determines the level of concurrency with which granule duplicate checks are performed when duplicateGranuleHandling is skip or error.

Limiting concurrency helps to avoid throttling by the AWS Lambda API and helps to avoid encountering account Lambda concurrency limitations.

We do not recommend increasing this value unless you are seeing Lambda.Timeout errors when discover-granules discovers a large number of granules with skip or error duplicate handling. However, as increasing the concurrency may lead to Lambda API or Lambda concurrency throttling errors, you may wish to consider converting the discover-granules task to an ECS activity, which does not face similar runtime constraints.

The default value is 3.
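To illustrate what a concurrency limit of this kind does, the sketch below bounds the number of duplicate checks in flight with a semaphore. It is not the task's implementation: `check_duplicates` and `fake_lookup` are hypothetical names, and the short sleep stands in for a remote lookup.

```python
import asyncio


async def check_duplicates(granule_ids, is_duplicate, concurrency=3):
    """Run is_duplicate for each ID with at most `concurrency` checks at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def check_one(granule_id):
        async with semaphore:
            return granule_id, await is_duplicate(granule_id)

    pairs = await asyncio.gather(*(check_one(g) for g in granule_ids))
    return dict(pairs)


async def demo():
    in_flight = 0
    peak = 0

    async def fake_lookup(granule_id):
        # Stand-in for a remote duplicate check; records peak parallelism.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return granule_id.endswith("1")

    ids = [f"g{i}" for i in range(10)]
    results = await check_duplicates(ids, fake_lookup, concurrency=3)
    return results, peak


results, peak = asyncio.run(demo())
```

A higher limit finishes the checks sooner but issues more simultaneous requests, which is the trade-off the paragraphs above describe.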

Task Outputs

This task outputs an assembled array of Cumulus granule objects as the payload for the next task. It returns only that expected payload.
