Cumulus Data Management Types
What Are The Cumulus Data Management Types
- Collections: Collections are logical sets of data objects of the same data type and version. They provide contextual information used by Cumulus ingest.
- Granules: Granules are the smallest aggregation of data that can be independently managed. They are always associated with a collection, which is a grouping of granules.
- Providers: Providers generate and distribute input data that Cumulus obtains and sends to workflows.
- Rules: Rules tell Cumulus how to associate providers and collections and when/how to start processing a workflow.
- Workflows: Workflows are composed of one or more AWS Lambda Functions and ECS Activities to discover, ingest, process, manage, and archive data.
- Executions: Executions are records of a workflow.
- Reconciliation Reports: Reports are a comparison of data sets to check to see if they are in agreement and to help Cumulus users detect conflicts.
Interaction
- Providers tell Cumulus where to get new data (e.g. S3 or HTTPS)
- Collections tell Cumulus where to store the data files
- Rules tell Cumulus when to trigger a workflow execution and tie providers and collections together
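The relationships in the list above can be pictured as a small data model. The sketch below is illustrative only: the class names, field names, and sample values are hypothetical and do not reproduce the Cumulus schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    id: str
    protocol: str  # where to get new data, e.g. "s3" or "https"
    host: str


@dataclass(frozen=True)
class Collection:
    name: str
    version: str


@dataclass(frozen=True)
class Rule:
    name: str
    provider_id: str      # which provider to pull from
    collection_name: str  # which collection the data belongs to
    workflow: str         # which workflow to start
    trigger: str          # when to start it (hypothetical label)


def describe(rule, providers, collections):
    """Resolve a rule's references and summarize what it would trigger."""
    provider = providers[rule.provider_id]
    collection = collections[rule.collection_name]
    return (
        f"{rule.trigger}: run {rule.workflow} on "
        f"{provider.protocol}://{provider.host} for "
        f"{collection.name} v{collection.version}"
    )


# Placeholder records, keyed the way the rule refers to them.
providers = {"example-provider": Provider("example-provider", "https", "data.example.com")}
collections = {"EXAMPLE": Collection("EXAMPLE", "001")}
rule = Rule("example-rule", "example-provider", "EXAMPLE", "ExampleWorkflow", "onetime")

print(describe(rule, providers, collections))
```

The point of the sketch is that a rule carries only references: the connection details stay on the provider and the data-specific details stay on the collection.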
Managing Data Management Types
The following are created via the dashboard or API:
- Providers
- Collections
- Rules
- Reconciliation reports
Granules are created by workflow executions and then can be managed via the dashboard or API.
An execution record is created for each triggered workflow execution. It can be viewed in the dashboard or retrieved via the API.
Workflows are created and managed via the Cumulus deployment.
Configuration Fields
Schemas
Looking at our API schema definitions can provide us with some insight into collections, providers, rules, and their attributes (and whether those are required or not). The schemas for the different concepts will be referenced throughout this document.
The schemas are extremely useful for understanding which attributes are configurable and which of those are required. Cumulus uses these schemas for validation.
Providers
- Provider schema (module.exports.provider)
- Provider API
- Sample provider configurations
- While connection configuration is defined here, settings that are specific to a particular ingest setup (e.g. 'What target directory should we be pulling from?' or 'How is duplicate handling configured?') are generally defined in a Rule or Collection, not the Provider.
- Some provider behavior is controlled by task-specific configuration rather than by the provider definition. This configuration has to be set on a per-workflow basis. For example, see the httpListTimeout configuration on the discover-granules task.
Provider Configuration
The Provider configuration is defined by a JSON object that takes different configuration keys depending on the provider type. The following are definitions of typical configuration values relevant for the various providers:
Configuration by provider type
S3
| Key | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier for the provider |
| globalConnectionLimit | integer | No | Integer specifying the connection limit for the provider. This is the maximum number of connections Cumulus-compatible ingest Lambdas are expected to make to a provider. Defaults to unlimited. |
| protocol | string | Yes | The protocol for this provider. Must be s3 for this provider type. |
| host | string | Yes | S3 Bucket to pull data from |
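As a rough illustration of the S3 table above, the following sketch builds a provider definition and checks it for the keys marked as required. The provider id, the bucket name, and the helper `missing_required_keys` are made up for this example; Cumulus performs its own validation against the provider schema.

```python
import json

# Keys the S3 table marks as required.
S3_REQUIRED_KEYS = ("id", "protocol", "host")

# Hypothetical provider definition; the id and bucket name are placeholders.
s3_provider = {
    "id": "example-s3-provider",
    "protocol": "s3",
    "host": "example-source-bucket",
    "globalConnectionLimit": 10,  # optional; leave out for no limit
}


def missing_required_keys(provider, required=S3_REQUIRED_KEYS):
    """Return the required keys that are absent from a provider definition."""
    return [key for key in required if key not in provider]


# The JSON form is what would be submitted through the dashboard or API.
print(json.dumps(s3_provider, indent=2))
print(missing_required_keys(s3_provider))
```

A check like this is only a convenience before submitting a definition; it says nothing about whether the bucket exists or is reachable.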
http
| Key | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier for the provider |
| globalConnectionLimit | integer | No | Integer specifying the connection limit for the provider. This is the maximum number of connections Cumulus-compatible ingest Lambdas are expected to make to a provider. Defaults to unlimited. |
| protocol | string | Yes | The protocol for this provider. Must be http for this provider type |
| host | string | Yes | The host to pull data from (e.g. nasa.gov) |
| username | string | No | Configured username for basic authentication. Cumulus encrypts this using KMS and uses it in a Basic auth header if needed for authentication |
| password | string | Only if username is specified | Configured password for basic authentication. Cumulus encrypts this using KMS and uses it in a Basic auth header if needed for authentication |
| port | integer | No | Port to connect to the provider on. Defaults to 80 |
| allowedRedirects | string[] | No | Only hosts in this list will have the provider username/password forwarded for authentication. Entries should be specified as host.com or host.com:7000 if redirect port is different than the provider port. |
| certificateUri | string | No | SSL Certificate S3 URI for custom or self-signed SSL (TLS) certificate |
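Two rules in the http table are easy to overlook: the port falls back to 80 when omitted, and a password is required only when a username is given. The sketch below encodes both. The provider values and the helper names `effective_port` and `credential_problems` are hypothetical, and the checks are not a substitute for the provider schema.

```python
DEFAULT_HTTP_PORT = 80  # default stated in the http table

# Hypothetical provider definition; id, host, and credentials are placeholders.
http_provider = {
    "id": "example-http-provider",
    "protocol": "http",
    "host": "data.example.com",
    "username": "ingest-user",
    "password": "change-me",
}


def effective_port(provider):
    """Return the configured port, or the http default when none is set."""
    return provider.get("port", DEFAULT_HTTP_PORT)


def credential_problems(provider):
    """List problems with the basic-auth fields of a provider definition."""
    problems = []
    if "username" in provider and "password" not in provider:
        problems.append("password is required when username is specified")
    return problems


print(effective_port(http_provider))
print(credential_problems(http_provider))
```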
https
| Key | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier for the provider |
| globalConnectionLimit | integer | No | Integer specifying the connection limit for the provider. This is the maximum number of connections Cumulus-compatible ingest Lambdas are expected to make to a provider. Defaults to unlimited. |
| protocol | string | Yes | The protocol for this provider. Must be https for this provider type |
| host | string | Yes | The host to pull data from (e.g. nasa.gov) |
| username | string | No | Configured username for basic authentication. Cumulus encrypts this using KMS and uses it in a Basic auth header if needed for authentication |
| password | string | Only if username is specified | Configured password for basic authentication. Cumulus encrypts this using KMS and uses it in a Basic auth header if needed for authentication |
| port | integer | No | Port to connect to the provider on. Defaults to 443 |
| allowedRedirects | string[] | No | Only hosts in this list will have the provider username/password forwarded for authentication. Entries should be specified as host.com or host.com:7000 if redirect port is different than the provider port. |
| certificateUri | string | No | SSL Certificate S3 URI for custom or self-signed SSL (TLS) certificate |
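The allowedRedirects entry format (host.com, or host.com:7000 when the redirect port differs from the provider port) can be illustrated with a small matching function. This is one possible reading of the table description, not the Cumulus implementation: the function name `forwards_credentials`, the sample hosts, and the assumption that every redirect target is an https URL are all hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_HTTPS_PORT = 443  # default stated in the https table

# Hypothetical provider definition; all hosts are placeholders.
https_provider = {
    "id": "example-https-provider",
    "protocol": "https",
    "host": "data.example.com",
    "allowedRedirects": ["auth.example.com", "sso.example.com:7000"],
}


def forwards_credentials(redirect_url, provider):
    """Return True if the redirect target matches an allowedRedirects entry.

    Assumes an https redirect URL. The entry is compared as "host" when the
    target port equals the provider port, and as "host:port" otherwise.
    """
    provider_port = provider.get("port", DEFAULT_HTTPS_PORT)
    target = urlsplit(redirect_url)
    target_port = target.port or DEFAULT_HTTPS_PORT
    if target_port == provider_port:
        entry = target.hostname
    else:
        entry = f"{target.hostname}:{target_port}"
    return entry in provider.get("allowedRedirects", [])


for url in (
    "https://auth.example.com/login",
    "https://sso.example.com:7000/login",
    "https://other.example.net/login",
):
    print(url, forwards_credentials(url, https_provider))
```

Under this reading, a host listed only with an explicit port does not match a redirect to the same host on the provider port, which keeps the allow-list strict.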