
Cumulus Data Management Types

What Are The Cumulus Data Management Types

  • Collections: Collections are logical sets of data objects of the same data type and version. They provide contextual information used by Cumulus ingest.
  • Granules: Granules are the smallest aggregation of data that can be independently managed. They are always associated with a collection, which is a grouping of granules.
  • Providers: Providers generate and distribute input data that Cumulus obtains and sends to workflows.
  • Rules: Rules tell Cumulus how to associate providers and collections and when/how to start processing a workflow.
  • Workflows: Workflows are composed of one or more AWS Lambda Functions and ECS Activities to discover, ingest, process, manage, and archive data.
  • Executions: Executions are records of a single run of a workflow.
  • Reconciliation Reports: Reconciliation reports compare data sets to check whether they are in agreement and help Cumulus users detect conflicts.

Interaction

  • Providers tell Cumulus where to get new data (e.g. S3, HTTPS)
  • Collections tell Cumulus where to store the data files
  • Rules tell Cumulus when to trigger a workflow execution and tie providers and collections together

Managing Data Management Types

The following are created via the dashboard or API:

  • Providers
  • Collections
  • Rules
  • Reconciliation reports

Granules are created by workflow executions and then can be managed via the dashboard or API.

An execution record is created for each triggered workflow execution. It can be viewed in the dashboard, or its data can be retrieved via the API.

Workflows are created and managed via the Cumulus deployment.

Configuration Fields

Schemas

Looking at our API schema definitions can provide us with some insight into collections, providers, rules, and their attributes (and whether those are required or not). The schemas for these concepts are referenced throughout this document.

note

The schemas are extremely useful for understanding which attributes are configurable and which of those are required. Cumulus uses these schemas for validation.

Providers

note
  • While connection configuration is defined here, settings specific to a particular ingest setup (e.g. 'What target directory should we be pulling from?' or 'How is duplicate handling configured?') are generally defined in a Rule or Collection, not the Provider.
  • Some provider behavior is controlled by task-specific configuration rather than the provider definition. This configuration has to be set on a per-workflow basis. For example, see the httpListTimeout configuration on the discover-granules task.

Provider Configuration

The Provider configuration is defined by a JSON object that takes different configuration keys depending on the provider type. The following are definitions of typical configuration values relevant for the various providers:

Configuration by provider type
S3
| Key | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique identifier for the provider |
| globalConnectionLimit | integer | No | Integer specifying the connection limit for the provider. This is the maximum number of connections Cumulus compatible ingest lambdas are expected to make to a provider. Defaults to unlimited |
| maxDownloadTime | integer | No | Maximum download time in seconds for all granule files on a sync granule task. The timeout is used together with globalConnectionLimit to limit concurrent downloads. |
| protocol | string | Yes | The protocol for this provider. Must be s3 for this provider type. |
| host | string | Yes | S3 Bucket to pull data from |
http
| Key | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique identifier for the provider |
| globalConnectionLimit | integer | No | Integer specifying the connection limit for the provider. This is the maximum number of connections Cumulus compatible ingest lambdas are expected to make to a provider. Defaults to unlimited |
| maxDownloadTime | integer | No | Maximum download time in seconds for all granule files on a sync granule task. The timeout is used together with globalConnectionLimit to limit concurrent downloads. |
| protocol | string | Yes | The protocol for this provider. Must be http for this provider type |
| host | string | Yes | The host to pull data from (e.g. nasa.gov) |
| username | string | No | Configured username for basic authentication. Cumulus encrypts this using KMS and uses it in a Basic auth header if needed for authentication |
| password | string | Only if username is specified | Configured password for basic authentication. Cumulus encrypts this using KMS and uses it in a Basic auth header if needed for authentication |
| port | integer | No | Port to connect to the provider on. Defaults to 80 |
| allowedRedirects | string[] | No | Only hosts in this list will have the provider username/password forwarded for authentication. Entries should be specified as host.com or host.com:7000 if redirect port is different than the provider port. |
| certificateUri | string | No | SSL Certificate S3 URI for custom or self-signed SSL (TLS) certificate |
https
| Key | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique identifier for the provider |
| globalConnectionLimit | integer | No | Integer specifying the connection limit for the provider. This is the maximum number of connections Cumulus compatible ingest lambdas are expected to make to a provider. Defaults to unlimited |
| maxDownloadTime | integer | No | Maximum download time in seconds for all granule files on a sync granule task. The timeout is used together with globalConnectionLimit to limit concurrent downloads. |
| protocol | string | Yes | The protocol for this provider. Must be https for this provider type |
| host | string | Yes | The host to pull data from (e.g. nasa.gov) |
| username | string | No | Configured username for basic authentication. Cumulus encrypts this using KMS and uses it in a Basic auth header if needed for authentication |
| password | string | Only if username is specified | Configured password for basic authentication. Cumulus encrypts this using KMS and uses it in a Basic auth header if needed for authentication |
| port | integer | No | Port to connect to the provider on. Defaults to 443 |
| allowedRedirects | string[] | No | Only hosts in this list will have the provider username/password forwarded for authentication. Entries should be specified as host.com or host.com:7000 if redirect port is different than the provider port. |
| certificateUri | string | No | SSL Certificate S3 URI for custom or self-signed SSL (TLS) certificate |
ftp
| Key | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique identifier for the provider |
| globalConnectionLimit | integer | No | Integer specifying the connection limit for the provider. This is the maximum number of connections Cumulus compatible ingest lambdas are expected to make to a provider. Defaults to unlimited |
| maxDownloadTime | integer | No | Maximum download time in seconds for all granule files on a sync granule task. The timeout is used together with globalConnectionLimit to limit concurrent downloads. |
| protocol | string | Yes | The protocol for this provider. Must be ftp for this provider type |
| host | string | Yes | The ftp host to pull data from (e.g. nasa.gov) |
| username | string | No | Username to use to connect to the ftp server. Cumulus encrypts this using KMS. Defaults to anonymous if not defined |
| password | string | No | Password to use to connect to the ftp server. Cumulus encrypts this using KMS. Defaults to password if not defined |
| port | integer | No | Port to connect to the provider on. Defaults to 21 |
sftp
| Key | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique identifier for the provider |
| globalConnectionLimit | integer | No | Integer specifying the connection limit for the provider. This is the maximum number of connections Cumulus compatible ingest lambdas are expected to make to a provider. Defaults to unlimited |
| maxDownloadTime | integer | No | Maximum download time in seconds for all granule files on a sync granule task. The timeout is used together with globalConnectionLimit to limit concurrent downloads. |
| protocol | string | Yes | The protocol for this provider. Must be sftp for this provider type |
| host | string | Yes | The sftp host to pull data from (e.g. nasa.gov) |
| username | string | No | Username to use to connect to the sftp server. |
| password | string | No | Password to use to connect to the sftp server. |
| port | integer | No | Port to connect to the provider on. Defaults to 22 |
| privateKey | string | No | Filename assumed to be in s3://bucketInternal/stackName/crypto |
| cmKeyId | string | No | AWS KMS Customer Master Key arn or alias |
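
As a rough illustration of the provider tables above, the following Python sketch writes out an S3 provider and an HTTPS provider as dictionaries and checks the keys marked as required. The provider IDs, bucket name, credentials, and the `missing_required_keys` helper are hypothetical and exist only for this example; Cumulus performs its own validation against the API schemas.

```python
import json

# Keys marked "Yes" in the Required column of every provider table above.
REQUIRED_KEYS = ("id", "protocol", "host")


def missing_required_keys(provider: dict) -> list:
    """Return the required keys that are absent from a provider definition.

    Hypothetical helper for illustration; it is not part of Cumulus.
    """
    missing = [key for key in REQUIRED_KEYS if key not in provider]
    # For http and https providers, password is required only if username is set.
    if provider.get("protocol") in ("http", "https"):
        if "username" in provider and "password" not in provider:
            missing.append("password")
    return missing


# Hypothetical S3 provider: "host" is the bucket to pull data from.
s3_provider = {
    "id": "example_s3_provider",
    "protocol": "s3",
    "host": "example-source-bucket",
}

# Hypothetical HTTPS provider with basic authentication and a connection limit.
https_provider = {
    "id": "example_https_provider",
    "protocol": "https",
    "host": "nasa.gov",
    "port": 443,
    "globalConnectionLimit": 10,
    "username": "example-user",
    "password": "example-password",
}

for provider in (s3_provider, https_provider):
    print(provider["id"], "missing:", missing_required_keys(provider))

# The JSON form is what would be pasted into the dashboard or sent to the API.
print(json.dumps(s3_provider, indent=2))
```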

Collections

Breakdown of [s3_MOD09GQ_006.json](https://github.com/nasa/cumulus/blob/master/example/data/collections/s3_MOD09GQ_006/s3_MOD09GQ_006.json)
| Key | Value | Required | Description |
| --- | --- | --- | --- |
| name | "MOD09GQ" | Yes | The name attribute designates the name of the collection. This is the name under which the collection will be displayed on the dashboard |
| version | "006" | Yes | A version tag for the collection |
| granuleId | "^MOD09GQ\\.A[\\d]{7}\\.[\\S]{6}\\.006\\.[\\d]{13}$" | Yes | The regular expression used to validate the granule ID extracted from filenames according to the granuleIdExtraction |
| granuleIdExtraction | "(MOD09GQ\..*)(\.hdf\|\.cmr\|_ndvi\.jpg)" | Yes | The regular expression used to extract the granule ID from filenames. The first capturing group extracted from the filename by the regex will be used as the granule ID. |
| sampleFileName | "MOD09GQ.A2017025.h21v00.006.2017034065104.hdf" | Yes | An example filename belonging to this collection |
| files | <JSON Object> of files defined here | Yes | Describe the individual files that will exist for each granule in this collection (size, browse, meta, etc.) |
| dataType | "MOD09GQ" | No | Can be specified, but this value will default to the collection_name if not |
| duplicateHandling | "replace" | No | One of "replace", "version", or "skip"; determines granule duplicate handling scheme |
| ignoreFilesConfigForDiscovery | false (default) | No | By default, during discovery only files that match one of the regular expressions in this collection's files attribute (see above) are ingested. Setting this to true will ignore the files attribute during discovery, meaning that all files for a granule (i.e., all files with filenames matching granuleIdExtraction) will be ingested even when they don't match a regular expression in the files attribute at discovery time. (NOTE: this attribute does not appear in the example file, but is listed here for completeness.) |
| process | "modis" | No | Example options for this are found in the ChooseProcess step definition in the IngestAndPublish workflow definition |
| meta | <JSON Object> of MetaData for the collection | No | MetaData for the collection. This metadata will be available to workflows for this collection via the Cumulus Message Adapter. |
| url_path | "{cmrMetadata.Granule.Collection.ShortName}/{substring(file.fileName, 0, 3)}" | No | Filename without extension |
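
To make the relationship between granuleIdExtraction, granuleId, and sampleFileName concrete, here is a minimal sketch that applies the two patterns from the table to the sample filename. It uses Python's `re` module purely for illustration (Cumulus itself is not assumed to run this code), and the `extract_granule_id` helper is a hypothetical name.

```python
import re

# Patterns from the collection table above, written as Python raw strings.
# The doubled backslashes in the JSON file are JSON escaping, not part of the regex.
GRANULE_ID = r"^MOD09GQ\.A[\d]{7}\.[\S]{6}\.006\.[\d]{13}$"
GRANULE_ID_EXTRACTION = r"(MOD09GQ\..*)(\.hdf|\.cmr|_ndvi\.jpg)"
SAMPLE_FILE_NAME = "MOD09GQ.A2017025.h21v00.006.2017034065104.hdf"


def extract_granule_id(file_name: str):
    """Return the first capturing group of granuleIdExtraction, or None on no match."""
    match = re.search(GRANULE_ID_EXTRACTION, file_name)
    return match.group(1) if match else None


# Step 1: extract the granule ID from the filename.
granule_id = extract_granule_id(SAMPLE_FILE_NAME)

# Step 2: validate the extracted ID against the granuleId pattern.
is_valid = granule_id is not None and re.match(GRANULE_ID, granule_id) is not None

print(granule_id, is_valid)
```

The same granule ID is extracted from files that differ only in their suffix (for example `.hdf` and `_ndvi.jpg`), which is how several files end up grouped under one granule.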

files-object

| Key | Value | Required | Description |
| --- | --- | --- | --- |
| regex | "^MOD09GQ\\.A[\\d]{7}\\.[\\S]{6}\\.006\\.[\\d]{13}\\.hdf$" | Yes | Regular expression used to identify the file |
| sampleFileName | "MOD09GQ.A2017025.h21v00.006.2017034065104.hdf" | Yes | Filename used to validate the provided regex |
| type | "data" | No | Value to be assigned to the Granule File Type. CNM types are used by Cumulus CMR steps, non-CNM values will be treated as 'data' type. Currently only utilized in DiscoverGranules task |
| bucket | "internal" | Yes | Name of the bucket where the file will be stored |
| url_path | "${collectionShortName}/{substring(file.fileName, 0, 3)}" | No | Folder used to save the granule in the bucket. Defaults to the collection url_path |
| checksumFor | "^MOD09GQ\\.A[\\d]{7}\\.[\\S]{6}\\.006\\.[\\d]{13}\\.hdf$" | No | If this is a checksum file, set checksumFor to the regex of the target file. |
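
The sketch below shows one way to read a files list: each entry's regex selects the filenames it applies to, and the matching entry supplies the bucket. The first entry uses the values from the table above; the `.md5` entry, the "private" bucket name, and the `find_file_config` helper are assumptions added for illustration, not taken from the example collection.

```python
import re

# Illustrative "files" list. Only the first entry mirrors the table above.
FILES = [
    {
        "regex": r"^MOD09GQ\.A[\d]{7}\.[\S]{6}\.006\.[\d]{13}\.hdf$",
        "sampleFileName": "MOD09GQ.A2017025.h21v00.006.2017034065104.hdf",
        "type": "data",
        "bucket": "internal",
    },
    {
        # Hypothetical checksum file entry.
        "regex": r"^MOD09GQ\.A[\d]{7}\.[\S]{6}\.006\.[\d]{13}\.hdf\.md5$",
        "sampleFileName": "MOD09GQ.A2017025.h21v00.006.2017034065104.hdf.md5",
        "bucket": "private",
        # checksumFor holds the regex of the file this checksum belongs to.
        "checksumFor": r"^MOD09GQ\.A[\d]{7}\.[\S]{6}\.006\.[\d]{13}\.hdf$",
    },
]


def find_file_config(file_name: str, files: list):
    """Return the first files entry whose regex matches file_name, else None."""
    for entry in files:
        if re.search(entry["regex"], file_name):
            return entry
    return None


# Each sampleFileName is expected to match its own regex.
samples_ok = all(re.search(e["regex"], e["sampleFileName"]) for e in FILES)

hdf_config = find_file_config("MOD09GQ.A2017025.h21v00.006.2017034065104.hdf", FILES)
md5_config = find_file_config("MOD09GQ.A2017025.h21v00.006.2017034065104.hdf.md5", FILES)
print(samples_ok, hdf_config["bucket"], md5_config["bucket"])
```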

Rules

Rules are used to start processing workflows and the transformation process. Rules can be invoked manually, run on a schedule, or triggered by Kinesis events, SNS messages, or SQS messages.

Rule configuration
| Key | Value | Required | Description |
| --- | --- | --- | --- |
| name | "L2_HR_PIXC_kinesisRule" | Yes | Name of the rule. This is the name under which the rule will be listed on the dashboard |
| workflow | "CNMExampleWorkflow" | Yes | Name of the workflow to be run. A list of available workflows can be found on the Workflows page |
| provider | "PODAAC_SWOT" | No | Configured provider's ID. This can be found on the Providers dashboard page |
| collection | <JSON Object> collection object shown below | Yes | Name and version of the collection this rule will moderate. Relates to a collection configured and found in the Collections page |
| payload | <JSON Object or Array> | No | The payload to be passed to the workflow |
| meta | <JSON Object> of MetaData for the rule | No | MetaData for the rule. This metadata will be available to workflows for this rule via the Cumulus Message Adapter. |
| rule | <JSON Object> rule type and associated values - discussed below | Yes | Object defining the type and subsequent attributes of the rule |
| state | "ENABLED" | No | Either "ENABLED" or "DISABLED"; whether or not the rule will be active. Defaults to "ENABLED". |
| queueUrl | https://sqs.us-east-1.amazonaws.com/1234567890/queue-name | No | URL for SQS queue that will be used to schedule workflows for this rule |
| tags | ["kinesis", "podaac"] | No | An array of strings that can be used to simplify search |

collection-object

| Key | Value | Required | Description |
| --- | --- | --- | --- |
| name | "L2_HR_PIXC" | Yes | Name of a collection defined/configured in the Collections dashboard page |
| version | "000" | Yes | Version number of a collection defined/configured in the Collections dashboard page |

meta-object

| Key | Value | Required | Description |
| --- | --- | --- | --- |
| retries | 3 | No | Number of retries on errors, for sqs-type rule only. Defaults to 3. |
| visibilityTimeout | 900 | No | VisibilityTimeout in seconds for the inflight messages, for sqs-type rule only. Defaults to the visibility timeout of the SQS queue when the rule is created. |

rule-object

| Key | Value | Required | Description |
| --- | --- | --- | --- |
| type | "kinesis" | Yes | One of "onetime", "scheduled", "kinesis", "sns", or "sqs"; type of scheduling/workflow kick-off desired |
| value | <String> Object | Depends | Discussion of valid values is below |
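
Putting the rule, collection-object, and rule-object tables together, a complete rule definition might look like the sketch below. The field values come from the tables above, except for the Kinesis stream ARN, which is a placeholder; the required-key check is only an illustration and does not replace the schema validation that Cumulus performs.

```python
import json

# A rule definition assembled from the example values in the tables above.
rule = {
    "name": "L2_HR_PIXC_kinesisRule",
    "workflow": "CNMExampleWorkflow",
    "provider": "PODAAC_SWOT",
    "collection": {"name": "L2_HR_PIXC", "version": "000"},
    "rule": {
        "type": "kinesis",
        # Placeholder ARN; substitute the ARN of a configured Kinesis stream.
        "value": "arn:aws:kinesis:us-east-1:000000000000:stream/example-stream",
    },
    "state": "ENABLED",
    "tags": ["kinesis", "podaac"],
}

# Keys marked "Yes" in the Required column of the rule configuration table.
REQUIRED_RULE_KEYS = ("name", "workflow", "collection", "rule")
missing = [key for key in REQUIRED_RULE_KEYS if key not in rule]

# Serialize to the JSON form used by the dashboard form or the API.
rule_json = json.dumps(rule, indent=2)
print("missing required keys:", missing)
print(rule_json)
```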

rule-value

The rule value entry depends on the rule type:

  • If this is a onetime rule, this can be left blank.
  • If this is a scheduled rule, this field must hold a valid cron-type expression or rate expression.
  • If this is a kinesis rule, this must be a configured ${Kinesis_stream_ARN}.
  • If this is an sns rule, this must be an existing ${SNS_Topic_Arn}.
  • If this is an sqs rule, this must be an existing ${SQS_QueueUrl} that your account has permissions to access. You must also configure a dead-letter queue for this SQS queue.
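
The per-type expectations above can be summarized as a loose shape check. This is a sketch under several assumptions: the `check_rule_value` helper is hypothetical, scheduled expressions are assumed to be written as `cron(...)` or `rate(...)`, and ARNs are assumed to use the standard `arn:aws:` partition. It only inspects the string and cannot confirm that the stream, topic, or queue exists or that a dead-letter queue is configured.

```python
def check_rule_value(rule_type: str, value: str = "") -> bool:
    """Loosely check the shape of a rule value for a given rule type.

    Illustrative only; prefixes below are simplifying assumptions.
    """
    if rule_type == "onetime":
        # The value can be left blank for onetime rules.
        return True
    if rule_type == "scheduled":
        # Assumed form: cron(...) or rate(...).
        return value.startswith(("cron(", "rate(")) and value.endswith(")")
    if rule_type == "kinesis":
        return value.startswith("arn:aws:kinesis:")
    if rule_type == "sns":
        return value.startswith("arn:aws:sns:")
    if rule_type == "sqs":
        # Queue URLs look like the queueUrl example in the rule table.
        return value.startswith("https://sqs.")
    raise ValueError(f"unknown rule type: {rule_type}")


examples = [
    ("onetime", ""),
    ("scheduled", "rate(5 minutes)"),
    ("kinesis", "arn:aws:kinesis:us-east-1:000000000000:stream/example-stream"),
    ("sns", "arn:aws:sns:us-east-1:000000000000:example-topic"),
    ("sqs", "https://sqs.us-east-1.amazonaws.com/1234567890/queue-name"),
]
for rule_type, value in examples:
    print(rule_type, check_rule_value(rule_type, value))
```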

sqs-type rule features

  • When an SQS rule is triggered, the SQS message remains on the queue.
  • The SQS message is not processed multiple times in parallel when visibility timeout is properly set. You should set the visibility timeout to the maximum expected length of the workflow with padding. Longer is better to avoid parallel processing.
  • The SQS message visibility timeout can be overridden by the rule.
  • Upon successful workflow execution, the SQS message is removed from the queue.
  • Upon failed execution(s), the workflow is run 3 times, or the configured number of times.
  • Upon failed execution(s), the visibility timeout will be set to 5s to allow retries.
  • After the configured number of failed retries, the SQS message is moved to the dead-letter queue configured for the SQS queue.
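
The message life cycle described in the list above can be pictured with a small in-memory simulation. This is not how Cumulus or SQS is implemented: `process_queue` and `run_workflow` are hypothetical names, visibility timeouts are not modeled, and the sketch assumes that each message gets one initial attempt followed by up to `retries` retries.

```python
from collections import deque


def process_queue(messages, run_workflow, retries=3):
    """Simulate success, retry, and dead-letter handling for queued messages.

    Assumption: one initial attempt plus up to `retries` retries per message.
    """
    queue = deque((message, 0) for message in messages)
    completed, dead_letter_queue = [], []
    while queue:
        message, failures = queue.popleft()
        if run_workflow(message):
            # Successful execution: the message is removed from the queue.
            completed.append(message)
        elif failures >= retries:
            # Retries exhausted: the message moves to the dead-letter queue.
            dead_letter_queue.append(message)
        else:
            # Failed execution: the message becomes visible again for a retry.
            queue.append((message, failures + 1))
    return completed, dead_letter_queue


attempts = {}


def run_workflow(message):
    """Stand-in workflow: "flaky" succeeds on its second attempt."""
    attempts[message] = attempts.get(message, 0) + 1
    if message == "always-fails":
        return False
    if message == "flaky":
        return attempts[message] >= 2
    return True


completed, dead = process_queue(["ok", "flaky", "always-fails"], run_workflow)
print(completed, dead, attempts)
```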

Configuration Via Cumulus Dashboard

Create A Provider

  • In the Cumulus dashboard, go to the Provider page.

Screenshot of Create Provider form

  • Click on Add Provider.
  • Fill in the form and then submit it.

Screenshot of Create Provider form

Create A Collection

  • Go to the Collections page.

Screenshot of the Collections page

  • Click on Add Collection.
  • Copy and paste or fill in the collection JSON object form.

Screenshot of Add Collection form

  • Once you submit the form, you should be able to verify that your new collection is in the list.

Create A Rule

  1. Go To Rules Page
  • In the Cumulus dashboard, click on Rules in the navigation.
  • Click Add Rule.

Screenshot of Rules page

  2. Complete Form
  • Fill out the template form.

Screenshot of a Rules template for adding a new rule

For more details regarding the field definitions and required information go to Data Cookbooks.

state field conditional

If the state field is left blank, it defaults to false.

Rule Examples

  • A rule form with completed required fields:

Screenshot of a completed rule form

  • A successfully added Rule:

Screenshot of created rule