ORCA Reconciliation
As part of enhancing ORCA, the working group has identified the need to reconcile the ORCA inventory against the Cumulus inventory to determine which file(s) are missing or extra in each catalog. The ORCA team has met with Cumulus several times to understand current reconciliation patterns. Notes from those interface meetings are available below. In addition, several resources are available from Cumulus and LZARDS, which also handle reconciliation in a manner similar to ORCA. It is understood that ORCA will provide the data, but Cumulus will perform the comparison and reporting work to match current reconciliation patterns.
- January 28, 2020 Meeting Notes
- March 4, 2021 Meeting Notes
- March 29, 2021 Meeting Notes
- LZARDS Prototype Reconciliation Reports
- Cumulus Reconciliation API
- Cumulus table definitions
Metadata Needed for Cumulus Comparison
The following filters can be applied to a query for determining which files to bring back for catalog comparison between the Cumulus inventory and ORCA.
Name | Data Type | Description |
---|---|---|
providerId | List[str] | The unique ID of the provider(s) making the request. |
collectionId | List[str] | The unique ID of collection(s) to compare. |
granuleId | List[str] | The unique ID of granule(s) to compare. |
startTimestamp | timestamp | Start time for cumulus_create_time date range to compare data. |
endTimestamp | timestamp | End time for cumulus_create_time date range to compare data. |
When querying ORCA for this information, the filters are combined with AND logic to produce the comparison answer; a sketch of one way to apply them is shown below.
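As an illustration of that AND behavior, the sketch below builds a parameterized WHERE clause from whichever filters are supplied. The table and column names (granules, cumulus_create_time, and so on) follow the metadata objects described later in this section and are assumptions, not a final query.

```python
# Sketch of combining the optional reconciliation filters with AND logic.
# Table/column names are illustrative assumptions drawn from this document.

def build_filter_clause(filters: dict) -> tuple:
    """Build a parameterized WHERE clause; only supplied filters are applied, all ANDed."""
    clauses = []
    params = {}

    if filters.get("providerId"):
        clauses.append("granules.provider_id = ANY(%(provider_ids)s)")
        params["provider_ids"] = filters["providerId"]
    if filters.get("collectionId"):
        clauses.append("granules.collection_id = ANY(%(collection_ids)s)")
        params["collection_ids"] = filters["collectionId"]
    if filters.get("granuleId"):
        clauses.append("granules.cumulus_granule_id = ANY(%(granule_ids)s)")
        params["granule_ids"] = filters["granuleId"]
    if filters.get("startTimestamp"):
        clauses.append("granules.cumulus_create_time >= %(start_ts)s")
        params["start_ts"] = filters["startTimestamp"]
    if filters.get("endTimestamp"):
        clauses.append("granules.cumulus_create_time <= %(end_ts)s")
        params["end_ts"] = filters["endTimestamp"]

    where = " AND ".join(clauses) if clauses else "TRUE"
    return where, params
```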
The sections below provide information on how these filters relate to the expected metadata objects, the attributes those objects may need for comparison and investigation, and other items related to the data when making the comparison.
The metadata objects include the following.
- Provider (providerId)
- Collection (collectionId)
- Granule (granuleId, startTimestamp, endTimestamp)
- File - Needed for comparison.
- File Version - Needed for comparison, latest version only.
Relationship of the Metadata Objects
- A Provider contains a unique ID and is made up of one or more Collections.
- A Collection contains a unique ID composed of the short name and version, and is made up of one or more Granules.
- A Granule contains a unique ID and is made up of one or more Files.
- A File contains a name unique to the Granule ID and is made up of one or more File Versions.
- A File Version contains the version, size, hash, and ETag of a file along with pertinent dates.

A minimal code sketch of these relationships is shown below.
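To make the relationships above concrete, the sketch below expresses them as SQLAlchemy Core table definitions. This is a rough illustration only; the table names, column types, and constraints are assumptions drawn from the attribute tables later in this section, not a final schema.

```python
# Minimal SQLAlchemy Core sketch of the draft relationships described above.
# Exact types, keys, and constraints are assumptions for illustration only.
from sqlalchemy import (
    TIMESTAMP, BigInteger, Boolean, Column, ForeignKey, MetaData, String, Table,
)

metadata = MetaData()

# A Provider is made up of one or more Collections (the draft design used a
# provider/collection cross-reference table, later removed during refinement).
providers = Table(
    "providers", metadata,
    Column("provider_id", String, primary_key=True),
    Column("name", String),  # optional
)

collections = Table(
    "collections", metadata,
    Column("collection_id", String, primary_key=True),  # short name + version
    Column("shortname", String),  # optional
    Column("version", String),    # optional
)

provider_collection_xref = Table(
    "provider_collection_xref", metadata,
    Column("provider_id", String, ForeignKey("providers.provider_id")),
    Column("collection_id", String, ForeignKey("collections.collection_id")),
)

# A Collection is made up of one or more Granules.
granules = Table(
    "granules", metadata,
    Column("id", String, primary_key=True),  # internal ORCA pseudo key
    Column("collection_id", String, ForeignKey("collections.collection_id")),
    Column("cumulus_granule_id", String),    # unique per collection_id
    Column("cumulus_create_time", TIMESTAMP),
)

# A Granule is made up of one or more Files; a File has one or more File Versions.
files = Table(
    "files", metadata,
    Column("id", String, primary_key=True),  # internal ORCA file ID
    Column("granule_id", String, ForeignKey("granules.id")),
    Column("name", String),                  # unique per granule
    Column("key_path", String),
)

file_versions = Table(
    "file_versions", metadata,
    Column("file_id", String, ForeignKey("files.id")),
    Column("version", String),
    Column("is_latest", Boolean),
    Column("size_in_bytes", BigInteger),
)
```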
Potential Attributes of the Metadata Objects
The following provides the potential metadata attributes that need to be captured for each object in order to properly filter and provide information back to Cumulus for reconciliation and Operator investigation of discrepancies. A draft schema for how the objects would interact and be tied together can be seen below.
Where noted, optional attributes are not strictly needed for reconciliation but may provide value for future enhancements to query and filter the data for data management activities.
Provider Metadata Object
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
provider_id | string | providerId that is provided by Cumulus. | Must be unique. |
name | string | Unique name of the provider. | Optional but may provide more value for future items. |
Collection Metadata Object
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
collection_id | string | collectionId that is provided by Cumulus. | Must be unique per providerId. |
shortname | string | Short name of the collection (AST_L1A). | Optional but may provide more value for future items. |
version | string | Version of the collection (061). | Optional but may provide more value for future items. |
I also considered adding attributes for collection configuration information like the excluded file types, default archive bucket, and other ORCA configurations. Thinking through it, though, I felt it was better to leave these with Cumulus and perform a lookup if they are needed, primarily for two reasons. First, there is really no need for them for reconciliation within ORCA. Second, keeping the values in sync with changes made by operators through the Cumulus Dashboard seemed challenging and did not appear to provide a real benefit.
Granule Metadata Object
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
id | string | Internal ORCA granule ID pseudo key. | None |
collection_id | string | Collection ID from Cumulus that references the Collections table. | None |
cumulus_create_time | timestamp | createdAt time from Cumulus | None |
cumulus_granule_id | string | granuleId that is provided by Cumulus. | Must be unique per collectionId |
execution_id | string | Step function execution ID from AWS. | Unique ID automatically generated by AWS |
ingest_time | timestamp | Date and time in UTC that the data was originally ingested into ORCA. | None |
last_update | timestamp | Date and time in UTC that information was updated. | None |
Note that the combination of the Cumulus Granule ID (cumulus_granule_id) and the Cumulus Collection ID (collection_id) is unique in the Cumulus granules table.
File Metadata Object
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
id | string | Internal ORCA file ID | None |
granule_id | string | Granule that the file belongs to; references the internal ORCA granule ID. | None |
name | string | Name of the file. (MOD14A1.061.2020245.hdf) | Must be unique per granuleId |
orca_archive_location | string | ORCA S3 Glacier bucket the file resides in. | None |
cumulus_archive_location | string | Cumulus S3 bucket the file is expected to reside in. | None |
key_path | string | S3 path to the file including the file name and extension, but not the bucket. | Must be unique per archive location |
ingest_time | timestamp | Date and time in UTC that the data was originally ingested into ORCA | None |
etag | string | etag of the file object in the AWS S3 Glacier bucket. | None |
version | string | AWS provided version of the file. | Must be unique per file name/key_path |
size_in_bytes | int | Size in bytes of the file. From Cumulus ingest. | Part of object passed to archive from Cumulus. |
hash | string | Checksum hash of the file provided by Cumulus. | Optional. Part of object passed to archive from Cumulus. |
hash_type | string | Hash type used to calculate the hash value of the file. | Optional. Part of object passed to archive from Cumulus. |
Note that in Cumulus the combination of the bucket (cumulus_archive_location) and key (key_path) is unique in the Cumulus files table.
File Version Object
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
version | string | AWS provided version of the file. | Must be unique per file name/key_path |
is_latest | boolean | AWS provided flag denoting that it is the latest version. | Only one can be latest per file name/key_path. Depending on the final table structure this may be optional. |
etag | string | AWS provided ETag of the versioned file; semi-unique. | Potentially optional; may be able to be used for internal comparisons and/or data management investigations. |
cumulus_etag | string | Cumulus provided AWS ETag of the file in Cumulus. | Potentially optional. Would need to determine if this information is passed along. May be helpful for data management comparisons. |
size_in_bytes | bigint | Size in bytes of the file. From Cumulus ingest. | Part of object passed to archive from Cumulus. |
hash | string | Checksum hash of the file provided by Cumulus. | Part of object passed to archive from Cumulus. |
hash_type | string | Hash type used to calculate the hash value of the file. | Part of object passed to archive from Cumulus. |
ingest_time | timestamp | Date and time in UTC that the file was ingested into ORCA | None |
It is possible that we may want to merge the File Version Object with the File Object. If we were to merge the two objects, here are some things to consider. By merging the objects we would likely keep only the latest version information of the file in the table. This simplifies reconciliation with Cumulus, but then the only catalog of all the versions available for a file would reside in the Glacier S3 bucket; there would be no other record of the various versions of the file. We would also need to query the S3 bucket if the data manager wanted to know what versions were available for a file in order to restore the proper version or to identify versions that may be eligible for deletion based on policy. If we merge the objects, we may want to capture the total number of versions available for the file.
Return Metadata Needed by Cumulus
At a minimum, the following metadata should be returned to Cumulus upon a successful reconciliation query of ORCA. The data below is what is needed for the reconciliation comparison, along with general metadata that may be needed by the end user to analyze discrepancies.
- Provider ID (providerId)
- Collection ID (collectionId)
- Shortname
- Collection Version
- Granule ID (granuleId)
- Cumulus create time
- Execution ID
- Granule Ingest Date (ORCA) - may also be referred to as a Create Date.
- Granule Last Update (ORCA)
- File Name
- ORCA Location - archive bucket + archive path
- Cumulus Archive Location
- Key Path
- Etag
- File Size
- File Hash (Optional)
- File Hash Type (Optional)
- File Version (always send back latest version to compare)
A rough structure may look like the following. In order to simplify comparisons, only the latest file version information will be returned. Enhancements we will need to consider later include items like updating the latest flag in the database and Glacier on a file version when doing a recovery.
{
"anotherPage": false,
"granules": [
{
"providerId": "lpdaac",
"collectionId": "MOD14A1___061",
"id": "MOD14A1.061.A23V45.2020235",
"createdAt": "2020-01-01T23:00:00Z",
"executionId": "u654-123-Yx679",
"ingestDate": "2020-01-01T23:00:00Z",
"lastUpdate": "2020-01-01T23:00:00Z",
"files": [
{
"name": "MOD14A1.061.A23V45.2020235.2020240145621.hdf",
"cumulusArchiveLocation": "cumulus-bucket",
"orcaArchiveLocation": "orca-archive",
"keyPath": "MOD14A1/061/032/MOD14A1.061.A23V45.2020235.2020240145621.hdf",
"sizeBytes": 100934568723,
"hash": "ACFH325128030192834127347",
"hashType": "SHA-256",
"version": "VXCDEG902"
},
...
]
},
...
]
}
Some items we will need to think about include managing paging (start/stop) for large returns and limiting the size of the message across the wire. Additionally, results should be ordered by granule_id. A rough paging sketch is shown below.
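As a rough illustration of one paging approach, the sketch below fetches one extra row per page to populate the anotherPage flag from the structure above. The page size, SQL, and cursor interface are placeholders for whatever paging design is ultimately chosen.

```python
# Sketch of offset-based paging for reconciliation results, ordered by granule id.
# The page size, table/column names, and DB-API cursor usage are assumptions.
PAGE_SIZE = 100  # assumed page size; tuned to limit message size across the wire

def get_granule_page(cursor, where_clause: str, params: dict, page_index: int) -> dict:
    """Return one page of granules plus an anotherPage flag for the caller."""
    sql = f"""
        SELECT id, collection_id, cumulus_granule_id, cumulus_create_time
          FROM granules
         WHERE {where_clause}
         ORDER BY cumulus_granule_id
         LIMIT %(limit)s OFFSET %(offset)s
    """
    # Fetch one extra row so we know whether another page exists.
    cursor.execute(sql, {**params, "limit": PAGE_SIZE + 1, "offset": page_index * PAGE_SIZE})
    rows = cursor.fetchall()
    return {
        "anotherPage": len(rows) > PAGE_SIZE,
        "granules": rows[:PAGE_SIZE],
    }
```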
Potential ORCA Reconciliation Data Reporting Architecture
The following provides a rough, high level architecture for reporting the ORCA holdings to Cumulus. When providing data back to Cumulus, the solution should be able to handle the filters supplied and return back the needed data. In addition, the solution should be extensible and should allow for potential changes in the underlying data structure or storage engine as ORCA should be able to handle AWS PostgreSQL RDS, AWS Aurora PostgreSQL, and AWS Aurora PostgreSQL Serverless offerings.
Populating the metadata
Populating the metadata objects with the proper metadata will need to come from various sources. The sections below go through thoughts on where the metadata will originate and potential ideas on what the underlying architecture may look like to perform the tasks.
Automated Cumulus Ingest Workflow
The conventional way of populating the objects with metadata would be through the normal Cumulus ingest workflow that contains ORCA's copy_to_archive lambda. The copy_to_archive lambda would need to be modified to take or pull the additional metadata needed to associate a granule's files to Cumulus, like providerId, collectionId, granuleId, and various other file metadata. The additional collected metadata would be sent to a queue to be written to the ORCA catalog. The body sent to the queue would have enough information for another ORCA lambda to write and/or update the records for the various objects in the ORCA catalog.
Note that this new database write lambda or the copy_to_archive lambda may need to make some additional queries to the S3 ORCA bucket to pull information like version and other values. Looking at the original copy_to_archive, s3.copy is used to copy data to the ORCA archive. Based on initial research located here, the s3.copy command does not return any information; see the test performed below. Because of this, we will likely need to perform a listing on the object to get additional information like the S3 object version, as seen in the sample below.
Some items to consider when looking at the options for retrieving the additional
metadata include:
- speed - How fast can I get the needed metadata?
- cost - What is the cost impact of additional gets on the s3 bucket?
- complexity - Where is the best place to put these additional gets? Is the overall design and/or implementation too complex?
Boto3 Copy Test
import boto3
# Create the Client
s3_client = boto3.resource('s3')
# Get a source and destination to copy
copy_source = {"Bucket": "orca-sandbox-s3-provider", "Key": "MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065111.hdf"}
copy_destination = "orca-sandbox-archive"
# perform the copy command
copy_command = s3_client.meta.client.copy(copy_source, copy_destination, copy_source["Key"])
# Return results
print(copy_command)
None
Boto3 Get Additional File Information
This would occur after an s3.copy and assumes that the copy is synchronous and that the file is not being ingested in parallel in a separate ingest pipeline, which would lead to a collision.
import boto3
# Create the Client
s3_client = boto3.resource('s3')
# Get a source and destination to copy
copy_source = {"Bucket": "orca-sandbox-s3-provider", "Key": "MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065111.hdf"}
copy_destination = "orca-sandbox-archive"
# Get the versioning information
file_versions = s3_client.meta.client.list_object_versions(Bucket=copy_destination, Prefix=copy_source["Key"] )
for version in file_versions["Versions"]:
if version["IsLatest"]:
print(version)
# Return results
{
'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"',
'Size': 6,
'StorageClass': 'STANDARD',
'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065111.hdf',
'VersionId': 'null',
'IsLatest': True,
'LastModified': datetime.datetime(2021, 5, 18, 16, 25, 58, tzinfo=tzutc()),
'Owner': {
'DisplayName': 'gsfc-esdis-edc-lpdaac-sandbox-7343-root',
'ID': '4d102362d8dc848404655e5418ff76cd342418b8d2ea3a04aa5db133bd6ac1db'
}
}
Note that the data we would likely want are ETag, VersionId, Size, and LastModified. In the example above, the file I used was not part of a versioned S3 bucket, thus the null value for VersionId.
Boto3 Use Multipart Copy
In addition to the items above, I also looked at multipart copy, which would return the information natively; however, there is a 5 MB minimum part size limit when using this functionality, which rules it out for some of the small files we may encounter. Reference information on multipart copy is provided below, along with a rough sketch of the calls involved.
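For reference, a hedged sketch of what that multipart copy path could look like with the low-level boto3 client calls is shown below. It copies the object as a single part purely for illustration; real usage would split larger objects into parts and is still subject to the part size limits noted above. Bucket and key names reuse the sandbox values from the earlier examples.

```python
# Hedged sketch of the multipart copy path. Unlike s3.copy(), the
# complete_multipart_upload response returns the ETag (and VersionId on
# versioned buckets) directly.
import boto3

s3 = boto3.client("s3")

source = {"Bucket": "orca-sandbox-s3-provider",
          "Key": "MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065111.hdf"}
destination_bucket = "orca-sandbox-archive"
destination_key = source["Key"]

# Start the multipart upload on the destination object.
mpu = s3.create_multipart_upload(Bucket=destination_bucket, Key=destination_key)

# Copy the source object as a single part for illustration only.
part = s3.upload_part_copy(
    Bucket=destination_bucket,
    Key=destination_key,
    CopySource=source,
    PartNumber=1,
    UploadId=mpu["UploadId"],
)

# Completing the upload returns the metadata we are after (ETag, VersionId).
result = s3.complete_multipart_upload(
    Bucket=destination_bucket,
    Key=destination_key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": [{"ETag": part["CopyPartResult"]["ETag"], "PartNumber": 1}]},
)
print(result.get("ETag"), result.get("VersionId"))
```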
Manually Run ORCA Ingest Workflows
When manually running a workflow to ingest data into ORCA, it is assumed that the same modified copy_to_archive lambda used in an automated workflow would be used in the manual workflow. This would mean the translation lambda used to pass the proper information from the Cumulus Dashboard call to the copy_to_archive lambda would need to retrieve and provide information similar to what the moveGranules lambda provides in the normal workflow. The translation lambda will likely need to leverage Cumulus API calls to get information related to the files and granule from Cumulus. The following cards, planned to be worked during PI 21.2, contain information related to this type of workflow and translation lambda. See the list below.
- Understanding the Cumulus Dashboard Workflow Call Output
- Creating a Translate Lambda for Cumulus Dashboard Inputs for copy_to_archive
- Creating a Workflow for Ingesting Missing Files Into ORCA
Work related to reconciliation will likely require modifications to the translation lambda to pull additional information needed. The following are resources that may be utilized to gather additional granule and file information from Cumulus if needed. See the resource links below.
Note that the CMR call may not be necessary depending on the metadata required. In addition, the CMR concept ID can be retrieved from the cmrLink key in the Cumulus API Get Granule Information call. There may be some permission issues to work through if the collection is not public or the granule is hidden for some reason, which means we would need to perform a login with a user that has access to see all of the provider's holdings. Additional research may need to be performed to better understand how to gather this information.
Back Population of the ORCA Catalog
Generally, the first two options listed will handle most use cases for adding information to the ORCA archive. However, for users that already have ORCA installations that are older than or equal to v3.x, a new workflow will need to be created to back populate the ORCA metadata catalog. Ideally, this workflow would be kicked off as part of the ORCA installation and/or migration.
As a general breakdown of the flow, the Lambda or Workflow will need to do the following:
- Query Cumulus looking for Collections configured for ORCA.
- Using the list of Collections configured for ORCA, retrieve the granule and file information.
- Get a list of the files in each ORCA bucket.
- Match attributes from the ORCA file list like file name, key path, and size to file(s) in the Cumulus list.
- Create a message body from the information, similar to what copy_to_archive produces.
- Send the message body to the SQS FIFO queue so the database lambda can write the record to the database (see the sketch below).
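A minimal sketch of that final step, sending a matched record to the FIFO queue, follows. The queue URL, message grouping strategy, and body layout are illustrative assumptions rather than the final ORCA message contract.

```python
# Minimal sketch of pushing one catalog record onto the SQS FIFO queue at the
# end of the back-population flow. Queue URL and body layout are hypothetical.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/orca-metadata-queue.fifo"  # hypothetical

def send_catalog_record(granule: dict) -> None:
    """Send one matched granule/file record to the FIFO queue for the database lambda."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(granule),
        # Group by granule so records for the same granule are processed in order.
        MessageGroupId=granule["cumulus_granule_id"],
        # A deduplication id keeps re-runs of the back-population job idempotent.
        MessageDeduplicationId=f'{granule["collection_id"]}:{granule["cumulus_granule_id"]}',
    )
```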
There are likely several ways to optimize this flow; the above is just a general concept to help get the information needed to match files. Collection regex information could also possibly be used. The Cumulus files table allows us to match the key based on our listing, but only the combination of key and Cumulus bucket is unique, which means we could match multiple records based on key alone.
Note that this concept of back populating by matching data is not 100% accurate and is a best attempt to pull data and match it without "re-ingesting" items back into ORCA. This approach will likely need lots of logging and reporting to let end users know where discrepancies and best guesses or misses occur. We should also talk with end users about this option prior to implementing to determine if this is needed.
The concept above was discussed with the ORCA Working Group during the June meeting. The general consensus from the group was that re-population of ORCA through a standalone workflow, similar to the items laid out in the manual run, was the best option. In addition, the question of whether the back-population method described above was a worthwhile path to pursue was posed on the ORCA Slack channel, and the users currently using ORCA all agreed they would rather re-ingest into ORCA, as seen in the Slack thread here.
Potential ORCA Catalog Population Architecture
The following is a potential architecture for populating the ORCA catalog under normal use cases. Note that the SQS queue and orca_catalog_ingest lambda could be replaced with a GraphQL server if deemed appropriate.
Possible Technologies for Enabling the Work
The following is a short list of technologies that could be leveraged during this work. These technologies fall within current architectures for Cumulus and extensions and may contribute to more rapid development of features as we progress. A full list of AWS services available in NGAP is available here.
API Gateway
The ORCA API Gateway provides an entry point for the Cumulus dashboard and users to interact directly with ORCA. For more information on the ORCA API gateway, see the API Gateway research page. All reconciliation calls from Cumulus should occur through the ORCA API Gateway.
GraphQL
GraphQL is a query language for APIs. This is an additional research spike topic. The promise of this route is further consolidation of our database read/write code while providing a single API endpoint for clients (both ORCA and Cumulus) to retrieve data from ORCA. Though this technology adds some complexity, it could simplify some code and design aspects.
Research aspects would include which GraphQL server to use (COTS or home built), impacts to the AWS account (cost, speed, capabilities, security), and synergy with other projects like Cumulus.
Some initial resources to start would include the following:
- GraphQL Servers (prebuilt)
- Build your Own GraphQL Server
- GraphQL classes
- GraphQL Example with CMR
- Using Apollo Server as a Lambda
SQS
AWS SQS FIFO queues should be used to manage the records to be written to the ORCA database. This allows the team to decouple the database code from several lambdas and centralize the code to one area. This may or may not be needed if GraphQL is utilized.
AWS PostgreSQL RDS Offerings
To better align with Cumulus, ORCA will utilize the AWS PostgreSQL RDS offerings. See notes on the ORCA transition to the ORCA database here. The resultant code from the reconciliation work should be able to handle working with a PostgreSQL back end. Both PostgreSQL RDS and Amazon Aurora are viable back end choices that may be utilized. See a comparison article on the databases here.
NodeJS (Knex) / Python (SQLAlchemy)
A language-agnostic framework for working with the database should be used; some options include, but are not limited to, the items described below. Utilizing a more general API may help manage changes to the back-end database engine with little or no code changes.
- Node.js and Knex is what the Cumulus team currently utilizes for database interactions. Implementation could follow best practices already established by the Cumulus team.
- Python and SQLAlchemy Core is currently being used in some ORCA code. A short example of this style is shown below.
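As a short, hedged example of the SQLAlchemy Core style, the snippet below defines a minimal granules table and runs a small query. The connection string is a placeholder; switching between PostgreSQL RDS and Aurora PostgreSQL would only require changing that string, not the query code.

```python
# Minimal SQLAlchemy Core (1.4+ style) example; table columns echo the granules
# sketches above and the connection string is a placeholder, not a real endpoint.
from sqlalchemy import TIMESTAMP, Column, MetaData, String, Table, create_engine, select

metadata = MetaData()
granules = Table(
    "granules", metadata,
    Column("cumulus_granule_id", String),
    Column("collection_id", String),
    Column("cumulus_create_time", TIMESTAMP),
)

engine = create_engine("postgresql://orca_user:change_me@db-host:5432/orca")  # placeholder

with engine.connect() as connection:
    query = (
        select(granules.c.cumulus_granule_id, granules.c.cumulus_create_time)
        .where(granules.c.collection_id == "MOD14A1___061")
        .order_by(granules.c.cumulus_granule_id)
    )
    for row in connection.execute(query):
        print(row)
```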
Comparison between ORCA S3 and the ORCA Catalog
In addition to the comparison between the Cumulus catalog and ORCA, ORCA will also need to manage and maintain the catalog holdings between the ORCA database and the actual contents of the ORCA S3 buckets. This will require comparing the Key, LastModified, ETag, and Size values of the files in the bucket to the ORCA holdings catalog. Discrepancies should be reported back to the user and detail orphan and missing files. We could possibly create a temporary reporting table to hold these values that the Cumulus Dashboard could display for us.
Some things to think through when designing this solution are factors of scale and managing them. To provide some thought and guidance, LP DAAC (one of the larger DAACs) has approximately 300 million files in the archive and growing. If all the files are also stored in ORCA, we will need to reconcile all the files in a smart and somewhat quick way on a semi regular basis.
Ideally, this reconciliation would occur on a regular schedule using a rule and reports would be stored in the database for users to retrieve the information. Additionally, the code should handle failures gracefully and be able to pick up where it left off either gathering the current contents in S3 or performing the ORCA catalog internal comparisons with S3.
Future enhancements may include displaying the information on the Cumulus dashboard, notifying operators via email or other means of discrepancies, or automated healing of the archive based on reconciliation results.
Getting a Listing from AWS S3
To get a file listing of an S3 bucket, the list_objects_v2 method can be used from the boto3 library. An example of the usage can be seen below. Some notable items if this library is used include the following.
- Only a max of 1000 objects are returned per call. Additional logic will be needed to page through and get all of the results.
- The listing can be filtered by prefix. There is possibly some room here to add configuration to look at the collection level if a collection can be denoted by prefix.
- Version information is not included, only the latest file information is provided.
- An additional call will need to be made per object to look at versions available for a file.
- Only limited fields are returned from the call.
- Each list call does cost in AWS (tiny fractions of a cent).
Another consideration: it may be easier to load the data from the listing calls into a temporary table in the database and perform a handful of queries to find the delta information, instead of looping through and doing a comparison in memory. A sketch of what those delta queries might look like is shown below.
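To illustrate the temporary table idea, the hedged sketch below shows the kind of delta queries the database could run once the listing is loaded. The table and column names anticipate the reconcile_s3_object and File objects described elsewhere in this document and are assumptions about the final schema.

```python
# Sketch of database-side delta queries once the S3 listing has been loaded
# into a temporary table. Table and column names are assumed, not final.

# Files present in S3 but absent from the ORCA catalog (orphans).
FIND_ORPHANS_SQL = """
    SELECT s3.key_path, s3.etag, s3.last_update, s3.size_in_bytes
      FROM reconcile_s3_object s3
      LEFT JOIN files f
        ON f.orca_archive_location = s3.orca_archive_location
       AND f.key_path = s3.key_path
     WHERE s3.job_id = %(job_id)s
       AND f.id IS NULL;
"""

# Files present in the ORCA catalog but absent from S3 (phantoms).
FIND_PHANTOMS_SQL = """
    SELECT f.key_path, f.etag, f.size_in_bytes
      FROM files f
      LEFT JOIN reconcile_s3_object s3
        ON s3.orca_archive_location = f.orca_archive_location
       AND s3.key_path = f.key_path
       AND s3.job_id = %(job_id)s
     WHERE f.orca_archive_location = %(orca_archive_location)s
       AND s3.key_path IS NULL;
"""
```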
Example Using list_objects_v2
import boto3
# Set the variables
s3_client = boto3.resource('s3')
copy_destination = "orca-sandbox-archive"
# Get the listing
files = s3_client.meta.client.list_objects_v2(Bucket=copy_destination)
# Print the contents
for file in files["Contents"]:
print(file)
# Results
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065104.cmr.xml', 'LastModified': datetime.datetime(2020, 12, 15, 20, 4, 8, tzinfo=tzutc()), 'ETag': '"aaf96912124e107bf22016c7695fdcef"', 'Size': 2320, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065104.hdf', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 58, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065104.hdf.met', 'LastModified': datetime.datetime(2020, 12, 15, 20, 4, 7, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065104_ndvi.jpg', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 59, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065105.cmr.xml', 'LastModified': datetime.datetime(2020, 12, 17, 16, 35, 22, tzinfo=tzutc()), 'ETag': '"b4b1ea93026eece5390278d97bfd5c3e"', 'Size': 2320, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065105.hdf', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 58, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065105.hdf.met', 'LastModified': datetime.datetime(2020, 12, 17, 16, 35, 22, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065105_ndvi.jpg', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 58, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065106.cmr.xml', 'LastModified': datetime.datetime(2020, 12, 17, 17, 5, 43, tzinfo=tzutc()), 'ETag': '"67f535ab5d6398c2181ab7cf4763f84b"', 'Size': 2320, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065106.hdf', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 58, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065106.hdf.met', 'LastModified': datetime.datetime(2020, 12, 17, 17, 5, 42, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065106_ndvi.jpg', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 58, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065107.cmr.xml', 'LastModified': datetime.datetime(2020, 12, 17, 20, 58, 55, tzinfo=tzutc()), 'ETag': '"a71512f2cc5aa43edbf3573c76ccf5aa"', 'Size': 2320, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065107.hdf', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 58, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065107.hdf.met', 'LastModified': datetime.datetime(2020, 12, 17, 20, 58, 54, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065107_ndvi.jpg', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 59, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065108.cmr.xml', 'LastModified': datetime.datetime(2020, 12, 18, 19, 10, 16, tzinfo=tzutc()), 'ETag': '"6d76238799f45466a9431a9187aa5c70"', 'Size': 2320, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065108.hdf', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 58, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065108.hdf.met', 'LastModified': datetime.datetime(2020, 12, 18, 19, 10, 14, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065108_ndvi.jpg', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 59, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065109.cmr.xml', 'LastModified': datetime.datetime(2020, 12, 18, 19, 19, 21, tzinfo=tzutc()), 'ETag': '"f7ffec3ced795379e2346183221bd5e8"', 'Size': 2320, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065109.hdf', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 57, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065109.hdf.met', 'LastModified': datetime.datetime(2020, 12, 18, 19, 19, 20, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065109_ndvi.jpg', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 57, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065110.cmr.xml', 'LastModified': datetime.datetime(2020, 12, 18, 19, 28, 44, tzinfo=tzutc()), 'ETag': '"9fe1b5cf984cd64b8b7c31b49560ca6e"', 'Size': 2320, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065110.hdf', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 57, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065110.hdf.met', 'LastModified': datetime.datetime(2020, 12, 18, 19, 28, 44, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065110_ndvi.jpg', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 57, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065111.hdf', 'LastModified': datetime.datetime(2021, 5, 18, 16, 25, 58, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'STANDARD'}
{'Key': 'MOD09GQ/006/MOD09GQ.A2017025.h21v00.006.2017034065111_ndvi.jpg', 'LastModified': datetime.datetime(2021, 3, 23, 16, 59, 58, tzinfo=tzutc()), 'ETag': '"81f4b6c158d25f1fe916ea52e99d1700"', 'Size': 6, 'StorageClass': 'GLACIER'}
Potential Internal Reconciliation Architecture
The following provides a rough architecture of how ORCA internal reconciliation may work in an AWS environment with Cumulus. The architecture uses AWS' S3 Inventory report combined with S3 triggering to fire the comparison off on a regular schedule. The running of this type of job will likely need to scale. When returning results through the API, it is recommended that we use a "path" parameter that filters based on the file's 'Key'. This will allow Cumulus to build values such as collection into a filter, so long as they construct keys using it.
S3 Inventory Report
S3 Inventory reports are created on S3 buckets and post information about that bucket's contents to another bucket. For our purposes, the following options must be checked:
- Size
- Last modified
- Storage class
- ETag
Additionally, set Object versions to Include all versions. A hedged boto3 sketch of this inventory configuration is shown below.
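This is a sketch only; the bucket names, configuration id, report destination, and schedule are placeholders.

```python
# Hedged sketch of configuring the S3 Inventory report with boto3 so the fields
# and versioning options listed above are included. All names are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_inventory_configuration(
    Bucket="orca-sandbox-archive",            # ORCA archive bucket being inventoried
    Id="orca-internal-reconciliation",        # hypothetical inventory configuration id
    InventoryConfiguration={
        "Id": "orca-internal-reconciliation",
        "IsEnabled": True,
        "IncludedObjectVersions": "All",      # "Include all versions" as noted above
        "OptionalFields": ["Size", "LastModifiedDate", "StorageClass", "ETag"],
        "Schedule": {"Frequency": "Daily"},   # assumed schedule
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::orca-sandbox-reports",  # hypothetical report bucket
                "Format": "CSV",
                "Prefix": "inventory",
            }
        },
    },
)
```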
Potential Internal Reporting Database Structure
The objects needed for managing ORCA internal reconciliation and reporting are provided in detail below.
An example schema showing how the objects interact and would be tied together is shown below.
Status Table reconcile_status
Reference table for valid status values and status order.
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
id | int2 | Status ID. | Primary Key |
value | string | Human readable status value. |
Internal Reconciliation Jobs Table reconcile_job
This table would capture internal reconciliation job information. Below is a sampling of attributes that may be needed; based on design and implementation, additional attributes may be required. This table in particular may need additional items for managing jobs.
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
id | int2 | Unique Job ID. | Primary Key |
orca_archive_location | string | ORCA S3 Archive bucket the reconciliation targets. | Location in S3. |
status_id | int8 | Current status of the job. | Generally reflects the stages of reconciliation. |
inventory_creation_time | timestamptz | Inventory report initiation time from the s3 manifest. | Cannot be NULL |
start_time | timestamptz | Start time of the job. | Cannot be NULL |
last_update | timestamptz | Last time the status was updated. | Cannot be NULL |
end_time | timestamptz | End time of the job. | NULLABLE column |
error_message | string | Job error message in cases of failure. | NULLABLE column |
S3 Inventory Temporary Table reconcile_s3_object
This is a temporary object that would store the listing from the S3 bucket for comparison.
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
job_id | int8 | Reconcile job the listing is a part of. | Many to one relationship with the jobs table. |
orca_archive_location | string | ORCA S3 Archive bucket the file resides in. | Location in S3. |
key_path | string | Key key value from S3 listing. | Should be able to match to the key_path in the ORCA catalog File metadata object. Must be unique. |
etag | string | ETag key value from S3 listing. | Should be able to match to the orca_etag value in the ORCA catalog File metadata object. |
last_update | timestamptz | LastModified key value from the S3 listing. | Should be able to match the last_update value in the ORCA catalog File metadata object. |
size_in_bytes | int8 | Size key value from the S3 listing. | Should be able to match the size value in the ORCA catalog File metadata object. |
storage_class | string | StorageClass key value from the S3 listing. | May be used as a filter to exclude newly added items by excluding those with a value of STANDARD |
Internal Reconciliation Reporting Table Orphaned Files reconcile_orphan_report
This table would capture the files that reside in S3 but have no match in the ORCA catalog. Generally, these files would be deleted or investigated to see if items need to be re-ingested into ORCA. Note that entries with a last_update time after or immediately before the job's start_time may not have catalog entries yet due to a slight delay in the catalog population. This should be indicated when reporting results.
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
job_id | int8 | Reconcile job the report is a part of. | FK constraint back to the jobs table with a many to one relationship. |
key_path | string | Key key value from S3 listing. | |
etag | string | ETag key value from S3 listing. | |
last_update | timestamptz | LastModified key value from the S3 listing. | |
size_in_bytes | int8 | Size key value from the S3 listing. | |
storage_class | string | StorageClass key value from the S3 listing. |
Internal Reconciliation Reporting Table S3 Missing reconcile_phantom_report
This table captures files that have records in the ORCA catalog but are missing from S3. These files may have been deleted outside of a proper deletion flow and may need to have their catalog entries manually cleared. Note that any entries with an orca_last_update after the job's start_time may show as an error due to race conditions with the catalog update process, and should be displayed as such when reporting results.
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
job_id | int8 | Reconcile job the report is a part of. | FK constraint back to the jobs table with a many to one relationship. |
collection_id | string | Collection ID from the ORCA Collection object the file belongs to. | From Catalog. |
granule_id | string | Granule ID from the ORCA Granule object the file belongs to. | From Catalog. |
file_name | string | File name from the ORCA File object the file belongs to. | From Catalog. |
key_path | string | Key path that includes the file name from the ORCA File object the file belongs to. | From Catalog. |
orca_etag | string | Etag value from the ORCA File object the file belongs to. | From Catalog. |
orca_last_update | timestamptz | Last Update value from the ORCA File object the file belongs to. | From Catalog. |
orca_size | int8 | Size value from the ORCA File object the file belongs to. | From Catalog. |
Internal Reconciliation Reporting Table Catalog Missing or Mismatches reconcile_mismatch_report
This table would capture the discrepancies for reporting out to the operators about files that are missing from ORCA S3 or have different metadata values than what is expected. Note that any entries with an orca_last_update after the job's start_time may show as an error due to race conditions with the catalog update process, and should be displayed as such when reporting results.
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
job_id | int8 | Reconcile job the report is a part of. | FK constraint back to the jobs table with a many to one relationship. |
collection_id | string | Collection ID from the ORCA Collection object the file belongs to. | From Catalog. |
granule_id | string | Granule ID from the ORCA Granule object the file belongs to. | From Catalog. |
file_name | string | File name from the ORCA File object the file belongs to. | From Catalog. |
key_path | string | Key path that includes the file name from the ORCA File object the file belongs to. | From Catalog. |
orca_etag | string | Etag value from the ORCA File object the file belongs to. | From Catalog. |
s3_etag | string | ETag key value from S3 listing. | |
orca_last_update | timestamptz | Last Update value from the ORCA File object the file belongs to. | From Catalog. |
s3_last_update | timestamptz | LastModified key value from the S3 listing. | |
orca_size_in_bytes | int8 | Size value from the ORCA File object the file belongs to. | From Catalog. |
s3_size_in_bytes | int8 | Size key value from the S3 listing. | |
discrepancy_type | string | Type of discrepancy encountered like "value x mismatch" |
Open Questions yet to be answered
- What is the proper date to use when receiving a date range from Cumulus for a reconciliation report?
- Should granule acquisition date be a part of the metadata?
- Should optional identified attributes be added and populated now or should we start with the minimum needed?
- Is GraphQL a good replacement for the current SQS - Lambda architecture?
Additional Artifacts from the Research
As part of the research, additional artifacts were created in order to help speed along prototyping and design. Information and instructions on the artifacts are provided below.
Aqua Data Studio ERD File
An entity relationship diagram (ERD) was created in Aqua Data Studio (ADS). The ERD provides table, constraint, and reference information for the various objects in the PostgreSQL database used by ORCA. The ERD diagram can be used for development and scripting as well as a way to visualize object interactions. The ADS ERD for ORCA can be downloaded here.
Schema Install SQL Scripts
In order to quickly prototype, two SQL install scripts were created. The scripts install the ORCA inventory catalog and the internal reconciliation tables respectively. The following preconditions are expected before the install scripts are run.
- The latest ORCA schema is installed in PostgreSQL (v3.x)
- The user is logged into the orca database as the admin user to run the script.
The ORCA inventory catalog build script is available for download here.
The ORCA internal reconciliation tables build script is available for download here.
Notes from ORCA schema refinement research
As part of the research, some refinements were made to the ORCA inventory catalog schema. This includes refinements to the granules table and removal of the provider_collection_xref table and the relationship between provider and collection.
- execution_id has been added to help with logging and tracing.
- cumulus_create_time has been added for time-based filtering with Cumulus. The Cumulus createdAt time will be a consistent time source between catalogs.
- provider_id has been added to create a relationship between granule and provider.
- archive_location has been removed since orca_archive_location already exists in the files table.
Refined Granule Metadata Object
Attribute Name | Data Type | Description | Notable Items |
---|---|---|---|
cumulus_granule_id | string | granuleId that is provided by Cumulus. | Must be unique per collectionId |
provider_id | string | providerId that is provided by Cumulus | Ties a granule to a provider. |
collection_id | string | collectionId that is provided by Cumulus | Ties a granule to a collection. The collectionId and granuleId combination must be unique. |
execution_id | string | Step function execution ID from AWS. | Unique ID automatically generated by AWS |
cumulus_create_time | timestamp | createdAt time provided by Cumulus when the granule was ingested. | This is the Cumulus archive time used for time based comparisons. |
ingest_time | timestamp | Date and time in UTC that the data was originally ingested into ORCA | None |
last_update | timestamp | Date and time in UTC that information was updated. | Potential time filter for Cumulus reconciliation. |
It is recommended to add tags for hash and hash_type to files that get copied over to the ORCA S3 bucket when the values are available. A minimal tagging sketch is shown below.
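The sketch below shows what that tagging could look like with boto3; the bucket, key, and tag values are illustrative placeholders taken from earlier examples in this document.

```python
# Minimal sketch of tagging an archived object with the hash and hash_type
# passed along from Cumulus. Bucket, key, and values are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_object_tagging(
    Bucket="orca-sandbox-archive",
    Key="MOD14A1/061/032/MOD14A1.061.A23V45.2020235.2020240145621.hdf",
    Tagging={
        "TagSet": [
            {"Key": "hash", "Value": "ACFH325128030192834127347"},
            {"Key": "hash_type", "Value": "SHA-256"},
        ]
    },
)
```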
Refined ERD File & SQL script
The refined ADS ERD for ORCA can be downloaded here. The refined ORCA inventory catalog build script is available for download here.
Addressing open questions
- What is the proper date to use when receiving a date range from Cumulus for a reconciliation report?
  - After discussion with the Cumulus team, we have decided to use cumulus_create_time, which is passed by the Cumulus createdAt key in the CMA message and contains the timestamp that the granule began ingest into Cumulus. This timestamp will be used to filter results for Cumulus when asking for ORCA catalog inventory information and will be captured in the Cumulus granules table.
- Should granule acquisition date be a part of the metadata?
  - At this time, we are not going to include the granule acquisition date since we cannot get it natively from Cumulus, but we might add it in the future.
- Should optional identified attributes be added and populated now or should we start with the minimum needed?
  - Added the step function's execution_id to the granules table.
- Is GraphQL a good replacement for the current SQS - Lambda architecture?
  - Based on the research performed and the prototype created with GraphQL, it is not currently a good replacement due to the lack of documentation, which makes implementation difficult. Check the GraphQL research notes for more information.