Deep Archive Storage Migration Research Notes

Overview

The goal of this research is to provide an initial design and recommendation for migrating a customer's ORCA Flexible Retrieval (formerly Glacier) archive to the Deep Archive storage class. S3 Glacier Deep Archive is currently the lowest-cost archive storage class in AWS S3. For archive data that does not require immediate access, such as backup or disaster recovery use cases, Deep Archive can be a wise choice.

  • Storage cost is approximately $1 per TB per month.
  • Objects in Deep Archive can be restored within 12-48 hours depending on the retrieval tier. Once the restore completes, users can send a GET request to retrieve the object. As with Flexible Retrieval, the restored copy is only available for a configured number of days. For more information on retrieving objects from the Flexible Retrieval and Deep Archive storage classes, see the [documentation](https://docs.amazonaws.cn/en_us/AmazonS3/latest/userguide/restoring-objects-retrieval-options.html). A sketch of checking an object's restore status with boto3 is shown after this list.
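As a rough illustration of checking whether a restore has completed, the sketch below uses the boto3 head_object call and the Restore response header. The bucket and key names are placeholders, not values from this project.

import boto3

s3 = boto3.client('s3')

# Placeholder bucket and key used for illustration only.
bucket = 'example-orca-archive-bucket'
key = 'example/granule/file.xml'

response = s3.head_object(Bucket=bucket, Key=key)

# For an archived object, the Restore header reports whether a restore is in
# progress or, once complete, when the temporary copy expires, for example:
#   'ongoing-request="true"'                              -> restore still running
#   'ongoing-request="false", expiry-date="Fri, 21 ..."'  -> temporary copy available
restore_status = response.get('Restore')
if restore_status and 'ongoing-request="false"' in restore_status:
    # The temporary copy is available; a normal GET will now succeed.
    obj = s3.get_object(Bucket=bucket, Key=key)
    print(obj['ContentLength'])
else:
    print(f"Object not yet restored: {restore_status}")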

Assumptions

  • A list of objects already exists in the ORCA catalog and S3 bucket.
  • Users should be able to provide some information (possibly a collectionId) to select the objects to migrate.
  • A serverless approach will be used, utilizing several Python Lambda functions, SNS/SQS, and S3 bucket event notifications.
  • The archive bucket has versioning enabled.

Use Cases

Based on discussions with the ORCA team, we came up with two use cases that the end user might be interested in:

  • Convert all the holdings or holdings with a known key prefix from Flexible Retrieval (formerly Glacier) to Deep Archive.
  • Convert specific collection from Flexible Retrieval (formerly Glacier) to Deep Archive.

The details of each use case and the relevant architecture drawings can be found in the following sections.

Migrating all ORCA holdings to Deep Archive

The migrate-all-holdings use case assumes that the end user wants to migrate all ORCA holdings, or specific collections of data that can be grouped together via a key path like /ShortName/Version within the ORCA bucket, to the DEEP ARCHIVE storage type. The most efficient and cost-effective approach for this use case would be to have the operator perform the migration steps manually, as outlined below.

The steps for performing the migration are as follows:

  1. Configure the ORCA default storage type or the collection-level storage type override to DEEP ARCHIVE. See this documentation for additional details.
  2. Set a lifecycle configuration on the archive bucket with a transition of 0 days, either from the AWS console or via Terraform as shown in the code example below.
note

The following S3 lifecycle rule will move all objects under the prefix tmp/ to Deep Archive. For more information on S3 lifecycle policies, see this terraform documentation.

resource "aws_s3_bucket_lifecycle_configuration" "bucket-config" {
bucket = aws_s3_bucket.bucket.bucket

rule {
id = "transition to deep archive"
status = "Enabled"

transition {
days = 0
storage_class = "DEEP_ARCHIVE"
}
filter {
prefix = "tmp/"
}
}

tip

To migrate to a different storage class if needed, change storage_class under the transition block in the terraform code above to the desired class.

  3. Wait up to 48 hours for all objects to transition to Deep Archive.
  4. Update the storage type in the ORCA catalog to the value of DEEP ARCHIVE. Three approaches that can accomplish this are listed below; a minimal sketch of the recommended approach follows this list.
  • (Recommended) Configure an event notification on the archive bucket that triggers a Lambda function for every s3:LifecycleTransition event in the bucket. This Lambda automatically updates the storage class in the ORCA catalog status table to DEEP ARCHIVE at the object level.
  • Manually run a temporary Lambda function that updates the storage class in the catalog status table to DEEP ARCHIVE after a set amount of time (48+ hours).
  • Manually update the table column in the database using SQL.
  5. Validate internal reconciliation for mismatches in storage class.
  6. If there are no mismatches, disable and revert the changes to the lifecycle rules.
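A minimal sketch of the recommended Lambda handler is shown below, assuming the bucket notification is delivered directly to the Lambda and a Postgres-backed catalog. The table and column names (files, key_path, storage_class) and the environment-variable based connection are illustrative assumptions, not the actual ORCA schema or database utilities.

import os

import psycopg2  # assumed to be packaged with the Lambda; ORCA's own database layer could be used instead


def handler(event, context):
    # Triggered by s3:LifecycleTransition bucket notifications.
    # Updates the storage class recorded in the catalog for each transitioned object.
    connection = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    try:
        with connection, connection.cursor() as cursor:
            for record in event.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                # Illustrative table/column names; replace with the real ORCA catalog schema.
                cursor.execute(
                    "UPDATE files SET storage_class = %s WHERE key_path = %s",
                    ("DEEP_ARCHIVE", key),
                )
                print(f"Marked s3://{bucket}/{key} as DEEP_ARCHIVE in the catalog")
    finally:
        connection.close()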

A detailed sequence diagram that goes into more depth on the steps and actions needed for manually migrating data to DEEP ARCHIVE is shown below. The sequence diagram uses the recommended steps and approaches called out above.




The following is a container level diagram of the proposed architecture for the recommended manual method.




note

The diagrams above are the initial thoughts for how a migration may be implemented. Changes to the workflow and architecture are likely as the migration is prototyped and validated.

Migrating specific ORCA holdings to Deep Archive

It is expected that in certain instances, a DAAC operator will not be able to use the manual approach because of configurations, key path groupings, data selectivity or other factors. In order to be more selective about the data that is migrated to DEEP ARCHIVE, the following solution is proposed.

Generally speaking, the migration of data would be initiated by an operator through the Cumulus console, and two steps would occur. First, the data to be migrated would be restored from the current Flexible Retrieval (formerly Glacier) holdings. Second, the recovered data would be copied back to the ORCA bucket in the DEEP ARCHIVE storage class. The steps below describe the general flow of events and how the user interacts during migration.

  1. An operator updates the collection configuration to use the DEEP_ARCHIVE storage type for ORCA. See this documentation for additional details.
  2. A user inputs a collectionId via the Cumulus console batch execution screens and executes the migration recovery code. Note that PyLOT could also be used to perform this action.
  3. The migration recovery code queries the ORCA catalog and retrieves a list of files in ORCA that are associated with the passed parameters. Once the list of files is received, the code will use Bulk Retrieval to restore the files in the ORCA bucket. The code will also create and update a migration job to track the status of the migration. A sketch of the catalog lookup is shown after this list.
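As a rough sketch of the catalog lookup step, assuming a SQL-based catalog with granules and files tables keyed by collection_id (illustrative names, not the actual ORCA schema), the query might resemble the following. The returned keys would then feed the bulk retrieval loop shown in the next example.

from typing import List


def get_collection_file_keys(cursor, collection_id: str) -> List[str]:
    # `cursor` is an open database cursor (e.g. psycopg2) against the ORCA catalog.
    # The table and column names below are illustrative assumptions only.
    cursor.execute(
        """
        SELECT files.key_path
        FROM files
        JOIN granules ON granules.id = files.granule_id
        WHERE granules.collection_id = %s
        """,
        (collection_id,),
    )
    return [row[0] for row in cursor.fetchall()]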

A potential implementation of this code could utilize Lambda functions, a Docker-based Fargate implementation, or AWS Batch. The implementation should take into account the number of bulk requests expected to be made, along with recovery capabilities and timeout limits. In the case of a Lambda, a timeout of 15 minutes may be too short to perform a large number of Bulk Retrieval requests; Fargate or AWS Batch may provide a better option. Depending on use and performance, additional research may be needed on the implementation of this code. It is recommended that the initial implementation utilize Fargate and Docker.

An example of initiating AWS bulk retrieval of objects using Python and the boto3 client for S3 can be seen below.

import boto3

# this objects list should be provided by the user
retrieve_objects = ['test1.xml', 'test2.xml']

s3 = boto3.client('s3')
for retrieve_object in retrieve_objects:
    try:
        response = s3.restore_object(
            Bucket='orca-rhassan-sandbox-glacier',  # bucket that contains the objects currently in Glacier Flexible Retrieval
            Key=retrieve_object,
            RestoreRequest={
                'Days': 30,
                'GlacierJobParameters': {
                    'Tier': 'Bulk',
                },
            },
        )
        print(response)
    except Exception as ex:
        print(f"{ex} for {retrieve_object}")

The resulting output from the boto3 restore_object command can be seen below.

{'ResponseMetadata': {'RequestId': '61ZWC27NHSFNMYJT', 'HostId': 'ItKZaYvn9rfi6ZVYmUjfZX24mKJlDZlciT9MaKK0aHiNuLWB3Pt8WSEMNt5yBK+5u+MJfzjFDfU=', 'HTTPStatusCode': 202, 'HTTPHeaders': {'x-amz-id-2': 'ItKZaYvn9rfi6ZVYmUjfZX24mKJlDZlciT9MaKK0aHiNuLWB3Pt8WSEMNt5yBK+5u+MJfzjFDfU=', 'x-amz-request-id': '61ZWC27NHSFNMYJT', 'date': 'Sun, 24 Jul 2022 18:01:33 GMT', 'server': 'AmazonS3', 'content-length': '0'}, 'RetryAttempts': 0}}
{'ResponseMetadata': {'RequestId': 'PNN4XXA5SYTBBRKG', 'HostId': 'HgiIiqfb0hZPHgZml98VCyLLfD5LyQ8pSLpgGN6hwUUMtmcPZoo/ACbCL1rXz+pXZ4Ce2UEe34s=', 'HTTPStatusCode': 202, 'HTTPHeaders': {'x-amz-id-2': 'HgiIiqfb0hZPHgZml98VCyLLfD5LyQ8pSLpgGN6hwUUMtmcPZoo/ACbCL1rXz+pXZ4Ce2UEe34s=', 'x-amz-request-id': 'PNN4XXA5SYTBBRKG', 'date': 'Sun, 24 Jul 2022 18:01:34 GMT', 'server': 'AmazonS3', 'content-length': '0'}, 'RetryAttempts': 0}}

For more information on using the restore functionality in Python, see the existing ORCA restore_object function.

Another option is to look into S3 Batch Operations, which can perform a single operation on a list of Amazon S3 objects; a single job can perform the specified operation on billions of objects containing exabytes of data.
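As a rough, hedged sketch of what initiating a bulk restore through S3 Batch Operations might look like with boto3, see below. The account ID, bucket names, manifest object, ETag, and IAM role ARN are placeholders, not values from this project.

import boto3

s3control = boto3.client('s3control')

# All identifiers below (account ID, role ARN, bucket and manifest names) are placeholders.
response = s3control.create_job(
    AccountId='111122223333',
    ConfirmationRequired=True,
    Description='Bulk restore of ORCA objects prior to Deep Archive migration',
    Priority=10,
    RoleArn='arn:aws:iam::111122223333:role/example-batch-operations-role',
    # Initiate a Bulk-tier restore for every object listed in the manifest.
    Operation={
        'S3InitiateRestoreObject': {
            'ExpirationInDays': 30,
            'GlacierJobTier': 'BULK',
        }
    },
    # CSV manifest of bucket,key pairs previously uploaded to S3.
    Manifest={
        'Spec': {
            'Format': 'S3BatchOperations_CSV_20180820',
            'Fields': ['Bucket', 'Key'],
        },
        'Location': {
            'ObjectArn': 'arn:aws:s3:::example-manifest-bucket/manifests/restore.csv',
            'ETag': 'example-manifest-etag',
        },
    },
    # Completion report listing any failed tasks.
    Report={
        'Bucket': 'arn:aws:s3:::example-manifest-bucket',
        'Format': 'Report_CSV_20180820',
        'Enabled': True,
        'Prefix': 'batch-reports',
        'ReportScope': 'FailedTasksOnly',
    },
)
print(response['JobId'])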

  4. Once bulk retrieval of the files is complete, typically 5-12 hours from the request time, the bucket event notification for the ObjectRestore:Completed action will trigger code that copies the restored object back to the ORCA bucket with the DEEP ARCHIVE storage type. The code will then update the ORCA catalog storage type for the object to Deep Archive as well as the migration job status. Note that the currently existing restore workflow trigger should be disabled during the whole migration process to prevent triggering the post_copy_request_to_queue lambda, and re-enabled when migration is complete.

Additional options to consider when restoring and copying files include reusing the ORCA recovery workflow tasks and code to perform the migration, or utilizing S3 Batch Operations for copying the data. Additional research should be done prior to final implementation of the use case to better validate these options.

Additional items to consider when implementing this step are the bucket policy and the use of boto3 copy_object. Make sure the S3 bucket policy grants permission for the s3:PutObject* action; otherwise, the user will see an access denied error. See this policy for additional details.

The boto3 copy_object function can be used in the Python lambda function as shown below.

import boto3

s3 = boto3.client('s3')
# this list should be provided by the user
retrieve_objects = ['test1.xml', 'test2.xml']
bucket_name = 'orca-rhassan-sandbox-glacier'
for migrate_object in retrieve_objects:
    try:
        response = s3.copy_object(
            Bucket=bucket_name,  # bucket name where the object is stored currently
            CopySource={'Bucket': bucket_name, 'Key': migrate_object},
            Key=migrate_object,  # key name of the object after migration
            StorageClass='DEEP_ARCHIVE'
        )
        print(response)
        print(f"copy succeeded for object {migrate_object} having E tag " + response["CopyObjectResult"]["ETag"])
    except Exception as ex:
        print(ex)

The resulting output from the boto3 copy_object command can be seen below.

{'ResponseMetadata': {'RequestId': 'EM3XDZ6BB7FJJ88W', 'HostId': '/HayQvl8E2cCsclvO0b4MQUHAJnf7exmJhCILDsXfCieuaJIrAt5ZD9ocOkaam6rQbgXt0qx18A=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '/HayQvl8E2cCsclvO0b4MQUHAJnf7exmJhCILDsXfCieuaJIrAt5ZD9ocOkaam6rQbgXt0qx18A=', 'x-amz-request-id': 'EM3XDZ6BB7FJJ88W', 'date': 'Sun, 24 Jul 2022 18:13:32 GMT', 'x-amz-storage-class': 'DEEP_ARCHIVE', 'content-type': 'application/xml', 'server': 'AmazonS3', 'content-length': '234'}, 'RetryAttempts': 0}, 'CopyObjectResult': {'ETag': '"66d8119b127d165f8bc66fa66e0e35a6"', 'LastModified': datetime.datetime(2022, 7, 24, 18, 13, 32, tzinfo=tzutc())}}
copy succeeded for object test1.xml having E tag "66d8119b127d165f8bc66fa66e0e35a6"
{'ResponseMetadata': {'RequestId': 'SA312G5B2RRN5P0C', 'HostId': '/YIl4aUR+CAov4/pxzzd8oVORmYYsewTbwZfnVDkWRa+8g/epeQ7KEZGqSVgcx5LheMhU/dXMfw=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '/YIl4aUR+CAov4/pxzzd8oVORmYYsewTbwZfnVDkWRa+8g/epeQ7KEZGqSVgcx5LheMhU/dXMfw=', 'x-amz-request-id': 'SA312G5B2RRN5P0C', 'date': 'Sun, 24 Jul 2022 18:13:33 GMT', 'x-amz-storage-class': 'DEEP_ARCHIVE', 'content-type': 'application/xml', 'server': 'AmazonS3', 'content-length': '234'}, 'RetryAttempts': 0}, 'CopyObjectResult': {'ETag': '"14694ab9e3089a7c29e0dc9add4a82ab"', 'LastModified': datetime.datetime(2022, 7, 24, 18, 13, 33, tzinfo=tzutc())}}
copy succeeded for object test2.xml having E tag "14694ab9e3089a7c29e0dc9add4a82ab"

A detailed sequence diagram that goes into more depth on the steps and actions needed for migrating data to DEEP ARCHIVE is shown below. The sequence diagram uses the recommended steps and approaches called out above.

The following is a container level diagram of the proposed architecture for the collection specific use case.

note

The diagrams above are the initial thoughts for how a migration may be implemented. Changes to the workflow and architecture are likely as the migration is prototyped and validated.

Cards created

Sources
