Multipart Chunksize Research Notes

Overview

copy_to_archive uses a copy command that has a chunk size for multipart transfers. We currently use the default value of 8 MB, which will cause problems when transferring large files, some of which exceed 120 GB.
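To put numbers on the concern, the sketch below estimates how many parts a transfer is split into at a given chunk size. It assumes S3's limit of 10,000 parts per multipart upload and binary units (1 MB = 1024 * 1024 bytes). The helper names are illustrative only, and the part limit is offered as one likely contributor, not as the confirmed cause of the problems mentioned above.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB
S3_MAX_PARTS = 10_000  # S3 limit on the number of parts in one multipart upload


def parts_needed(file_size_bytes: int, chunk_size_bytes: int) -> int:
    """Number of parts a transfer of this size is split into (illustrative helper)."""
    return math.ceil(file_size_bytes / chunk_size_bytes)


def min_chunk_size_mb(file_size_bytes: int, max_parts: int = S3_MAX_PARTS) -> int:
    """Smallest whole-MB chunk size that keeps the part count within max_parts."""
    return math.ceil(file_size_bytes / (max_parts * MB))


if __name__ == "__main__":
    size = 120 * GB
    print(parts_needed(size, 8 * MB))  # 15360 parts at the 8 MB default
    print(min_chunk_size_mb(size))     # 13 MB keeps a 120 GB file under 10,000 parts
```

Under these assumptions, a 120 GB file at 8 MB per part would need more parts than the limit allows, which is why a larger, configurable chunk size is attractive.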

Implementation Details

  • Docs for the copy command mention a Config parameter of type TransferConfig.
  • Docs for TransferConfig state that it has a property multipart_chunksize, the partition size of each part for a multipart transfer.
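  • Given the above, we can modify the s3.copy call to
    from boto3.s3.transfer import TransferConfig

    MB = 1024 * 1024  # bytes per megabyte

    s3.copy(
        copy_source,
        destination_bucket,
        destination_key,
        ExtraArgs={
            'StorageClass': 'GLACIER',
            'MetadataDirective': 'COPY',
            # Carry over the content type of the source object.
            'ContentType': s3.head_object(Bucket=source_bucket_name, Key=source_key)['ContentType'],
        },
        # multipart_chunksize is given in bytes, so convert from MB.
        Config=TransferConfig(multipart_chunksize=multipart_chunksize_mb * MB)
    )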
  • This will require a variable passed into the lambda.
    • Could be set at the collection level under config['collection']['s3MultipartChunksizeMb'], with a default value in the lambdas/main.tf entry for copy_to_archive defined as
      environment {
        variables = {
          ORCA_DEFAULT_BUCKET             = var.orca_default_bucket,
          DEFAULT_ORCA_COPY_CHUNK_SIZE_MB = var.orca_copy_chunk_size
        }
      }
    • Could also be an overall environment variable, though this is less flexible. In the lambdas/main.tf entry for copy_to_archive this would look like
      environment {
        variables = {
          ORCA_DEFAULT_BUCKET     = var.orca_default_bucket,
          ORCA_COPY_CHUNK_SIZE_MB = var.orca_copy_chunk_size
        }
      }
  • The above should be added to other TF files such as terraform.tfvars, orca/main.tf, orca/variables.tf, and lambdas/variables.tf as well as documentation.
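
On the Lambda side, the collection-level option could be resolved roughly as follows. This is a minimal sketch, not the ORCA implementation: the function name get_multipart_chunksize_bytes and the environ parameter are hypothetical, while the config key and the DEFAULT_ORCA_COPY_CHUNK_SIZE_MB variable name are taken from the notes above.

```python
import os
from typing import Any, Mapping, Optional

MB = 1024 * 1024
COLLECTION_KEY = "s3MultipartChunksizeMb"            # key proposed in these notes
DEFAULT_ENV_VAR = "DEFAULT_ORCA_COPY_CHUNK_SIZE_MB"  # env var from the Terraform snippet


def get_multipart_chunksize_bytes(
    config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Hypothetical helper: prefer the collection setting, else the Lambda default."""
    env = os.environ if environ is None else environ
    collection = config.get("collection") or {}
    value = collection.get(COLLECTION_KEY)
    if value is None:
        # Raises KeyError if the default was never configured in Terraform.
        value = env[DEFAULT_ENV_VAR]
    chunk_mb = int(value)
    if chunk_mb <= 0:
        raise ValueError(f"chunk size must be positive, got {chunk_mb}")
    return chunk_mb * MB
```

The result is in bytes and could be passed directly as multipart_chunksize to TransferConfig. Passing environ explicitly keeps the helper easy to unit test without touching the real process environment.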

Future Direction

  • Recommend adding the environment variable ORCA_COPY_CHUNK_SIZE_MB to TF and Lambda.
    • It may be worth waiting so that we can use the same name as Cumulus, which is going through a similar change.
  • I have read in a couple of sources that increasing io_chunksize can also have a significant impact on performance. May be worth looking into if more improvements are desired.
    • The other TransferConfig settings, such as multipart_threshold and max_concurrency, should be considered as well.
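
If io_chunksize is explored later, both settings could be assembled in one place, as in the sketch below. The function names are hypothetical, the 256 KB default mirrors boto3's documented io_chunksize default, and whether io_chunksize affects a server-side copy at all has not been verified in these notes.

```python
KB = 1024
MB = 1024 * KB


def transfer_config_kwargs(multipart_chunksize_mb: int, io_chunksize_kb: int = 256) -> dict:
    """Hypothetical helper: convert the human-friendly units to the byte values TransferConfig expects."""
    if multipart_chunksize_mb <= 0 or io_chunksize_kb <= 0:
        raise ValueError("chunk sizes must be positive")
    return {
        "multipart_chunksize": multipart_chunksize_mb * MB,
        "io_chunksize": io_chunksize_kb * KB,
    }


def build_transfer_config(multipart_chunksize_mb: int, io_chunksize_kb: int = 256):
    """Build a boto3 TransferConfig; boto3 is imported lazily so the helper above stays dependency-free."""
    from boto3.s3.transfer import TransferConfig

    return TransferConfig(**transfer_config_kwargs(multipart_chunksize_mb, io_chunksize_kb))
```

Keeping the unit conversion in a plain function makes it straightforward to try different values and to add further settings later.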