
Multipart Chunksize Research Notes


copy_to_archive uses a copy command that has a configurable chunk size for multipart transfers. We currently use the default value of 8 MB, which will cause problems when transferring large files, some of which exceed 120 GB.
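One likely source of the problem is S3's limit of 10,000 parts per multipart upload, which caps the object size that a fixed chunk size can cover. The sketch below works through that arithmetic. The helper names are hypothetical and not part of ORCA, and it assumes 1 MB means 1024 * 1024 bytes.

```python
MB = 1024 * 1024
GB = 1024 * MB
S3_MAX_PARTS = 10_000   # S3 limit on the number of parts in one multipart upload
S3_MIN_PART_MB = 5      # S3 minimum part size (the last part may be smaller)


def max_object_size_bytes(chunk_size_mb: int) -> int:
    """Largest object that fits in S3_MAX_PARTS parts of the given size."""
    return chunk_size_mb * MB * S3_MAX_PARTS


def min_chunk_size_mb(object_size_bytes: int) -> int:
    """Smallest whole-MB chunk size that keeps an object within S3_MAX_PARTS parts."""
    bytes_per_part_budget = S3_MAX_PARTS * MB
    # Integer ceiling division avoids float rounding on very large sizes.
    needed = -(-object_size_bytes // bytes_per_part_budget)
    return max(S3_MIN_PART_MB, needed)


print(max_object_size_bytes(8) / GB)  # 78.125
print(min_chunk_size_mb(120 * GB))    # 13
```

Under these assumptions, 8 MB chunks cover objects up to about 78 GB, and a 120 GB file needs chunks of at least 13 MB.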

Implementation Details

  • Docs for the copy command mention a Config parameter of type TransferConfig.
  • Docs for TransferConfig state that it has a multipart_chunksize property, which sets the size of each part in a multipart transfer.
  • Given the above, we can modify the s3.copy command to the following (assuming MB is defined as 1024 * 1024):
    s3.copy(
        {'Bucket': source_bucket_name, 'Key': source_key},  # copy source
        destination_bucket, destination_key,
        ExtraArgs={
            'StorageClass': 'GLACIER',
            'MetadataDirective': 'COPY',
            'ContentType': s3.head_object(Bucket=source_bucket_name, Key=source_key)['ContentType'],
        },
        Config=TransferConfig(multipart_chunksize=multipart_chunksize_mb * MB)
    )
  • This will require a variable passed into the lambda.
    • Could be set at the collection level under config['collection']['s3MultipartChunksizeMb'] with a default value in the lambdas/ entry for copy_to_archive defined as
      environment {
        variables = {
          ORCA_DEFAULT_BUCKET             = var.orca_default_bucket,
          DEFAULT_ORCA_COPY_CHUNK_SIZE_MB = var.orca_copy_chunk_size
        }
      }
    • Could also be an overall environment variable, though this is less flexible. In the lambdas/ entry for copy_to_archive this would look like
      environment {
        variables = {
          ORCA_DEFAULT_BUCKET     = var.orca_default_bucket,
          ORCA_COPY_CHUNK_SIZE_MB = var.orca_copy_chunk_size
        }
      }
  • The new variable should also be added to the other TF files, such as terraform.tfvars, orca/, orca/, and lambdas/, as well as to the documentation.
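
The collection-level option with an environment default could be resolved inside the lambda roughly as follows. This is a sketch, not the ORCA implementation: the function name get_multipart_chunksize_mb and the assumed shape of the config dictionary are hypothetical, while the key s3MultipartChunksizeMb and the variable DEFAULT_ORCA_COPY_CHUNK_SIZE_MB are taken from the notes above.

```python
import os
from typing import Any, Mapping, Optional


def get_multipart_chunksize_mb(
    config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Hypothetical helper: prefer the collection-level setting, else the env default."""
    environ = os.environ if environ is None else environ
    # Assumed layout: config['collection']['s3MultipartChunksizeMb'] (may be absent).
    collection = config.get("collection") or {}
    override = collection.get("s3MultipartChunksizeMb")
    if override is not None:
        return int(override)
    # Raises KeyError if the default is not configured, which surfaces a bad deployment early.
    return int(environ["DEFAULT_ORCA_COPY_CHUNK_SIZE_MB"])


# Example with an explicit environment mapping instead of the real os.environ.
env = {"DEFAULT_ORCA_COPY_CHUNK_SIZE_MB": "250"}
print(get_multipart_chunksize_mb({"collection": {"s3MultipartChunksizeMb": 500}}, env))  # 500
print(get_multipart_chunksize_mb({"collection": {}}, env))  # 250
```

The returned value would then be multiplied by MB and passed as multipart_chunksize to TransferConfig, as in the s3.copy call above.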

Future Direction

  • Recommend adding the environment variable ORCA_COPY_CHUNK_SIZE_MB to TF and Lambda.
    • Worth waiting to use the same name as Cumulus, as they are going through a similar change.
  • I have read in a couple of sources that increasing io_chunksize can also have a significant impact on performance. May be worth looking into if more improvements are desired.
    • The other TransferConfig parameters, such as max_concurrency and multipart_threshold, should be considered as well.