# Multipart Chunksize Research Notes

## Overview
`copy_to_archive` uses a copy command that accepts a chunk size for multipart transfers. We are currently using the default value of 8 MB, which will cause problems when transferring large files, some of which exceed 120 GB.
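The failure mode is arithmetic: S3 multipart operations are limited to 10,000 parts, so the chunk size caps the largest object a single copy can handle. A minimal sanity check (plain Python; the names are illustrative):

```python
# Sanity check: the multipart chunk size caps the largest transferable object,
# because S3 multipart operations allow at most 10,000 parts.
MB = 1024 ** 2
GB = 1024 ** 3
MAX_PARTS = 10_000  # S3's documented multipart part limit

default_chunksize = 8 * MB  # boto3 TransferConfig default
print(f"Cap at 8 MB chunks: {default_chunksize * MAX_PARTS / GB:.1f} GB")  # ~78.1
```

At 8 MB per part the cap is roughly 78 GiB, well short of the 120 GB files noted above.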
## Implementation Details
- Docs for the copy command mention a `Config` parameter of type `TransferConfig`.
- Docs for `TransferConfig` state that it has a `multipart_chunksize` property.
- Given the above, we can modify the `s3.copy` command to:

  ```python
  from boto3.s3.transfer import TransferConfig

  s3.copy(
      copy_source,
      destination_bucket, destination_key,
      ExtraArgs={
          'StorageClass': 'GLACIER',
          'MetadataDirective': 'COPY',
          'ContentType': s3.head_object(Bucket=source_bucket_name, Key=source_key)['ContentType'],
      },
      Config=TransferConfig(multipart_chunksize=multipart_chunksize_mb * MB)
  )
  ```

- This will require a variable passed into the lambda; see the sketch after this list.
  - Could be set at the collection level under `config['collection']['s3MultipartChunksizeMb']`, with a default value in the `lambdas/main.tf` entry for `copy_to_archive` defined as:

    ```terraform
    environment {
      variables = {
        ORCA_DEFAULT_BUCKET = var.orca_default_bucket,
        DEFAULT_ORCA_COPY_CHUNK_SIZE_MB = var.orca_copy_chunk_size
      }
    }
    ```

  - Could also be an overall environment variable, though this is less flexible. In the `lambdas/main.tf` entry for `copy_to_archive` this would look like:

    ```terraform
    environment {
      variables = {
        ORCA_DEFAULT_BUCKET = var.orca_default_bucket,
        ORCA_COPY_CHUNK_SIZE_MB = var.orca_copy_chunk_size
      }
    }
    ```
- The above should be added to other TF files such as `terraform.tfvars`, `orca/main.tf`, `orca/variables.tf`, and `lambdas/variables.tf`, as well as to the documentation.
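Tying the options together, here is a minimal sketch of how the lambda could resolve the chunk size. The `get_chunksize_mb` helper and the exact config shape are assumptions for illustration, not the actual `copy_to_archive` implementation:

```python
import os

import boto3
from boto3.s3.transfer import TransferConfig

MB = 1024 ** 2


def get_chunksize_mb(config: dict) -> int:
    """Prefer the collection-level override; fall back to the deployment default.

    Illustrative helper: assumes the collection setting and env var names
    discussed above (s3MultipartChunksizeMb, DEFAULT_ORCA_COPY_CHUNK_SIZE_MB).
    """
    collection_value = config.get('collection', {}).get('s3MultipartChunksizeMb')
    if collection_value is not None:
        return int(collection_value)
    return int(os.environ['DEFAULT_ORCA_COPY_CHUNK_SIZE_MB'])


def copy_object(config: dict, source_bucket_name: str, source_key: str,
                destination_bucket: str, destination_key: str) -> None:
    s3 = boto3.client('s3')
    chunksize_mb = get_chunksize_mb(config)
    s3.copy(
        {'Bucket': source_bucket_name, 'Key': source_key},
        destination_bucket, destination_key,
        ExtraArgs={
            'StorageClass': 'GLACIER',
            'MetadataDirective': 'COPY',
            'ContentType': s3.head_object(
                Bucket=source_bucket_name, Key=source_key
            )['ContentType'],
        },
        Config=TransferConfig(multipart_chunksize=chunksize_mb * MB),
    )
```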
## Future Direction
- Recommend adding the environment variable `ORCA_COPY_CHUNK_SIZE_MB` to TF and the lambda.
  - It may be worth waiting so that we can use the same name as Cumulus, as they are going through a similar change.
- I have read in a couple of sources that increasing `io_chunksize` can also have a significant impact on performance. This may be worth looking into if further improvements are desired (see the sketch after this list).
  - The other `TransferConfig` parameters should be considered as well.
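As a starting point for such an experiment, both knobs are ordinary `TransferConfig` arguments. The values below are placeholders for tuning, not recommendations:

```python
from boto3.s3.transfer import TransferConfig

MB = 1024 ** 2

# Placeholder values for experimentation; io_chunksize defaults to 256 KB.
transfer_config = TransferConfig(
    multipart_chunksize=64 * MB,  # larger parts raise the 10,000-part size cap
    io_chunksize=1 * MB,          # buffer size used when reading/writing streams
)
```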