Using the CMR API and asyncio for fast CMR Queries


Summary

This tutorial demonstrates how to effectively perform queries and extract data download Uniform Resource Locators (URLs) for every Common Metadata Repository (CMR) metadata record within a NASA Earthdata collection. Two examples are shown. The first highlights making sequential requests for data URLs associated with specified collections. The second demonstrates how to leverage Python’s asyncio package to perform bulk parallel requests for the same information and highlights the increase in speed when doing so. The NASA Earthdata collections highlighted here are Harmonized Landsat Sentinel-2 Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m (HLSL30.002) and Harmonized Landsat Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m (HLSS30.002).

What is CMR?

The CMR is a metadata system that catalogs NASA’s Earth Observing System Data and Information System (EOSDIS) data and associated metadata. The CMR Application Programming Interface (API) provides programmatic search capabilities through CMR’s vast metadata holdings using various parameters and keywords. When querying NASA’s CMR, there is a limit of 1 million granules matched, with only 2000 granules returned per page. This guide shows how to search for CMR records using the CMR API and create a list of download URLs. It also shows how to leverage asynchronous, or parallel, requests to increase the speed of this process. The example below uses the Harmonized Landsat Sentinel-2 collections archived by NASA’s LP DAAC to demonstrate how to use Python’s asyncio to perform large queries against NASA’s CMR.

Objectives

  • Use the CMR API and Python to perform large queries (requests that return more than 2000 granules) against NASA’s CMR.
  • Prepare a list of URLs to access or download assets associated with those granules.
  • Utilize asynchronous/parallel requests to increase the speed of the query and list construction.

Getting Started

Import the required packages.

import requests
import math
import aiohttp
import asyncio
import time

Searching the CMR

Set the CMR API Endpoint. This is the URL that we’ll use to search through the CMR.

CMR_OPS = 'https://cmr.earthdata.nasa.gov/search' # CMR API Endpoint
url = f'{CMR_OPS}/granules'

To search the CMR we need to set our parameters. In this example we’ll narrow our search using Collection IDs, a range of dates and times, and the number of results we want to show per page. Spatial areas can also be used to narrow searches (example shown in HLS_Tutorial).
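
As a brief illustrative sketch (not executed in this tutorial), a spatial constraint can be added with the CMR bounding_box parameter, given as west,south,east,north in decimal degrees. The coordinates below are hypothetical placeholders.

# Illustrative sketch only: narrow a granule search spatially using the 'bounding_box' parameter
# The bounding box is 'west,south,east,north' in decimal degrees (values below are hypothetical)
spatial_response = requests.get(url,
                                params={
                                    'concept_id': 'C2021957657-LPCLOUD', # HLS Landsat OLI collection
                                    'bounding_box': '-105.0,39.0,-104.0,40.0', # Hypothetical area of interest
                                    'page_size': 10,
                                    },
                                headers={
                                    'Accept': 'application/json'
                                    }
                               )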

Here, we are interested in both HLS Landsat-8 and Sentinel-2 collections collected from October 17-19, 2021. Specify the collections to search, set a datetime_range, and set the quantity of results to return per page using the page_size parameter, as shown below.

collections = ['C2021957657-LPCLOUD', 'C2021957295-LPCLOUD'] # Collection or concept_id specific to LPDAAC Products (HLS Landsat OLI and HLS Sentinel-2 respectively) 
datetime_range = '2021-10-17T00:00:00Z,2021-10-19T23:59:59Z'
page_size = 2000

A CMR search can find up to 1 million items or granules, but the number returned per page is limited to 2000, meaning large searches may have several pages of results. By default, page_size is set to 10.
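
For instance, assuming a hypothetical search that reports 4500 hits, paging through all of them at a page_size of 2000 would take three requests:

# Worked example of the paging arithmetic (4500 hits is a hypothetical value)
hypothetical_hits = 4500
math.ceil(hypothetical_hits / page_size) # 4500 / 2000 rounds up to 3 pages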

Submitting Requests

Using the above search criteria we can make a request using the requests.get() function. Submit a request and print the response.status_code.

response = requests.get(url, 
                        params={
                            'concept_id': collections,
                            'temporal': datetime_range,
                            'page_size': page_size,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
print(response.status_code)

A status code of 200 indicates the request has succeeded.
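
If you prefer an exception on failure rather than checking the code manually, the requests library's raise_for_status() method raises an HTTPError for any 4xx/5xx response:

# Optional: raise an exception for any 4xx/5xx status instead of inspecting status_code manually
response.raise_for_status()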

To see the number of results, print the CMR-Hits field found in the response headers.

print(response.headers['CMR-Hits']) # Resulting quantity of granules/items.

Building a List of File URLs

We can build a list of URLs to data assets using our search results. Notice this only uses the first page of results.

granules = response.json()['feed']['entry']
len(granules) # Resulting quantity of granules on page one.
file_list = []
for g in granules:
    file_list.extend([x['href'] for x in g['links'] if 'https' in x['href'] and '.tif' in x['href']])
len(file_list) # Total number of assets from page one of granules.

Print part of the URLs list.

file_list[:25]

This process can be extended to all pages of search results to build a complete list of asset URLs.

Creating a List from Multiple Results Pages

To create a list from multiple results pages, we first define a function to calculate the total number of results pages based upon the number of hits and the page size.

def get_page_total(collections, datetime_range, page_size):
    hits = requests.get(url, 
                        params={
                            'concept_id': collections,
                            'temporal': datetime_range,
                            'page_size': page_size,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       ).headers['CMR-Hits']
    return math.ceil(int(hits)/page_size)

Then we build a list of page numbers called page_numbers.

page_numbers = list(range(1, get_page_total(collections, datetime_range, page_size)+1))
page_numbers

After we have a list of page numbers, we can iterate through the results page by page to make a complete list of assets matching our search.

data_urls = [] # empty list
start = time.time() # Begin timer
for n in page_numbers: # Iterate through requests page by page sequentially
    print(f'Page: {n}') # Print Page Number
    response = requests.get(url, # Same request function as used previously
                            params={
                                'concept_id': collections,
                                'temporal': datetime_range,
                                'page_size': page_size,
                                'page_num': n
                            },
                            headers={
                                'Accept': 'application/json'
                            }
                           )
    print(f'Page {n} Response Code: {response.status_code}') # Show the response code for each page
    
    granules = response.json()['feed']['entry']
    print(f'Number of Granules: {len(granules)}') # Show the number of granules on each page
    
    for g in granules:
        data_urls.extend([x['href'] for x in g['links'] if 'https' in x['href'] and '.tif' in x['href']])
end = time.time()
print(f'Total time: {end-start}') # Record the total time taken

Show the total quantity of assets in our list matching search parameters.

len(data_urls)

We can also confirm that the first 25 assets match those from our first-page-only search.

file_list[:25]==data_urls[:25]

Improving Speed Using Asynchronous Requests

You may have noticed how long the sequential loop above took to run. For searches with a large quantity of results, we can query and build a list of asset URLs more quickly by utilizing asynchronous requests. Asynchronous requests can be run concurrently, or in parallel, which typically decreases the total time because a subsequent request does not wait for the prior request’s response before it is made. This time we’ll use a similar approach to before, except we will build a list of results-page URLs that can be used in asynchronous requests to populate our list of asset URLs more quickly.
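
If asyncio is new to you, the minimal standalone sketch below (independent of CMR, using asyncio.sleep to stand in for network latency) illustrates how asyncio.gather runs several awaitable tasks concurrently rather than one after another.

# Minimal, standalone illustration of concurrency with asyncio.gather (not part of the CMR workflow)
async def fake_request(n):
    await asyncio.sleep(1) # Stand-in for one second of network latency
    return n

async def demo():
    # Three one-second "requests" finish in roughly one second total, not three
    return await asyncio.gather(*[fake_request(n) for n in range(3)])

# In a Jupyter notebook this can be awaited directly: await demo()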

First we define a new function, get_cmr_pages_urls(), to create a list of results-page URLs (not just the page numbers as before), then build that list.

def get_cmr_pages_urls(collections, datetime_range, page_size): 
    response = requests.get(url,
                       params={
                           'concept_id': collections,
                           'temporal': datetime_range,
                           'page_size': page_size,
                       },
                       headers={
                           'Accept': 'application/json'
                       }
                      )
    hits = int(response.headers['CMR-Hits'])
    n_pages = math.ceil(hits/page_size)
    cmr_pages_urls = [f'{response.url}&page_num={x}'.replace('granules?', 'granules.json?') for x in list(range(1,n_pages+1))]
    return cmr_pages_urls
urls = get_cmr_pages_urls(collections, datetime_range, page_size)
urls

Next, we create an empty list to populate with our asset URLs.

results = []

Then we define a function get_tasks() to build a list of tasks for each page number URL and a function get_url() to make the requests for each page in parallel with one another.

def get_tasks(session):
    tasks = []
    for l in urls:
        tasks.append(session.get(l))
    return tasks
async def get_url():
    async with aiohttp.ClientSession() as session:
        tasks = get_tasks(session)
        responses = await asyncio.gather(*tasks)
        for response in responses:
            res = await response.json()
            #print(res)
            results.extend([l['href'] for g in res['feed']['entry'] for l in g['links'] if 'https' in l['href'] and '.tif' in l['href']])

Run the functions to submit asynchronous/parallel requests for each page of results.

start = time.time() 

await get_url()

end = time.time()

total_time = end - start
total_time

Much faster than before! We can see the same quantity of results and that a subsample of the resulting asset URLs matches what we retrieved before.

len(results)
data_urls[2025:2125] == results[2025:2125]
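
As an optional final step (a simple sketch; the file name below is arbitrary), the list of asset URLs can be written to a text file for use with a download tool of your choice.

# Optional: save the asset URLs to a text file (file name is arbitrary) for later download
with open('hls_asset_urls.txt', 'w') as f:
    f.write('\n'.join(results))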

Additional Resources

Contact Information

Authors: LP DAAC¹
Contact: LPDAAC@usgs.gov
Voice: +1-866-573-3222
Organization: Land Processes Distributed Active Archive Center (LP DAAC)
Website: https://lpdaac.usgs.gov/
Date last modified: 01-25-2024

¹Work performed under USGS contract G15PD00467 for LP DAAC under NASA contract NNG14HH33I.