Using the CMR API and asyncio for fast CMR Queries
Summary
This tutorial demonstrates how to effectively perform queries and extract data download Uniform Resource Locators (URLs) for every Common Metadata Repository (CMR) metadata record within a NASA Earthdata collection. Two examples are shown. The first highlights making sequential requests for data URLs associated with specified collections. The second demonstrates how to leverage Python’s asyncio package to perform bulk parallel requests for the same information and highlights the increase in speed when doing so. The NASA Earthdata collections highlighted here are Harmonized Landsat Sentinel-2 Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m (HLSL30.002) and Harmonized Landsat Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m (HLSS30.002).
What is CMR?
The CMR is a metadata system that catalogs NASA’s Earth Observing System Data and Information System (EOSDIS) data and associated metadata. The CMR Application Programming Interface (API) provides programmatic search capabilities through CMR’s vast metadata holdings using various parameters and keywords. When querying NASA’s CMR, there is a limit of 1 million granules matched, with only 2000 granules returned per page. This guide shows how to search for CMR records using the CMR API and create a list of download URLs. It also shows how to leverage asynchronous, or parallel, requests to increase the speed of this process. The example below leverages the Harmonized Landsat Sentinel-2 collections archived by NASA’s LP DAAC to demonstrate how to use Python’s asyncio to perform large queries against NASA’s CMR.
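For orientation, the minimal sketch below issues a single CMR granule search and reads the CMR-Hits response header. The short_name and page_size values are illustrative only; the actual workflow is built step by step in the rest of this tutorial.

```python
import requests

# Minimal, illustrative CMR granule search (not part of the tutorial workflow).
r = requests.get('https://cmr.earthdata.nasa.gov/search/granules',
                 params={'short_name': 'HLSL30',   # assumed collection short name, for illustration
                         'page_size': 10},         # CMR's default page size
                 headers={'Accept': 'application/json'})

print(r.status_code)          # 200 indicates success
print(r.headers['CMR-Hits'])  # total matching granules, which may span many pages
```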
Objectives
- Use the CMR API and Python to perform large queries (requests that return more than 2000 granules) against NASA’s CMR.
- Prepare a list of URLs to access or download assets associated with those granules.
- Utilize asynchronous/parallel requests to increase the speed of the query and list construction.
Getting Started
Import the required packages.
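```python
import requests
import math
import aiohttp
import asyncio
import time
```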
Searching the CMR
Set the CMR API Endpoint. This is the URL that we’ll use to search through the CMR.
```python
CMR_OPS = 'https://cmr.earthdata.nasa.gov/search' # CMR API Endpoint
url = f'{CMR_OPS}/{"granules"}'
```
To search the CMR we need to set our parameters. In this example we’ll narrow our search using Collection IDs, a range of dates and times, and the number of results we want to show per page. Spatial areas can also be used to narrow searches (example shown in HLS_Tutorial).
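As a brief illustration of the spatial option mentioned above, a bounding box can be added to the request parameters via CMR's bounding_box parameter. The coordinates below are hypothetical and are not used in this tutorial's search.

```python
# Illustrative only: bounding_box takes west,south,east,north in decimal degrees
# and can be included alongside the other search parameters used below.
spatial_params = {
    'concept_id': 'C2021957657-LPCLOUD',        # HLSL30 v2.0 (same collection used below)
    'bounding_box': '-105.0,39.0,-104.0,40.0',  # hypothetical area of interest
    'page_size': 10,
}
# spatial_params would be passed as params= in requests.get(), just like the examples below.
```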
Here, we are interested in both HLS Landsat-8 and Sentinel-2 collections collected from October 17-19, 2021. Specify the collections to search, set a datetime_range, and set the quantity of results to return per page using the page_size parameter, as shown below.
```python
collections = ['C2021957657-LPCLOUD', 'C2021957295-LPCLOUD'] # Collection or concept_id specific to LPDAAC Products (HLS Landsat OLI and HLS Sentinel-2 respectively)
datetime_range = '2021-10-17T00:00:00Z,2021-10-19T23:59:59Z'
page_size = 2000
```
A CMR search can find up to 1 million items or granules, but the number returned per page is limited to 2000, meaning large searches may span several pages of results. For example, a search matching 5,000 granules with page_size set to 2000 returns three pages. By default, page_size is set to 10.
Submitting Requests
Using the above search criteria we can make a request using the requests.get() function. Submit a request and print the response.status_code.
```python
response = requests.get(url,
                        params={
                            'concept_id': collections,
                            'temporal': datetime_range,
                            'page_size': page_size,
                        },
                        headers={
                            'Accept': 'application/json'
                        }
                       )
print(response.status_code)
```
A status code of 200 indicates the request has succeeded.
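If you would rather raise an exception than inspect the code manually, requests provides a built-in check. This is optional and not used in the rest of the tutorial.

```python
response.raise_for_status()  # raises requests.HTTPError for any 4xx/5xx response
```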
To see the number of results, print the CMR-Hits value found in the returned header.
```python
print(response.headers['CMR-Hits']) # Resulting quantity of granules/items.
```
Building a List of File URLs
We can build a list of URLs to data assets using our search results. Notice this only uses the first page of results.
```python
granules = response.json()['feed']['entry']
len(granules) # Resulting quantity of granules on page one.

file_list = []
for g in granules:
    file_list.extend([x['href'] for x in g['links'] if 'https' in x['href'] and '.tif' in x['href']])
len(file_list) # Total number of assets from page one of granules.
```
Print part of the URLs list.
```python
file_list[:25]
```
This process can be extended to all pages of search results to build a complete list of asset URLs.
Creating a List from Multiple Results Pages
To create a list from multiple results pages, we first define a function to build a list of pages based upon the number of results.
```python
def get_page_total(collections, datetime_range, page_size):
    hits = requests.get(url,
                        params={
                            'concept_id': collections,
                            'temporal': datetime_range,
                            'page_size': page_size,
                        },
                        headers={
                            'Accept': 'application/json'
                        }
                       ).headers['CMR-Hits']
    return math.ceil(int(hits)/page_size)
```
Then we build a list of pages called page_numbers.
```python
page_numbers = list(range(1, get_page_total(collections, datetime_range, page_size)+1))
page_numbers
```
After we have a list of pages we can iterate through page by page to make a complete list of assets matching our search.
```python
data_urls = [] # Empty list to hold asset URLs
start = time.time() # Begin timer
for n in page_numbers: # Iterate through requests page by page sequentially
    print(f'Page: {n}') # Print page number
    response = requests.get(url, # Same request function as used previously
                            params={
                                'concept_id': collections,
                                'temporal': datetime_range,
                                'page_size': page_size,
                                'page_num': n
                            },
                            headers={
                                'Accept': 'application/json'
                            }
                           )
    print(f'Page {n} Response Code: {response.status_code}') # Show the response code for each page
    granules = response.json()['feed']['entry']
    print(f'Number of Granules: {len(granules)}') # Show the number of granules on each page
    for g in granules:
        data_urls.extend([x['href'] for x in g['links'] if 'https' in x['href'] and '.tif' in x['href']])
end = time.time()
print(f'Total time: {end-start}') # Report the total time taken
```
Show the total quantity of assets in our list matching search parameters.
```python
len(data_urls)
```
We can also see that the first 25 assets match those from our first-page-only search results.
```python
file_list[:25] == data_urls[:25]
```
Improving Speed Using Asynchronous Requests
You may have noticed the total time the function above took to run. For searches with a large quantity of results, we can query and build a list of asset URLs more quickly by utilizing asynchronous requests. Asynchronous requests can be run concurrently or in parallel, which typically decreases the total time of operations because a response is not needed for the prior request before a subsequent request is made. This time we’ll use a similar approach as before, except we will build a list of page URLs that can be used in asynchronous requests to populate our list of asset URLs more quickly.
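To make the benefit concrete before applying it to CMR, the self-contained sketch below (illustrative only, using simulated one-second requests) shows asyncio.gather completing three waits in roughly one second rather than three.

```python
import asyncio
import time

async def fake_request(n):
    await asyncio.sleep(1)   # stand-in for network latency
    return n

async def demo():
    start = time.time()
    results = await asyncio.gather(*(fake_request(n) for n in range(3)))
    print(results, f'completed in {time.time() - start:.1f} s')  # ~1 s, not ~3 s

# In a notebook: await demo()
# In a script:   asyncio.run(demo())
```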
First we define a new function get_cmr_pages_urls() to create a list of results page URLs, not just the page numbers as we did before, then build that list.
```python
def get_cmr_pages_urls(collections, datetime_range, page_size):
    response = requests.get(url,
                            params={
                                'concept_id': collections,
                                'temporal': datetime_range,
                                'page_size': page_size,
                            },
                            headers={
                                'Accept': 'application/json'
                            }
                           )
    hits = int(response.headers['CMR-Hits'])
    n_pages = math.ceil(hits/page_size)
    cmr_pages_urls = [f'{response.url}&page_num={x}'.replace('granules?', 'granules.json?') for x in list(range(1, n_pages+1))]
    return cmr_pages_urls
```
```python
urls = get_cmr_pages_urls(collections, datetime_range, page_size)
urls
```
Next, we create an empty list to populate with our asset URLs.
```python
results = []
```
Then we define a function get_tasks() to build a list of request tasks, one for each page URL, and a function get_url() to make those requests in parallel with one another.
```python
def get_tasks(session):
    tasks = []
    for l in urls:
        tasks.append(session.get(l))
    return tasks

async def get_url():
    async with aiohttp.ClientSession() as session:
        tasks = get_tasks(session)
        responses = await asyncio.gather(*tasks)
        for response in responses:
            res = await response.json()
            # print(res)
            results.extend([l['href'] for g in res['feed']['entry'] for l in g['links'] if 'https' in l['href'] and '.tif' in l['href']])
```
Run the functions to submit asynchronous/parallel requests for each page of results.
```python
start = time.time()

await get_url()

end = time.time()

total_time = end - start
total_time
```
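Note that the top-level await above works in a Jupyter/IPython session, which already runs an event loop. In a standalone Python script you would start the loop yourself, for example:

```python
# In a plain script (no running event loop), run the coroutine like this instead:
asyncio.run(get_url())
```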
Much faster than before! We can see the same quantity of results and that a subsample of the resulting asset URLs matches what we retrieved before.
```python
len(results)
```

```python
data_urls[2025:2125] == results[2025:2125]
```
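Because asyncio.gather returns responses in the order the tasks were submitted, the two lists should normally match element for element. An optional, order-independent sanity check is to compare the sorted lists:

```python
sorted(data_urls) == sorted(results)  # True if both approaches found the same set of assets
```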
Additional Resources
Contact Information
Authors: LP DAAC¹
Contact: LPDAAC@usgs.gov
Voice: +1-866-573-3222
Organization: Land Processes Distributed Active Archive Center (LP DAAC)
Website: https://lpdaac.usgs.gov/
Date last modified: 01-25-2024
¹Work performed under USGS contract G15PD00467 for LP DAAC under NASA contract NNG14HH33I.