Exploring PACE OCI L2 Surface Reflectance

In this tutorial we will work with Level 2 data from the PACE’s Ocean Color Instrument (OCI), specificaclly the Surface Reflectance data product. For information on how this data product has been created please visit: PACE OCI Surface Reflectance (SFREFL) Algorithm Theortical Basis Document (ATBD)

The OCI instrument is a hyperspectral imaging radiometer, collecting hyperspectral measurements from 340 nm - 895 nm (Ultra Violet (UV) to near-infrared-NIR), and multi-spectral measurements from 940-2260 nm (NIR to shortwave-infrared).

This tutorial leverages code from Ocean Color OB DAAC Training Tutorials and NASA LP DAAC VITALS tutorials

Learning Objectives

This tutorial will guide you through working with PACE OCI L2 SFREL data on a cloud platform. Upon completion of this tutorial, you will be able to:

  • Access and search for available granules by time and location.
  • Explore data structure and format.
  • Visualize data by plotting individual spectral bands.
  • Apply cloud and water flags to effectively mask out unwanted pixels.
  • Create interactive RGB maps.
  • Display, and export spectral signatures for selected locations.
  • Calculate advanced hyperspectral indices using these L2 datasets.

Contents

Setup

!conda install cf_xarray -y
# Import required libraries
import csv
import math
from io import BytesIO

import cf_xarray
import earthaccess
import rasterio as rio
import xarray as xr
import cartopy

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import holoviews as hv
import geoviews as gv
import hvplot.xarray

Authentication

earthaccess creates and leverages Earthdata Login tokens to authenticate with NASA systems. Earthdata Login tokens expire after a month. To retrieve a token from Earthdata Login, you can either enter your username and password each time you use earthaccess, or use a .netrc file. A .netrc file is a configuration file that is commonly used to store login credentials for remote systems. If you don’t have a .netrc or don’t know if you have one or not, you can use the persist argument with the login function below to create or update an existing one, then use it for authentication.

If you do not have an Earthdata Account, you can create one here.

auth = earthaccess.login(persist=True)

Searching for Collections

The PACE mission produces several collections or datasets available via the NASA Earthdata cloud archive.

To view what’s available, we can use the search_datasets function and with the keyword argument.

# Retrieve Collections
collections = earthaccess.search_datasets(keyword='PACE Surface Reflectance')
# Print Quantity of Results
print(f'Collections found: {len(collections)}')

If you print the collections object you can explore all of the json metadata.

# # Print collections
# collections

We can also create a list of the short-name, concept-id, and version of each result collection using list comprehension. These fields are important for specifying and searching for data within collections.

collections_info = [
    {
        'short_name': c.summary()['short-name'],
        'collection_concept_id': c.summary()['concept-id'],
        'version': c.summary()['version'],
        'entry_title': c['umm']['EntryTitle']
    }
    for c in collections
]
pd.set_option('display.max_colwidth', 150)
collections_info = pd.DataFrame(collections_info)
collections_info

The collection concept-id is the best way to search for data within a collection, as this is unique to each collection. The short-name can be used as well, however the version should be passed as well as there can be multiple versions available with the same short name. After finding the collection you want to search, you can use the concept-id to search for granules within that collection.

Searching for Granules

A granule can be thought of as a unique spatiotemporal grouping within a collection. To search for granules, we can use the search_data function from earthaccess and provide the arguments for our search. Its possible to specify search products using several criteria shown in the table below:

dataset origin and location spatio temporal parameters dataset metadata parameters
archive_center bounding_box concept_id
data_center temporal entry_title
daac point granule_name
provider polygon version
cloud_hosted line short_name

For this example we will search for the PACE OCI Level-2 Regional Surface Reflectance Data using a bounding box and temporal parameters, and add a cloud_cover parameter. Note that not all datasets have cloud cover information, so this parameter may not work for all datasets.

tspan = ("2025-03-01", "2025-03-20")
bbox = (-86.512, 31.70, -86.512, 34.938)
 
results = earthaccess.search_data(
    short_name="PACE_OCI_L2_SFREFL",
    version="3.0",
    temporal=tspan,
    bounding_box=bbox,
    cloud_cover=(0, 50)
)
print(f'Granules found: {len(results)}')

Results is a list, so we can use an index to view a single result.

results[0]

We can also retrieve specific metadata for a result using .keys() since this object also acts as a dictionary.

results[0].keys()
# results[0]['meta']

Feel free to explore the metadata of the results, to check if there are fields relevant for your use case.

From here, we can also retrieve the links for data using earthaccess, which we can either download or stream.

links = [result.data_links()[0] for result in results]
# Display the first 5 links
links[:5]

To access data from NASA, you’ll need to provide your Earthdata Login credentials. When streaming this can best be done using the token or cookies set up by the earthaccess library. Since we’ve already logged in, we can start an fsspec session to manage our connection to a remote file, including sending credentials. This allows other libraries to work with a URL as if it is a local file.

In addition to fsspec for this example, we’ll also use BytesIO to read the full dataset into memory. This is not always the best approach, but for working with a single PACE scene which is ~750 mb, it speeds up the process due to the data structure and some caveats regarding fsspec.

fs = earthaccess.get_fsspec_https_session()

Now, open the connection to the remote file and read into memory.

with fs.open(links[0]) as f:
    file_content = BytesIO(f.read())

Reading dataset

The PACE_OCI_L2_SFREFL files are heirarchichal netCDF files, so we will use the xarray.datatree function to read in and view all of the groups. Using the datatree representation object, we can explore the data structure, which is useful for understanding the groups, variables, dimension order, and coordinates.

datatree = xr.open_datatree(file_content, decode_timedelta=False)
datatree

It will be easier to work with the data if we create a single dataset, which we can do by merging the data and coordinate variables we need from different groups. Longitude and latitude appear as data variables, in the group navigation_data, they need to be explicitly set as coordinates. Note the group geophysical_data and its data variables: rhos and l2_flags. The rhos variable are surface reflectances, and the l2_flags are quality flags as defined by the Ocean Biology Processing Group (OBPG).

ds = xr.merge(
    (
        datatree.ds,
        datatree["geophysical_data"].ds[["rhos", "l2_flags"]],
        datatree["sensor_band_parameters"].coords,
        datatree["navigation_data"].ds.set_coords(("longitude", "latitude")).coords,
    )
)
ds

We can also see which wavelengths we have surface reflectance measurements at by accessing the wavelength_3d coordinate:

# Check the wavelengths available for PACE 
ds["wavelength_3d"]

Note that “wavelength_3d” is an indexed coordinate, which allows us to subset the dataset by slicing or choosing individual wavelengths. The method=“nearest” argument lets us select one wavelength without knowning its exact value.

Hence, we can select one wavelength even if it does not represent an exact value in the “wavelength_3d” array and plot it using the method=“nearest” argument.

rhos_860 = ds["rhos"].sel({"wavelength_3d": 860}, method="nearest")

fig, ax = plt.subplots(figsize=(9, 5), subplot_kw={"projection": cartopy.crs.PlateCarree()})
ax.gridlines(draw_labels={"left": "y", "bottom": "x"}, linewidth=0.25)
ax.coastlines(linewidth=0.5)
ax.add_feature(cartopy.feature.OCEAN, edgecolor="w", linewidth=0.01)
ax.add_feature(cartopy.feature.LAND, edgecolor="w", linewidth=0.01)
rhos_860.plot(x="longitude", y="latitude", cmap="Greys_r", vmin=0, vmax=1.0)
plt.show()
rhos_860

Masking clouds

#Now flags 
ds["l2_flags"]

We will use the cf_xarray library to create a mask including only land and excluding clouds and ice.

# Create and apply Mask
ds["l2_flags"].cf.is_flag_variable
cldwater_mask = (
    (ds["l2_flags"].cf == "LAND")
& ~(ds["l2_flags"].cf == "CLDICE")
)
# Apply mask, creating new dataset
rhos = ds["rhos"].where(cldwater_mask)
# Plot a band of masked data
rhos_860 = rhos.sel({"wavelength_3d": 860}, method="nearest")

fig, ax = plt.subplots(figsize=(9, 5), subplot_kw={"projection": cartopy.crs.PlateCarree()})
ax.coastlines(linewidth=0.25)
ax.gridlines(draw_labels={"left": "y", "bottom": "x"}, linewidth=0.25)
ax.add_feature(cartopy.feature.OCEAN, edgecolor="w", linewidth=0.01)
ax.add_feature(cartopy.feature.LAND, edgecolor="w", linewidth=0.01)
ax.add_feature(cartopy.feature.LAKES, edgecolor="k", linewidth=0.1)
rhos_860.plot(x="longitude", y="latitude", cmap="Greys_r", vmin=0, vmax=1.0)
plt.show()

Projecting data

From viewing the data structure, we can see that the data is in swath format based on the dimensions. This means that we will need to grid the data to work with it in a typical manner.

There are a couple ways to do this transformation, but for this example, we’ll opt for a simple approach using rioxarray. Note that the process shown here will create a grid unique to this scene, so it won’t easily merge or stack with others created in the same way for something like a time-series analysis.

We encourage you to check the following resources for additional information on data transformation for PACE datasets: SeaDAS and Python Jupyter notebook

To grid the data with rioxarray, we need to change the order of the dimensions in the array. Hyperspectral data is often stored (y,x,band) since most processing is conducted along the band dimension; however, rioxarray expects the dimensions to be in the order of (band, y, x), so we must transpose the array to grid the data using rioxarray. Note that we’ll also want to ensure we’re using nearest neighbor resampling to preserve the spectral information during the gridding process. If we use another method we could be creating artificial spectra which could lead to inaccurate unmixing or classification results.

# Transpose data
sr_src = rhos.transpose("wavelength_3d", ...)
sr_src
# Set spatial dimensions and swath CRS
sr_src = sr_src.rio.set_spatial_dims("pixels_per_line", "number_of_lines").rio.write_crs("epsg:4326")
sr_src
# Grid the data using rioxarray and rename dimensions
sr_dst = sr_src.rio.reproject(
    dst_crs=sr_src.rio.crs,
    src_geoloc_array=(
        sr_src.coords["longitude"],
        sr_src.coords["latitude"],
    ),
    nodata=np.nan,
    resampling=rio.warp.Resampling.nearest,
).rename({'y': 'latitude', 'x': 'longitude'})
sr_dst

From this point, we can either subset the data spatially to a smaller region of interest, or extract a series of location points to work with.

Interactive Spatial and Spectral Plots

Combining the Spatial and Spectral information into a single visualization can be a powerful tool for exploring and inspecting imaging spectroscopy data. Using the streams module from Holoviews we can link a spatial map to a plot of spectra.

We could plot a single band image as we previously have, but using a multiband image, like an RGB may help infer what targets we’re examining. Build function to select and adjust the brightness of bands to create a nice RGB image.

Select bands to represent red (650 nm), green (560 nm), and blue (470 nm) by finding the nearest to a wavelength chosen to represent that color.

rgb_da = sr_dst.sel(wavelength_3d=[650, 560, 470], method='nearest')
rgb_da

To get a better rgb image, we can balance the brightness of the selected wavelengths to make a prettier map. This will not affect the data, just the visuals. To do this we will use the function below. We can change the bright argument to increase or decrease the brightness of the scene as a whole.

def gamma_adjust(rgb_da, bright=0.3):
    """
    Adjust gamma across all bands in the RGB dataset.
    """
    array = rgb_da.data
    # Mask nan and negative values
    invalid = np.isnan(array) | (array < 0)
    valid = ~invalid

    # Calculate gamma based on the mean of valid values
    mean_valid = np.nanmean(array[valid])
    gamma = math.log(bright) / math.log(mean_valid)
    # Apply scaling and clip
    scaled = np.full_like(array, np.nan)
    scaled[valid] = np.power(array[valid], gamma)
    rgb_da.data = np.clip(scaled, 0, 1)
    return rgb_da
rgb_da = gamma_adjust(rgb_da, bright=0.3)
rgb_da.hvplot.rgb(x='longitude', y='latitude', bands='wavelength_3d', frame_height=400, geo=True, crs='EPSG:4326', tiles='OSM', title="Stretched RGB PACE Image")

Now that we have an RGB dataset, we can use that to create a spatial plot, and data selected by clicking on our rgb map, which can be inputs for a function to return values from the full dataset at the selected latitude and longitude location using the cell below. To visualize the spectral and spatial data side-by-side, we use the Point Draw tool from the holoviews library.

Click in the RGB image to add spectra to the plot. You can also click and hold the mouse button then drag previously placed points. To remove a point click and hold the mouse button down, then press the backspace key.

# Starting Point from center pixel
x_start = -85.3879
y_start = 31.3446

# Set Point Limit
POINT_LIMIT = 10

# Set up Color Cycling
color_cycle = hv.Cycle('Category20')

first_point = ([x_start], [y_start], [0])

points = gv.Points(first_point, kdims=['longitude','latitude'], vdims='id', crs=cartopy.crs.PlateCarree())

points_stream = hv.streams.PointDraw(
    data=points.columns(),
    source=points,
    drag=True,
    num_objects=POINT_LIMIT,
    styles={'fill_color': color_cycle.values[:POINT_LIMIT], 'line_color': 'gray'}
)

# RGB Plot without Basemap

rgb_map = rgb_da.hvplot.rgb(x='longitude', y='latitude', bands='wavelength_3d',
                            frame_height=480, frame_width=480,
                            crs=cartopy.crs.PlateCarree(),
                            title="Stretched RGB PACE Image")

# Coastlines
coastlines = gv.feature.coastline(projection=cartopy.crs.PlateCarree()).opts(line_color='black', line_width=1)

posxy = hv.streams.PointerXY(source=rgb_map, x=x_start, y=y_start)
clickxy = hv.streams.Tap(source=rgb_map, x=x_start, y=y_start)

# Function to build spectral plot of clicked location to show on hover stream plot
def click_spectra(data):
    coordinates = [c for c in zip(data['longitude'], data['latitude'])]
    
    plots = {}
    for i, coords in enumerate(coordinates):
        x, y = coords
        selected = sr_dst.sel(longitude=x, latitude=y, method="nearest")
        plots[i] = (
            selected.hvplot.line(
                y="rhos",
                x="wavelength_3d",
                xlim=(340, 895), #you can comment or change the range here to see the multispectral bands too
                label=f"{i}"
            )
        )
        points_stream.data["id"][i] = i
    return hv.NdOverlay(plots).opts(hv.opts.Curve(color=color_cycle))

def hover_spectra(x,y):
    return sr_dst.sel(longitude=x,latitude=y,method='nearest').hvplot.line(y='rhos',x='wavelength_3d',
                                                                           color='black', frame_width=480)
# Define the Dynamic Maps
click_dmap = hv.DynamicMap(click_spectra, streams=[points_stream])
hover_dmap = hv.DynamicMap(hover_spectra, streams=[posxy])

# Plot the Map and Dynamic Map side by side
hv.Layout(hover_dmap*click_dmap + rgb_map * coastlines* points).cols(2).opts(
    hv.opts.Points(active_tools=['point_draw'], size=10, tools=['hover'], color='white', line_color='gray'),
    hv.opts.Overlay(show_legend=False, show_title=False, fontscale=1.5, frame_height=480)
)

We can take these selected points and the corresponding reflectance spectra and save them as a .csv for later use.

data = points_stream.data
wavelengths = sr_dst.wavelength_3d.values

rows = [["id", "longitude", "latitude"] + [str(i) for i in wavelengths]]
 
for p in zip(data['longitude'], data['latitude'], data['id']):
    x, y, i = p
    spectra = sr_dst.sel(longitude=x, latitude=y, method="nearest").values
    row = [i, x, y] + list(spectra)
    rows.append(row)
with open('../data/pace_interactive_plot_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
# In case you cannot save the file, check your directory
# import os
# print(os.getcwd())

Calculate unique indices

Hyperspectral-enabled Vegetation Indices (VI)

We’ll take the Chlorophyll Index Red Edge (CIRE) as an example of a hyperspectral-enabled VI. CIRE uses bands from the red edge and the NIR to get at relative canopy chlorophyll content.

CIRE: (ρ 800 /ρ 705)−1

Carotenoid Content Index (Car): [(1/ρ495)−(1/ρ705)] * ρ800

Information on Hyperspectral enabled indices by PACE here.

## Chlorophyll Index Red Edge (CIRE)
# Select bands
sr_800 = sr_dst.sel({"wavelength_3d": 800}, method="nearest")
sr_705 = sr_dst.sel({"wavelength_3d": 705}, method="nearest")
#Calculate
cire = (sr_800 / sr_705) - 1
cire.attrs["long_name"] = "Chlorophyll Index Red Edge (CIRE)"
cire
#Plot
map1 = cire.hvplot.image(x='longitude', y='latitude', frame_height=480, geo=True, cmap="viridis", tiles='CartoDark', title="CIRE from PACE OCI")
map1
## Carotenoid Content Index (Car) 
# Select additional bands
sr_495 = sr_dst.sel({"wavelength_3d": 495}, method="nearest")

#Calculate
car = ((1 / sr_495)- (1 / sr_705)) * sr_800
car.attrs["long_name"] = "Caretenoid Content Index (Car)"
map2 = car.hvplot.image(x='longitude', y='latitude', frame_height=480, geo=True, cmap="plasma", tiles='CartoDark', title="CAR from PACE OCI")
map2
# Compare CIRE and CAR side to side
map1.opts(frame_height=400, frame_width=480)+map2.opts(frame_height=400, frame_width=480)

Enjoy PACE data

Calculate your own vegetation index

  1. Select relevant bands
  2. Calculate index
  3. Plot data

Contact Info:

Email: LPDAAC@usgs.gov
Voice: +1-866-573-3222
Organization: Land Processes Distributed Active Archive Center (LP DAAC)¹
Website: https://www.earthdata.nasa.gov/centers/lp-daac

¹Work performed under USGS contract 140G0121D0001 for NASA contract NNG14HH33I.