File Structure at Three Processing Levels for the Ocean Color Instrument (OCI)

File Structure at Three Processing Levels for the Ocean Color Instrument (OCI)#

Author(s): Anna Windle (NASA, SSAI), Ian Carroll (NASA, UMBC), Carina Poulin (NASA, SSAI)

Last updated: August 3, 2025

Summary#

In this example we will use the earthaccess package to access OCI Level-1B (L1B), Level-2 (L2), and Level-3-Mapped (L3M) NetCDF files and open them using xarray. While you can learn alot exploring the datasets in this way, be ready to refer to the full dataset documentation for details about the data products and processing.

NetCDF (Network Common Data Format) is a binary file format for storing multidimensional scientific data (variables). It is optimized for array-oriented data access and support a machine-independent format for representing scientific data. Files ending in .nc are NetCDF files.

XArray is a package that supports the use of multi-dimensional arrays in Python. It is widely used to handle Earth observation data, which often involves multiple dimensions — for instance, longitude, latitude, time, and channels/bands.

Learning Objectives#

At the end of this notebok you will know:

  • How to find groups in a NetCDF file

  • How to use xarray to open OCI data

  • What key variables are present in the groups within OCI L1B, L2, and L3M files

1. Setup#

Begin by importing all of the packages used in this notebook. If your kernel uses an environment defined following the guidance on the tutorials page, then the imports will be successful.

import cartopy.crs as ccrs
import earthaccess
import h5netcdf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

Set (and persist to your user profile on the host, if needed) your Earthdata Login credentials.

auth = earthaccess.login(persist=True)

2. Explore L1B File Structure#

Let’s use xarray to open up a OCI L1B NetCDF file using earthaccess. We will use the same search method used in OCI Data Access. Note that L1B files do not include cloud coverage metadata, so we cannot use that filter.

tspan = ("2024-05-01", "2024-05-07")
bbox = (-76.75, 36.97, -75.74, 39.01)
results = earthaccess.search_data(
    short_name="PACE_OCI_L1B_SCI",
    temporal=tspan,
    bounding_box=bbox,
)
paths = earthaccess.open(results)

We want to know whether we are running code on a remote host with direct access to the NASA Earthdata Cloud. If without direct access, consider the substitution explained in the Data Access notebook to download granules.

Let’s open the first file of the L1B files list:

dataset = xr.open_dataset(paths[0])
dataset
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Attributes: (12/36)
    title:                             PACE OCI Level-1B Data
    instrument:                        OCI
    platform:                          PACE
    processing_level:                  L1B
    cdm_data_type:                     swath
    geospatial_lat_units:              degrees_north
    ...                                ...
    time_coverage_start:               2024-05-01T16:53:11.072Z
    time_coverage_end:                 2024-05-01T16:58:10.954Z
    processing_version:                3.0.1
    identifier_product_doi_authority:  https://dx.doi.org
    identifier_product_doi:            10.5067/PACE/OCI/L1B/SCI/3
    history:                           2025-08-06T20:00:09Z: l1bgen_oci ephfi...

Notice that this xarray.Dataset has nothing but “Attributes”. Instead of an xarray.Dataset, we want an xarray.DataTree to open NetCDF files that define more than one group of variable. These “Groups” are almost equivalent to an xarray.Dataset, the difference is that a group can also contain another group!

datatree = xr.open_datatree(paths[0])
datatree
<xarray.DataTree>
Group: /
│   Attributes: (12/36)
│       title:                             PACE OCI Level-1B Data
│       instrument:                        OCI
│       platform:                          PACE
│       processing_level:                  L1B
│       cdm_data_type:                     swath
│       geospatial_lat_units:              degrees_north
│       ...                                ...
│       time_coverage_start:               2024-05-01T16:53:11.072Z
│       time_coverage_end:                 2024-05-01T16:58:10.954Z
│       processing_version:                3.0.1
│       identifier_product_doi_authority:  https://dx.doi.org
│       identifier_product_doi:            10.5067/PACE/OCI/L1B/SCI/3
│       history:                           2025-08-06T20:00:09Z: l1bgen_oci ephfi...
├── Group: /sensor_band_parameters
│       Dimensions:                (blue_bands: 119, red_bands: 163, SWIR_bands: 9,
│                                   HAM_sides: 2, polarization_coefficients: 3)
│       Dimensions without coordinates: blue_bands, red_bands, SWIR_bands, HAM_sides,
│                                       polarization_coefficients
│       Data variables: (12/13)
│           blue_wavelength        (blue_bands) float32 476B ...
│           blue_solar_irradiance  (blue_bands) float32 476B ...
│           red_wavelength         (red_bands) float32 652B ...
│           red_solar_irradiance   (red_bands) float32 652B ...
│           SWIR_wavelength        (SWIR_bands) float32 36B ...
│           SWIR_bandpass          (SWIR_bands) float32 36B ...
│           ...                     ...
│           blue_m12_coef          (blue_bands, HAM_sides, polarization_coefficients) float32 3kB ...
│           blue_m13_coef          (blue_bands, HAM_sides, polarization_coefficients) float32 3kB ...
│           red_m12_coef           (red_bands, HAM_sides, polarization_coefficients) float32 4kB ...
│           red_m13_coef           (red_bands, HAM_sides, polarization_coefficients) float32 4kB ...
│           SWIR_m12_coef          (SWIR_bands, HAM_sides, polarization_coefficients) float32 216B ...
│           SWIR_m13_coef          (SWIR_bands, HAM_sides, polarization_coefficients) float32 216B ...
├── Group: /scan_line_attributes
│       Dimensions:             (scans: 1710)
│       Dimensions without coordinates: scans
│       Data variables:
│           time                (scans) datetime64[ns] 14kB ...
│           HAM_side            (scans) float32 7kB ...
│           scan_quality_flags  (scans) float32 7kB ...
├── Group: /geolocation_data
│       Dimensions:         (scans: 1710, pixels: 1272)
│       Coordinates:
│           latitude        (scans, pixels) float32 9MB ...
│           longitude       (scans, pixels) float32 9MB ...
│       Dimensions without coordinates: scans, pixels
│       Data variables:
│           height          (scans, pixels) float32 9MB ...
│           watermask       (scans, pixels) float32 9MB ...
│           sensor_azimuth  (scans, pixels) float64 17MB ...
│           sensor_zenith   (scans, pixels) float64 17MB ...
│           solar_azimuth   (scans, pixels) float64 17MB ...
│           solar_zenith    (scans, pixels) float64 17MB ...
│           quality_flag    (scans, pixels) float32 9MB ...
├── Group: /navigation_data
│       Dimensions:           (scans: 1710, quaternion_elements: 4, vector_elements: 3,
│                              pixels: 1272)
│       Dimensions without coordinates: scans, quaternion_elements, vector_elements,
│                                       pixels
│       Data variables:
│           att_quat          (scans, quaternion_elements) float32 27kB ...
│           att_ang           (scans, vector_elements) float32 21kB ...
│           orb_pos           (scans, vector_elements) float32 21kB ...
│           orb_vel           (scans, vector_elements) float32 21kB ...
│           sun_ref           (scans, vector_elements) float32 21kB ...
│           tilt_angle        (scans) float32 7kB ...
│           CCD_scan_angles   (scans, pixels) float32 9MB ...
│           SWIR_scan_angles  (scans, pixels) float32 9MB ...
└── Group: /observation_data
        Dimensions:    (blue_bands: 119, scans: 1710, pixels: 1272, red_bands: 163,
                        SWIR_bands: 9)
        Dimensions without coordinates: blue_bands, scans, pixels, red_bands, SWIR_bands
        Data variables:
            rhot_blue  (blue_bands, scans, pixels) float32 1GB ...
            rhot_red   (red_bands, scans, pixels) float32 1GB ...
            rhot_SWIR  (SWIR_bands, scans, pixels) float32 78MB ...
            qual_blue  (blue_bands, scans, pixels) float32 1GB ...
            qual_red   (red_bands, scans, pixels) float32 1GB ...
            qual_SWIR  (SWIR_bands, scans, pixels) float32 78MB ...
        Attributes:
            striping_corrected:  yes

Now you can view the Dimensions, Coordinates, and Variables of each group. To show/hide any category, like “Groups”, toggle the drop-down arrow. To show/hide attributes, press the piece-of-paper icon on the right hand side of a variable. To show/hide data representation, press the stacked-cylinders icon. For instance, you could check the attributes on “rhot_blue” to see that this variable is the “Top of Atmosphere Blue Band Reflectance”.

The dimensions of the “rhot_blue” variable are (“blue_bands”, “number_of_scans”, “ccd_pixels”), as shown by the sizes attribute of the variable. This gives us a dictionary of the array’s size in each dimension, keyed on the dimension names.

datatree["observation_data"]["rhot_blue"].sizes
Frozen({'blue_bands': 119, 'scans': 1710, 'pixels': 1272})

Let’s plot the reflectance at postion 100 in the “blue_bands” dimension.

plot = datatree["observation_data"]["rhot_blue"].sel({"blue_bands": 100}).plot()
../../_images/eb543d41a762056ea4e58a17112efce86f7e6e9105b9bbf3caa6ac2daab16c7d.png

3. Explore L2 File Structure#

OCI L2 files include retrievals of geophysical variables, such as Apparent Optical Properties (AOP), for each L1 swath. We’ll use the same earthaccess search for L2 AOP data. Although now we can use cloud_cover too.

clouds = (0, 50)
results = earthaccess.search_data(
    short_name="PACE_OCI_L2_AOP",
    temporal=tspan,
    bounding_box=bbox,
    cloud_cover=clouds,
)
paths = earthaccess.open(results)

Let’s look at the “geophysical_data” group, which is a new group generated by the level 2 processing, and the “Rrs” variable in particular.

datatree = xr.open_datatree(paths[0])
rrs = datatree["geophysical_data"]["Rrs"]
rrs
<xarray.DataArray 'Rrs' (number_of_lines: 1709, pixels_per_line: 1272,
                         wavelength_3d: 172)> Size: 1GB
[373901856 values with dtype=float32]
Dimensions without coordinates: number_of_lines, pixels_per_line, wavelength_3d
Attributes:
    long_name:      Remote sensing reflectance
    units:          sr^-1
    standard_name:  surface_ratio_of_upwelling_radiance_emerging_from_sea_wat...
    valid_min:      -30000
    valid_max:      25000
rrs.sizes
Frozen({'number_of_lines': 1709, 'pixels_per_line': 1272, 'wavelength_3d': 172})

The Rrs variable has 172 values in the wavelength_3d; the blue, red, and SWIR wavelengths have been combined. Note that wavelength_3d is one of the dimensions, but also that the Rrs variable doesn’t have any “Coordinates” or “Indexes”. Indices let us look up positions in an array by the value of it’s coordinates, so we need to dig more variables out of this L2 file. The coordinates variable for wavelength_3d is in the “sensor_band_parameters” group. When we add this variable to the rrs array, xarray creates the corresponding index.

rrs["wavelength_3d"] = datatree["sensor_band_parameters"]["wavelength_3d"]
rrs
<xarray.DataArray 'Rrs' (number_of_lines: 1709, pixels_per_line: 1272,
                         wavelength_3d: 172)> Size: 1GB
[373901856 values with dtype=float32]
Coordinates:
  * wavelength_3d  (wavelength_3d) float64 1kB 346.0 348.0 351.0 ... 717.0 719.0
Dimensions without coordinates: number_of_lines, pixels_per_line
Attributes:
    long_name:      Remote sensing reflectance
    units:          sr^-1
    standard_name:  surface_ratio_of_upwelling_radiance_emerging_from_sea_wat...
    valid_min:      -30000
    valid_max:      25000

Now that wavelength_3d shows up under “Coordinates” and “Indexes” we can reference wavelengths by value.

plot = rrs.sel({"wavelength_3d": 440}).plot(cmap="viridis", robust=True)
../../_images/54c772d557ed18325902254ec54085511698dec1ff574956c59ccaf20f7a05de.png

The scene is being plotted using number_of_lines and pixels_per_line as “x” and “y”, respectively. We need to add more coordinates, the latitude and longitude, for a true map. These coordinates variables are in the “navigation_data” group.

for item in ("longitude", "latitude"):
    rrs[item] = datatree["navigation_data"][item]
rrs
<xarray.DataArray 'Rrs' (number_of_lines: 1709, pixels_per_line: 1272,
                         wavelength_3d: 172)> Size: 1GB
[373901856 values with dtype=float32]
Coordinates:
  * wavelength_3d  (wavelength_3d) float64 1kB 346.0 348.0 351.0 ... 717.0 719.0
    longitude      (number_of_lines, pixels_per_line) float32 9MB ...
    latitude       (number_of_lines, pixels_per_line) float32 9MB ...
Dimensions without coordinates: number_of_lines, pixels_per_line
Attributes:
    long_name:      Remote sensing reflectance
    units:          sr^-1
    standard_name:  surface_ratio_of_upwelling_radiance_emerging_from_sea_wat...
    valid_min:      -30000
    valid_max:      25000

Although we now have coordinates, they won’t immediately help because the data are not gridded by latitude and longitude. L2 data come in the original instrument swath and have not been resampled to a regular grid. That is why latitude and longitude are two-dimensional coordinates, and why the are not also “Indexes” like wavelength_3d. Latitude and longitude are present, but cannot be used immediately to look up values like you can with coordinates that are also indices.

Let’s make a scatter plot of some pixel locations so we can see the irregular spacing of latitude and longitude. By selecting a slice with a step size larger than one, we get a subset of the locations for better visualization.

plot = datatree["navigation_data"].sel(
    {
        "number_of_lines": slice(None, None, 1720 // 20),
        "pixels_per_line": slice(None, None, 1272 // 20),
    },
).dataset.plot.scatter(x="longitude", y="latitude")
../../_images/79f9ce26e091fe9629cf98f62039ea4c920ad6a112c6b63e93bccef108c7fb17.png

Despite not having indices, let’s plot Rrs the same way as before, except for the addition of latitude and longitude.

rrs_sel = rrs.sel({"wavelength_3d": 440})
plot = rrs_sel.plot(x="longitude", y="latitude", cmap="viridis", robust=True)
../../_images/8a730035b777a3d29755445c9f4eb80c0be06ce36e9573b4bdcecd802a85834c.png

Plotting tools can handle non-index Coordinates! You can even visualize the data as if projected onto a grid. If you wanna get fancy, add a coastline.

fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})
im = rrs_sel.plot(x="longitude", y="latitude", cmap="viridis", robust=True, ax=ax)
ax.gridlines(draw_labels={"left": "y", "bottom": "x"})
ax.coastlines()
plt.show()
../../_images/42e9eb1d273887100c629a2be80acaf3a9503038b0fbf4242b7e5c0be50036e2.png

Let’s plot the full “Rrs” spectrum for individual pixels. A visualization with all the pixels wouldn’t be useful, but limiting to a bounding box gives a simple way to subset pixels. Note that, since we don’t have gridded data (i.e. our latitude and longitude coordinates are two-dimensional), we can’t slice on these values. Without getting into anything complex, we will use a where with logical tests.

rrs_box = rrs.where(
    (
        (rrs["latitude"] > 26.0)
        & (rrs["latitude"] < 26.1)
        & (rrs["longitude"] > -84.1)
        & (rrs["longitude"] < -84.0)
    ),
    drop=True,
)
rrs_box.sizes
Frozen({'number_of_lines': 10, 'pixels_per_line': 3, 'wavelength_3d': 172})

The line plotting method will only draw a line plot for 1D data, which we can get by stacking our two spatial dimensions and choosing to show the new “pixel dimension” in different hues.

rrs_stack = rrs_box.stack(
    {"pixel": ["number_of_lines", "pixels_per_line"]},
    create_index=False,
)
plot = rrs_stack.plot(hue="pixel", add_legend=False)
../../_images/96d6049d25e73d6d8ea928000e3bf5f437256b648660914701539420b18e33ac.png

4. Explore L3M File Structure#

At L3M there are binned (B) and mapped (M) products available. The L3M remote sensing reflectance (Rrs) files contain global maps of the full spectra, so they can be big! We’ll use the same earthaccess method to find the data.

results = earthaccess.search_data(
    short_name="PACE_OCI_L3M_RRS",
    temporal=tspan,
)
len(results)
28

The L3M collections contain granules at different temporal and spatial resolutions. Let’s take a closer look at one of the larger sized granules first, using the metadata available on each result.

for item in results:
    if item.size() > 2000:
        display(item)
        break

Data: PACE_OCI.20240321_20240620.L3m.SNSP.RRS.V3_1.Rrs.4km.nc

Size: 3150.73 MB

Cloud Hosted: True

Data Preview

Notice some special parts in the granule name:

  1. Right after L3m you have a period indicator. The value DAY means a daily aggregate, 8D means an 8-day aggregate, MO means a monthly aggregate, and SN(SP|SU|AU|WI) indicates one of four seasonal aggregates.

  2. Right after chlor_a you have a spatial resolution marker, as horizontal cell size at the equator given either in km or deg. Read p as a decimal point.

Within a specific collection, earthaccess can retrieve granules whose name matches a pattern given in the granule_name argument. Let’s use that to narrow our search to values aggregated to a month and 0.1 degree resolutions.

results = earthaccess.search_data(
    short_name="PACE_OCI_L3M_RRS",
    granule_name="*.MO.*.4km.*",
    temporal=tspan,
)
len(results)
1
paths = earthaccess.open(results)

L3M data do not have any groups, so we can open as a dataset rather than a datatree.

dataset = xr.open_dataset(paths[0])
dataset
<xarray.Dataset> Size: 26GB
Dimensions:     (lat: 4320, lon: 8640, wavelength: 172, rgb: 3,
                 eightbitcolor: 256)
Coordinates:
  * lat         (lat) float32 17kB 89.98 89.94 89.9 ... -89.9 -89.94 -89.98
  * lon         (lon) float32 35kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0
  * wavelength  (wavelength) float64 1kB 346.0 348.0 351.0 ... 714.0 717.0 719.0
Dimensions without coordinates: rgb, eightbitcolor
Data variables:
    Rrs         (lat, lon, wavelength) float32 26GB ...
    palette     (rgb, eightbitcolor) uint8 768B ...
Attributes: (12/64)
    product_name:                      PACE_OCI.20240501_20240531.L3m.MO.RRS....
    instrument:                        OCI
    title:                             OCI Level-3 Standard Mapped Image
    project:                           Ocean Biology Processing Group (NASA/G...
    platform:                          PACE
    source:                            satellite observations from OCI-PACE
    ...                                ...
    identifier_product_doi:            10.5067/PACE/OCI/L3M/RRS/3.1
    keywords:                          Earth Science > Oceans > Ocean Optics ...
    keywords_vocabulary:               NASA Global Change Master Directory (G...
    data_bins:                         16899424
    data_minimum:                      -0.009997999
    data_maximum:                      0.09772633

Notice that this L3M granule has wavelength, lat, and lon under “Indexes”, so it’s easy to slice out a bounding box and map the “Rrs” variable at a given wavelength.

rrs = dataset["Rrs"]
rrs_440_bbox = rrs.sel({
    "wavelength": 440,
    "lat": slice(bbox[3], bbox[1]),
    "lon": slice(bbox[0], bbox[2]),
})
rrs_440_bbox
<xarray.DataArray 'Rrs' (lat: 49, lon: 24)> Size: 5kB
[1176 values with dtype=float32]
Coordinates:
  * lat         (lat) float32 196B 38.98 38.94 38.9 38.85 ... 37.06 37.02 36.98
  * lon         (lon) float32 96B -76.73 -76.69 -76.65 ... -75.85 -75.81 -75.77
    wavelength  float64 8B 440.0
Attributes:
    long_name:      Remote sensing reflectance
    units:          sr^-1
    standard_name:  surface_ratio_of_upwelling_radiance_emerging_from_sea_wat...
    valid_min:      -30000
    valid_max:      25000
    display_scale:  linear
    display_min:    0.0
    display_max:    0.025
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})
im = rrs_440_bbox.plot(x="lon", y="lat", cmap="viridis", robust=True, ax=ax)
ax.gridlines(draw_labels={"left": "y", "bottom": "x"})
ax.coastlines()
plt.show()
../../_images/feb7450078aff9eb19caf14a9bcfa744d76098b67c019bf9ccb74213c98ad3a2.png

Also becuase the L3M variables have gridded lat and lon coordinates, it’s possible to stack multiple granules along a new dimension that corresponds to time. Using the granule_name filter, let’s get low spatial-resolution chlorophyll-a data to see daily changes.

results = earthaccess.search_data(
    short_name="PACE_OCI_L3M_CHL",
    temporal=tspan,
    granule_name="*.DAY.*.0p1deg.*",
)
paths = earthaccess.open(results)

Instead of xr.open_dataset, we use xr.open_mfdataset to create a single xarray.Dataset (the “mf” in open_mfdataset stands for multiple files) from an array of paths. The paths list is sorted temporally by default, which means the shape of the paths array specifies the way we need to tile the files together into larger arrays. We specify combine="nested" to combine the files according to the shape of the array of files (or file-like objects), even though paths is not a “nested” list in this case. The concat_dim="date" argument generates a new dimension in the combined dataset, because “date” is not an existing dimension in the individual files.

dataset = xr.open_mfdataset(
    paths,
    combine="nested",
    concat_dim="date",
)

Add a date variable as a coordinate using the dates from the netCDF files.

dates = [xr.open_dataset(i).attrs["time_coverage_end"] for i in paths]
dt = pd.to_datetime(dates)
dataset = dataset.assign_coords(date=dt.values)
dataset
<xarray.Dataset> Size: 181MB
Dimensions:  (date: 7, lat: 1800, lon: 3600, rgb: 3, eightbitcolor: 256)
Coordinates:
  * date     (date) datetime64[ns] 56B 2024-05-02T02:28:09.020000 ... 2024-05...
  * lat      (lat) float32 7kB 89.95 89.85 89.75 89.65 ... -89.75 -89.85 -89.95
  * lon      (lon) float32 14kB -179.9 -179.9 -179.8 ... 179.8 179.9 180.0
Dimensions without coordinates: rgb, eightbitcolor
Data variables:
    chlor_a  (date, lat, lon) float32 181MB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
    palette  (date, rgb, eightbitcolor) uint8 5kB dask.array<chunksize=(1, 3, 256), meta=np.ndarray>
Attributes: (12/64)
    product_name:                      PACE_OCI.20240501.L3m.DAY.CHL.V3_1.chl...
    instrument:                        OCI
    title:                             OCI Level-3 Standard Mapped Image
    project:                           Ocean Biology Processing Group (NASA/G...
    platform:                          PACE
    source:                            satellite observations from OCI-PACE
    ...                                ...
    identifier_product_doi:            10.5067/PACE/OCI/L3M/CHL/3.1
    keywords:                          Earth Science > Oceans > Ocean Chemist...
    keywords_vocabulary:               NASA Global Change Master Directory (G...
    data_bins:                         736046
    data_minimum:                      0.005030712
    data_maximum:                      98.41369

A common reason to generate a single dataset from multiple, daily images is to create a composite. Compare the map from a single day …

chla = np.log10(dataset["chlor_a"])
chla.attrs.update(
    {
        "units": f'log({dataset["chlor_a"].attrs["units"]})',
    }
)
im = chla.sel(date="2024-05-02").plot(aspect=2, size=4, cmap="GnBu_r")
../../_images/faa899fd05ec7fef677870b27cc50b351b1e74f7635ce0cd312754648b8ee0f2.png

… to a map of average values, which ignore missing values due to clouds.

chla_avg = chla.mean("date", keep_attrs=True)
im = chla_avg.plot(aspect=2, size=4, cmap="GnBu_r")
../../_images/d0fb0b60a0bc38c0e863b8fe90b77a02ae3d44d3f5cb817190ce41a9a7839358.png

We can also create a time series of mean values over the whole region.

chla_avg = chla.mean(dim=["lon", "lat"], keep_attrs=True)
im = chla_avg.plot(linestyle="-", marker="o", color="b")
../../_images/7261bf454e96a543135fe68d21ecde224a764d5692f06307f672dfbed33416a5.png