0. Getting Started with xarray
NOC Near-Present Day Hackathon¶
Tutorial 0. Getting Started with xarray for Ocean Science¶
Contact:¶
Ollie Tooth (oliver.tooth@noc.ac.uk)
Background:¶
Ocean models produce large, multi-dimensional fields (e.g. temperature, salinity, velocity) that vary in time and space and are described by physical coordinates (e.g., depth). Additionally, scalar and vector output variables are often stored on different, staggered model grids (e.g., Arakawa A-, B-, C-grids), meaning that we must often transform variables prior to performing any computations.
xarray is an open-source Python library which provides data structures that explicitly represent the physical coordinates, rather than treating model outputs as anonymous multi-dimensional arrays. In practice, this means we can perform dimension-aware computations using dimension labels, such as time, depth, etc., which automatically align variables stored on the same grid or sharing the same time dimension.
This Tutorial provides an introduction to xarray for ocean model analysis and is intended as a quick reference that you can return to during the hackathon and beyond.
0.1 Basics¶
xarray extends the NumPy and pandas Python libraries to handle labelled, multi-dimensional arrays.
Fundamental Ideas:
- Each dimensions has a name (e.g., "time", "depth", "lat", "lon").
- Coordinates are attached to dimensions (e.g., depth values in meters).
- Metadata (units, long_name, etc.) are preserved.
Key advantages:
- Labelled dimensions → No guessing axis order.
- Automatic alignment & broadcasting → Safer calculations.
- Lazy loading + Dask support → scalable to large datasets O(TBs).
- Seamless reading & writing of NetCDF files
- Integrated plotting features
Core Data Structures:
- DataArray: a single variable + coordinates + attributes
- Dataset: a collection of DataArrays sharing coordinates
- DataTree: a tree-like hierarchical collection of xarray DataArray or Dataset objects.
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
0.2 Opening an Example Dataset¶
xarray includes several example datasets.
Here, we will use the NOAA Extended Reconstructed global Sea Surface Temperature (SST) monthly averages dataset.
This is similar in structure to typical ocean model surface output:
- Dimensions:
time×lat×lon - Variable:
sst - Includes metadata and coordinates
ds = xr.tutorial.load_dataset("ersstv5")
ds
<xarray.Dataset> Size: 40MB
Dimensions: (time: 624, nbnds: 2, lat: 89, lon: 180)
Coordinates:
* time (time) datetime64[ns] 5kB 1970-01-01 1970-02-01 ... 2021-12-01
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 ... 352.0 354.0 356.0 358.0
Dimensions without coordinates: nbnds
Data variables:
time_bnds (time, nbnds) float64 10kB 9.969e+36 9.969e+36 ... 9.969e+36
sst (time, lat, lon) float32 40MB -1.8 -1.8 -1.8 -1.8 ... nan nan nan
Attributes: (12/37)
climatology: Climatology is based on 1971-2000 SST, Xue, Y....
description: In situ data: ICOADS2.5 before 2007 and NCEP i...
keywords_vocabulary: NASA Global Change Master Directory (GCMD) Sci...
keywords: Earth Science > Oceans > Ocean Temperature > S...
instrument: Conventional thermometers
source_comment: SSTs were observed by conventional thermometer...
... ...
creator_url_original: https://www.ncei.noaa.gov
license: No constraints on data access or use
comment: SSTs were observed by conventional thermometer...
summary: ERSST.v5 is developed based on v4 after revisi...
dataset_title: NOAA Extended Reconstructed SST V5
data_modified: 2022-06-07# Inspect dimensions, coordinates, variables
print(ds.dims)
print(ds.coords)
print(ds.data_vars)
FrozenMappingWarningOnValuesAccess({'time': 624, 'nbnds': 2, 'lat': 89, 'lon': 180})
Coordinates:
* time (time) datetime64[ns] 5kB 1970-01-01 1970-02-01 ... 2021-12-01
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
Data variables:
time_bnds (time, nbnds) float64 10kB 9.969e+36 9.969e+36 ... 9.969e+36
sst (time, lat, lon) float32 40MB -1.8 -1.8 -1.8 -1.8 ... nan nan nan
0.3 Data Structures: Datasets & DataArrays¶
xarray has two core data structures:
Dataset
- Container of multiple variables stored as DataArrays
- Similar to a NetCDF file in memory
DataArray
DataArrays contain underlying arrays and associated metadata:
- Name
- Dimension names
- Coordinate variables
- and arbitrary attributes.
sst = ds["sst"]
sst.dims, sst.shape
(('time', 'lat', 'lon'), (624, 89, 180))
sst.coords
Coordinates: * time (time) datetime64[ns] 5kB 1970-01-01 1970-02-01 ... 2021-12-01 * lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0 * lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
sst.attrs
{'long_name': 'Monthly Means of Sea Surface Temperature',
'units': 'degC',
'var_desc': 'Sea Surface Temperature',
'level_desc': 'Surface',
'statistic': 'Mean',
'dataset': 'NOAA Extended Reconstructed SST V5',
'parent_stat': 'Individual Values',
'actual_range': array([-1.8 , 42.32636], dtype=float32),
'valid_range': array([-1.8, 45. ], dtype=float32)}
0.4 Indexing & Selection¶
Numeric Indexing (NumPy-style)
- Uses integer positions
- Fragile if dimension order changes
Label-Based Indexing (xarray-style)
- Uses dimension names
- Safer and more readable
# Selecting the first time slice with numerical indexing:
sst.isel(time=0)
<xarray.DataArray 'sst' (lat: 89, lon: 180)> Size: 64kB
array([[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]],
shape=(89, 180), dtype=float32)
Coordinates:
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
time datetime64[ns] 8B 1970-01-01
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]0.4.1 Temporal Subsetting¶
# Selecting a time slice by datetime string (e.g., "YYYY-MM-DD"):
sst.sel(time="1990-01-01")
<xarray.DataArray 'sst' (lat: 89, lon: 180)> Size: 64kB
array([[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]],
shape=(89, 180), dtype=float32)
Coordinates:
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
time datetime64[ns] 8B 1990-01-01
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]0.4.2 Spatial Subsetting¶
# Selecting a range of longitudes and latitudes in the tropical Pacific Ocean:
trop_pac = sst.sel(
lat=slice(-10, 10),
lon=slice(160, 260)
)
trop_pac
<xarray.DataArray 'sst' (time: 624, lat: 0, lon: 51)> Size: 0B
array([], shape=(624, 0, 51), dtype=float32)
Coordinates:
* time (time) datetime64[ns] 5kB 1970-01-01 1970-02-01 ... 2021-12-01
* lat (lat) float32 0B
* lon (lon) float32 204B 160.0 162.0 164.0 166.0 ... 256.0 258.0 260.0
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]# Plot initial time slice of global sea surface temperature field:
sst.isel(time=0).plot()
plt.title("Global SST (first time slice)")
plt.show()
# Plot time-series of sea surface temperature at grid point nearest to (200°E, 0°N):
sst_eq = sst.sel(lat=0, lon=200, method="nearest")
sst_eq.plot()
plt.title("Equatorial SST time series")
plt.show()
# Calculate climatology of sea surface temperature:
sst_mean = sst.mean(dim="time")
sst_mean.plot()
plt.title("Mean SST")
plt.show()
# Calculate zonal mean sea surface temperature for each month:
sst_zonal = sst.mean(dim="lon")
# Plot Hovmoller of monthly zonal mean sea surface temperature:
sst_zonal.plot()
plt.title("Zonal mean SST")
plt.show()
0.6.2 High-Level Operations: Group-By¶
xarray has some very useful high level objects that let you do common computations:
groupby: Bin data in to groups and reduceresample: Groupby specialized for time axes. Either downsample or upsample your data.rolling: Operate on rolling windows of your data e.g. running meancoarsen: Downsample your dataweighted: Weight your data before reducing
Group-By is essential for:
- Seasonal cycles
- Climatologies
- Anomaly calculations
# Calculate monthly climatologies of sea surface temperature:
sst_clim = sst.groupby("time.month").mean(dim="time")
sst_clim
<xarray.DataArray 'sst' (month: 12, lat: 89, lon: 180)> Size: 769kB
array([[[-1.8000005, -1.8000005, -1.8000005, ..., -1.8000005,
-1.8000005, -1.8000005],
[-1.8000005, -1.8000005, -1.8000005, ..., -1.8000005,
-1.8000005, -1.8000005],
[-1.8000005, -1.8000005, -1.8000005, ..., -1.8000005,
-1.8000005, -1.8000005],
...,
[ nan, nan, nan, ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan]],
[[-1.8000005, -1.8000005, -1.8000005, ..., -1.8000005,
-1.8000005, -1.8000005],
[-1.8000005, -1.8000005, -1.8000005, ..., -1.8000005,
-1.8000005, -1.8000005],
[-1.8000005, -1.8000005, -1.8000005, ..., -1.8000005,
-1.8000005, -1.8000005],
...
[ nan, nan, nan, ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan]],
[[-1.7994802, -1.7995852, -1.7998399, ..., -1.7997632,
-1.7996233, -1.7994753],
[-1.7995453, -1.7997496, -1.8000005, ..., -1.8000005,
-1.799799 , -1.7995732],
[-1.8000005, -1.8000005, -1.8000005, ..., -1.8000005,
-1.8000005, -1.8000005],
...,
[ nan, nan, nan, ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan]]], shape=(12, 89, 180), dtype=float32)
Coordinates:
* month (month) int64 96B 1 2 3 4 5 6 7 8 9 10 11 12
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]xarray also lets you easily visualize 3D and 4D datasets by presenting multiple facets (or panels or subplots) showing variations across rows and/or columns.
# Calculate SST anomalies by removing the climatology:
sst_anom = sst.groupby("time.month") - sst_clim
sst_anom
<xarray.DataArray 'sst' (time: 624, lat: 89, lon: 180)> Size: 40MB
array([[[ 5.96046448e-07, 5.96046448e-07, 5.96046448e-07, ...,
5.96046448e-07, 5.96046448e-07, 5.96046448e-07],
[ 5.96046448e-07, 5.96046448e-07, 5.96046448e-07, ...,
5.96046448e-07, 5.96046448e-07, 5.96046448e-07],
[ 5.96046448e-07, 5.96046448e-07, 5.96046448e-07, ...,
5.96046448e-07, 5.96046448e-07, 5.96046448e-07],
...,
[ nan, nan, nan, ...,
nan, nan, nan],
[ nan, nan, nan, ...,
nan, nan, nan],
[ nan, nan, nan, ...,
nan, nan, nan]],
[[ 5.96046448e-07, 5.96046448e-07, 5.96046448e-07, ...,
5.96046448e-07, 5.96046448e-07, 5.96046448e-07],
[ 5.96046448e-07, 5.96046448e-07, 5.96046448e-07, ...,
5.96046448e-07, 5.96046448e-07, 5.96046448e-07],
[ 5.96046448e-07, 5.96046448e-07, 5.96046448e-07, ...,
5.96046448e-07, 5.96046448e-07, 5.96046448e-07],
...
nan, nan, nan],
[ nan, nan, nan, ...,
nan, nan, nan],
[ nan, nan, nan, ...,
nan, nan, nan]],
[[-5.19752502e-04, -4.14729118e-04, -1.60098076e-04, ...,
-2.36749649e-04, -3.76701355e-04, -5.24640083e-04],
[-4.54664230e-04, -2.50339508e-04, 5.96046448e-07, ...,
5.96046448e-07, -2.00986862e-04, -4.26769257e-04],
[ 5.96046448e-07, 5.96046448e-07, 5.96046448e-07, ...,
5.96046448e-07, 5.96046448e-07, 5.96046448e-07],
...,
[ nan, nan, nan, ...,
nan, nan, nan],
[ nan, nan, nan, ...,
nan, nan, nan],
[ nan, nan, nan, ...,
nan, nan, nan]]],
shape=(624, 89, 180), dtype=float32)
Coordinates:
* time (time) datetime64[ns] 5kB 1970-01-01 1970-02-01 ... 2021-12-01
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
month (time) int64 5kB 1 2 3 4 5 6 7 8 9 10 11 ... 3 4 5 6 7 8 9 10 11 12
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]0.6.3 High-Level Operations: Weighted¶
xarray supports weighted reductions.
Let's create a “weights” array proportional to the cosine of latitude, which is equivalent to an area-weighting factor for data stored on a regular latitude-longitude grid.
# Define weights using cosine of latitude:
weights = np.cos(np.deg2rad(ds.lat))
# Create a DataArrayWeighted object:
sst_weighted = ds.sst.weighted(weights)
sst_weighted
DataArrayWeighted with weights along dimensions: lat
# Calculate the time-series of area-weighted global mean SST & plot:
sst_weighted.mean(dim=("lon", "lat")).plot()
plt.title("Weighted Global Mean SST")