Creating Model CSV Files with intake-esgf

This tutorial shows how to use intake-esgf to download CMIP data and create the CSV files that ilamb3 requires to run. To stay as general as possible, ilamb3 does not depend directly on intake-esgf, but intake-esgf is a useful tool for easily accessing data hosted on the Earth System Grid Federation (ESGF). For a more thorough explanation of that package and all its options, please consult the intake-esgf documentation.

intake-esgf catalogs are initialized empty and are populated by running a faceted search.

from intake_esgf import ESGFCatalog

cat = ESGFCatalog().search(
    experiment_id="historical",
    source_id="CanESM5",
    variable_id=["gpp", "areacella", "sftlf"],
    frequency=["mon", "fx"],
    file_start="1980-01",
    file_end="2016-01",
)
cat
Summary information for 195 results:
mip_era           [CMIP6]
activity_drs      [CMIP]
institution_id    [CCCma]
source_id         [CanESM5]
experiment_id     [historical]
member_id         [r10i1p1f1, r10i1p2f1, r11i1p1f1, r11i1p2f1, r...
table_id          [Lmon, fx]
variable_id       [gpp, areacella, sftlf]
grid_label        [gn]
dtype: object

The catalog summary shows many ensemble members, but we only want a single member for this run. The catalog's remove_ensembles() function removes all members except the smallest, which here leaves r1i1p1f1.
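
To make "smallest" concrete, here is a minimal sketch of one plausible ordering: comparing the integer r, i, p, and f indices of each member_id. This ordering is an assumption made for illustration, not a description of how intake-esgf implements remove_ensembles(), and the helper name member_sort_key is hypothetical. A plain string sort would place r10i1p1f1 ahead of r1i1p1f1, which is why the indices are compared as numbers.

```python
import re

# Member IDs that appear in the catalog summaries in this tutorial
members = ["r10i1p1f1", "r10i1p2f1", "r11i1p1f1", "r11i1p2f1", "r1i1p1f1"]


def member_sort_key(member_id: str) -> tuple[int, ...]:
    """Hypothetical helper: split 'r1i1p1f1' into the integers (1, 1, 1, 1)."""
    match = re.fullmatch(r"r(\d+)i(\d+)p(\d+)f(\d+)", member_id)
    if match is None:
        raise ValueError(f"Unexpected member_id format: {member_id}")
    return tuple(int(group) for group in match.groups())


# Numeric comparison (assumed meaning of "smallest")
smallest = min(members, key=member_sort_key)
print(smallest)

# Plain string comparison, shown only for contrast
print(sorted(members)[0])
```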

cat.remove_ensembles()
cat
Summary information for 3 results:
mip_era           [CMIP6]
activity_drs      [CMIP]
institution_id    [CCCma]
source_id         [CanESM5]
experiment_id     [historical]
member_id         [r1i1p1f1]
table_id          [Lmon, fx]
variable_id       [gpp, areacella, sftlf]
grid_label        [gn]
dtype: object

Once the catalog represents the data that you wish to download and use in your benchmarking study, ask the catalog for a dictionary of paths. This downloads any files that are not already available locally and returns where each one is stored.

dpd = cat.to_path_dict(minimal_keys=False)
dpd
{'CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.Lmon.gpp.gn': [PosixPath('/home/docs/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Lmon/gpp/gn/v20190429/gpp_Lmon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc')],
 'CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.fx.areacella.gn': [PosixPath('/home/docs/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/areacella/gn/v20190429/areacella_fx_CanESM5_historical_r1i1p1f1_gn.nc')],
 'CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.fx.sftlf.gn': [PosixPath('/home/docs/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/sftlf/gn/v20190429/sftlf_fx_CanESM5_historical_r1i1p1f1_gn.nc')]}

We have used the keyword argument minimal_keys=False so that the keys of the dictionary contain all the facets which define a unique dataset. We can use these keys and the known order they occur in to create a pandas DataFrame with the required columns for ilamb3.
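
As a small illustration of how a single key maps onto facet names, the sketch below splits the gpp key from the output above. The helper name key_to_facets is hypothetical, and the length check is an added safeguard rather than part of intake-esgf: zip stops at the shorter input, so a key with an unexpected number of fields would otherwise be mislabeled silently.

```python
# The order in which facets appear in the keys
KEY_PATTERN = [
    "mip_era",
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
]


def key_to_facets(key: str, separator: str = ".") -> dict[str, str]:
    """Hypothetical helper: label each field of a key with its facet name."""
    values = key.split(separator)
    if len(values) != len(KEY_PATTERN):
        raise ValueError(
            f"Expected {len(KEY_PATTERN)} fields but found {len(values)} in {key!r}"
        )
    return dict(zip(KEY_PATTERN, values))


facets = key_to_facets("CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.Lmon.gpp.gn")
print(facets["variable_id"], facets["table_id"])
```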

import pandas as pd

# The order in which facets appear in the keys
KEY_PATTERN = [
    "mip_era",
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
]
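# Create one row of the dataframe per file
rows = []
for key, paths in dpd.items():
    facets = dict(zip(KEY_PATTERN, key.split(".")))
    for path in paths:
        # Build a new dict for each file so that rows from the same
        # dataset do not share (and overwrite) a single "path" entry
        rows.append({**facets, "path": str(path)})
df = pd.DataFrame(rows)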
# Export as a CSV
df.to_csv("CanESM5.csv")

This produces a CSV file with one row per downloaded file, containing the facet columns listed in KEY_PATTERN plus a path column.

While you need not store each model’s output as a separate CSV file, this is a useful convention so that including/excluding any model from a study is simple.
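
One way to take advantage of this convention is to combine the per-model files on demand with pandas. The sketch below writes two tiny stand-in CSV files to a temporary directory (the second model name and all paths are placeholders, not real data) and then loads only the selected models. The helper load_models is hypothetical, and how ilamb3 itself consumes these files is not shown here.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_models(csv_dir: Path, models: list[str]) -> pd.DataFrame:
    """Hypothetical helper: read <model>.csv for each requested model and stack them."""
    # index_col=0 because to_csv writes the row index as the first column by default
    frames = [pd.read_csv(csv_dir / f"{model}.csv", index_col=0) for model in models]
    return pd.concat(frames, ignore_index=True)


# Stand-in files for illustration only; the paths are placeholders
csv_dir = Path(tempfile.mkdtemp())
for model in ["CanESM5", "ModelB"]:
    pd.DataFrame(
        [{"source_id": model, "variable_id": "gpp", "path": f"/placeholder/{model}/gpp.nc"}]
    ).to_csv(csv_dir / f"{model}.csv")

# Excluding a model is just a matter of leaving it out of the list
df_all = load_models(csv_dir, ["CanESM5", "ModelB"])
df_one = load_models(csv_dir, ["CanESM5"])
print(df_all["source_id"].tolist())
```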