Creating Model CSV Files with intake-esgf

This tutorial shows how to use intake-esgf to download CMIP data and create the CSV files that ilamb3 requires for execution. To stay as general as possible, ilamb3 does not depend directly on intake-esgf, but the package is a useful tool for accessing data hosted on the Earth System Grid Federation (ESGF). For a more thorough explanation of that package and all its options, please consult the intake-esgf documentation.

intake-esgf catalogs start out empty and are populated by running a faceted search.

from intake_esgf import ESGFCatalog

cat = ESGFCatalog().search(
    experiment_id="historical",
    source_id="CanESM5",
    variable_id=["gpp", "areacella", "sftlf"],
    frequency=["mon", "fx"],
    file_start="1980-01",
    file_end="2016-01",
)
cat

Summary information for 195 results:
mip_era                                                     [CMIP6]
activity_drs                                                 [CMIP]
institution_id                                              [CCCma]
source_id                                                 [CanESM5]
experiment_id                                          [historical]
member_id         [r10i1p1f1, r10i1p2f1, r11i1p1f1, r11i1p2f1, r...
table_id                                                 [Lmon, fx]
variable_id                                 [gpp, areacella, sftlf]
grid_label                                                     [gn]
dtype: object

From the catalog summary, we see that the search returned many ensemble members, but we only want a single member for this run. The catalog provides a function, remove_ensembles(), that removes all ensemble members except the smallest (here r1i1p1f1).

cat.remove_ensembles()
cat

Summary information for 3 results:
mip_era                           [CMIP6]
activity_drs                       [CMIP]
institution_id                    [CCCma]
source_id                       [CanESM5]
experiment_id                [historical]
member_id                      [r1i1p1f1]
table_id                       [Lmon, fx]
variable_id       [gpp, areacella, sftlf]
grid_label                           [gn]
dtype: object
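
To make "smallest" concrete: member identifiers such as r10i1p1f1 encode four integers (realization, initialization, physics, forcing), and plain string sorting would place r10i1p1f1 before r1i1p1f1. The following is a minimal sketch of a numeric ordering, assuming "smallest" means the lowest (r, i, p, f) tuple. The helper `member_sort_key` is hypothetical and written only for illustration; it is not part of intake-esgf, and the library's own selection logic may differ.

```python
import re


def member_sort_key(member_id: str) -> tuple[int, ...]:
    """Return the (r, i, p, f) integers of a label such as 'r1i1p1f1'."""
    match = re.fullmatch(r"r(\d+)i(\d+)p(\d+)f(\d+)", member_id)
    if match is None:
        raise ValueError(f"unexpected member_id format: {member_id!r}")
    return tuple(int(part) for part in match.groups())


# Member labels taken from the catalog summaries shown above
members = ["r10i1p1f1", "r10i1p2f1", "r11i1p1f1", "r11i1p2f1", "r1i1p1f1"]

# Compare numerically rather than as strings
smallest = min(members, key=member_sort_key)
print(smallest)  # r1i1p1f1
```

Sorting the same list as plain strings would put r10i1p1f1 first, which is why the sketch compares integer tuples instead.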

Once the catalog represents the data that we wish to download and use in the benchmarking study, we ask the catalog for a dictionary of paths.

dpd = cat.to_path_dict(minimal_keys=False)
dpd

{'CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.Lmon.gpp.gn': [PosixPath('/home/docs/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Lmon/gpp/gn/v20190429/gpp_Lmon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc')],
 'CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.fx.areacella.gn': [PosixPath('/home/docs/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/areacella/gn/v20190429/areacella_fx_CanESM5_historical_r1i1p1f1_gn.nc')],
 'CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.fx.sftlf.gn': [PosixPath('/home/docs/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/sftlf/gn/v20190429/sftlf_fx_CanESM5_historical_r1i1p1f1_gn.nc')]}

We have used the keyword argument minimal_keys=False so that the keys of the dictionary contain all the facets which define a unique dataset. We can use these keys and the known order they occur in to create a pandas DataFrame with the required columns for ilamb3.

import pandas as pd

# The order in which facets appear in the keys
KEY_PATTERN = [
    "mip_era",
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
]
# Create one row per file. Each row gets its own dict so that keys with
# several files do not end up sharing (and overwriting) a single "path".
rows = []
for key, paths in dpd.items():
    facets = dict(zip(KEY_PATTERN, key.split(".")))
    for path in paths:
        rows.append({**facets, "path": str(path)})
df = pd.DataFrame(rows)
# Export as a CSV
df.to_csv("CanESM5.csv")

This produces a CSV file with one row per file, where each row holds the facets listed in KEY_PATTERN and the path to the file.

While you need not store each model’s output as a separate CSV file, this is a useful convention so that including/excluding any model from a study is simple.
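
One way this convention could be used is to keep one CSV per model and concatenate only the ones wanted for a given study. The sketch below is illustrative and uses only pandas. The second model name (`ModelB`), the file paths, and the `load_models` helper are hypothetical placeholders, and the small CSV files are written to a temporary directory only so that the example is self-contained. How the combined table is then handed to ilamb3 is outside the scope of this sketch.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_models(csv_dir: Path, exclude: tuple[str, ...] = ()) -> pd.DataFrame:
    """Concatenate every per-model CSV in csv_dir, skipping models in exclude."""
    frames = [
        pd.read_csv(csv_file)
        for csv_file in sorted(csv_dir.glob("*.csv"))
        if csv_file.stem not in exclude
    ]
    return pd.concat(frames, ignore_index=True)


# Write two tiny stand-in CSV files (placeholder rows, not real ESGF output)
csv_dir = Path(tempfile.mkdtemp())
for model in ["CanESM5", "ModelB"]:
    pd.DataFrame(
        [{"source_id": model, "variable_id": "gpp", "path": f"/data/{model}/gpp.nc"}]
    ).to_csv(csv_dir / f"{model}.csv", index=False)

# Include every model, then drop one simply by naming it
all_models = load_models(csv_dir)
without_b = load_models(csv_dir, exclude=("ModelB",))
print(all_models["source_id"].tolist())
print(without_b["source_id"].tolist())
```

Because each model lives in its own file, excluding it is a matter of naming the file rather than filtering rows out of one large table.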