Scrape Datasets from APEBench

APEBench tightly integrates its differentiable solver framework and therefore (procedurally) regenerates the training data for each run. This notebook shows how to export the generated arrays programmatically so that you can use them in other settings, for example with PyTorch.

import jax.numpy as jnp

import apebench

Reading from a scenario

Let's instantiate the default scenario for 1d advection in difficulty mode.

advection_1d_difficulty = apebench.scenarios.difficulty.Advection()

The methods get_train_data() and get_test_data() procedurally generate the corresponding JAX arrays.

train_data = advection_1d_difficulty.get_train_data()
test_data = advection_1d_difficulty.get_test_data()

train_data.shape, test_data.shape

((50, 51, 1, 160), (30, 201, 1, 160))

From here on, you could use your preferred way to serialize the data or use it further in your application.

# jnp.save("advection_1d_train_data.npy", train_data)
# jnp.save("advection_1d_test_data.npy", test_data)
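As a minimal sketch of what "using it further" could look like outside of JAX, the following round-trips an array through a .npy file with NumPy and slices it into one-step (input, target) pairs. It uses a random stand-in array with the same shape as the training data above rather than calling APEBench, and it assumes the axis order is (samples, time, channels, space), which is inferred from the shapes printed above rather than stated explicitly.

```python
import os
import tempfile

import numpy as np

# Stand-in for `train_data`. With APEBench available you would instead use
# np.asarray(advection_1d_difficulty.get_train_data()).
rng = np.random.default_rng(0)
train_data_np = rng.normal(size=(50, 51, 1, 160)).astype(np.float32)

# Round-trip through a .npy file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "advection_1d_train_data.npy")
    np.save(path, train_data_np)
    reloaded = np.load(path)

# Assumed axis order: (samples, time, channels, space).
# Pair every snapshot with its successor and merge the sample and time axes.
inputs = reloaded[:, :-1].reshape(-1, *reloaded.shape[2:])
targets = reloaded[:, 1:].reshape(-1, *reloaded.shape[2:])

print(inputs.shape, targets.shape)
```

The resulting NumPy arrays can be handed to other frameworks, for example via torch.from_numpy in PyTorch.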

Modifying the scenario

The important attributes that affect the size of the generated data are:

num_train_samples
train_temporal_horizon
num_test_samples
test_temporal_horizon

Additionally, num_spatial_dims, num_points, and num_channels affect the trailing axes of the data arrays.

The seed for data generation can be altered by:

train_seed
test_seed

modified_advection_1d_difficulty = apebench.scenarios.difficulty.Advection(
    num_train_samples=81,
    train_temporal_horizon=42,
    train_seed=-1,
    num_test_samples=3,
    test_temporal_horizon=101,
    test_seed=-3,
)

modified_train_data = modified_advection_1d_difficulty.get_train_data()
modified_test_data = modified_advection_1d_difficulty.get_test_data()

modified_train_data.shape, modified_test_data.shape

((81, 43, 1, 160), (3, 102, 1, 160))
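The relation between the attributes and the array shapes can be sketched with a small helper. expected_shape is a hypothetical function written for this illustration, not part of APEBench. The extra entry on the time axis (horizon + 1, presumably the initial condition) is inferred from the shapes printed above, and the generalization to more than one spatial dimension is an assumption.

```python
def expected_shape(num_samples, temporal_horizon, num_channels,
                   num_spatial_dims, num_points):
    """Hypothetical helper: predict the shape of a generated data array.

    Assumes the layout (samples, time, channels, *space), with one more time
    entry than the temporal horizon and `num_points` along each spatial axis.
    """
    spatial_axes = (num_points,) * num_spatial_dims
    return (num_samples, temporal_horizon + 1, num_channels) + spatial_axes


# Values taken from the modified scenario above.
train_shape = expected_shape(81, 42, 1, 1, 160)
test_shape = expected_shape(3, 101, 1, 1, 160)
print(train_shape, test_shape)
```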

Exporting Metadata

To get additional information on the data, it can be helpful to extract the attributes of the scenario. Since each scenario is a dataclass, its members can easily be converted into a dictionary.

Let's first print the representation of the scenario.

modified_advection_1d_difficulty

Advection(
  num_spatial_dims=1,
  num_points=160,
  num_channels=1,
  ic_config='fourier;5;true;true',
  num_warmup_steps=0,
  num_train_samples=81,
  train_temporal_horizon=42,
  train_seed=-1,
  num_test_samples=3,
  test_temporal_horizon=101,
  test_seed=-3,
  optim_config='adam;10_000;warmup_cosine;0.0;1e-3;2_000',
  batch_size=20,
  num_trjs_returned=1,
  record_loss_every=100,
  vlim=(-1.0, 1.0),
  report_metrics='mean_nRMSE',
  callbacks='',
  gammas=(0.0, -4.0, 0.0, 0.0, 0.0),
  coarse_proportion=0.5,
  advection_gamma=-4.0
)

Then import the asdict function from the dataclasses module and convert the scenario to a dictionary.

from dataclasses import asdict

modified_metadata = asdict(modified_advection_1d_difficulty)

modified_metadata

{'num_spatial_dims': 1,
 'num_points': 160,
 'num_channels': 1,
 'ic_config': 'fourier;5;true;true',
 'num_warmup_steps': 0,
 'num_train_samples': 81,
 'train_temporal_horizon': 42,
 'train_seed': -1,
 'num_test_samples': 3,
 'test_temporal_horizon': 101,
 'test_seed': -3,
 'optim_config': 'adam;10_000;warmup_cosine;0.0;1e-3;2_000',
 'batch_size': 20,
 'num_trjs_returned': 1,
 'record_loss_every': 100,
 'vlim': (-1.0, 1.0),
 'report_metrics': 'mean_nRMSE',
 'callbacks': '',
 'gammas': (0.0, -4.0, 0.0, 0.0, 0.0),
 'coarse_proportion': 0.5,
 'advection_gamma': -4.0}

You can dump this data to a JSON file or use it in any other way you see fit.

# import json
# with open("modified_advection_1d_difficulty.json", "w") as f:
#     json.dump(modified_metadata, f)
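One detail worth knowing when dumping the metadata: JSON has no tuple type, so tuple-valued entries such as vlim and gammas come back as lists after a round trip. The sketch below demonstrates this on a hand-copied subset of the dictionary shown above. The conversion back to tuples is just one possible approach and assumes that every list in the metadata was originally a tuple.

```python
import json

# Subset of the metadata dictionary shown above, copied by hand.
metadata_subset = {
    "num_points": 160,
    "train_seed": -1,
    "vlim": (-1.0, 1.0),
    "gammas": (0.0, -4.0, 0.0, 0.0, 0.0),
}

# Serialize to a JSON string and parse it again.
restored = json.loads(json.dumps(metadata_subset))
print(type(restored["vlim"]))  # tuples are restored as lists

# Optionally convert lists back to tuples (assumes all lists were tuples).
restored_tuples = {
    key: tuple(value) if isinstance(value, list) else value
    for key, value in restored.items()
}
```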

Using the scraping API

APEBench provides a structured way to get train data, test data, and metadata from a scenario.

train_data_ks, test_data_ks, meta_data_ks = apebench.scraper.scrape_data_and_metadata(
    scenario="diff_ks"
)

train_data_ks.shape, test_data_ks.shape

((50, 51, 1, 160), (30, 201, 1, 160))

meta_data_ks

{'name': '1d_diff_ks',
 'info': {'num_spatial_dims': 1,
  'num_points': 160,
  'num_channels': 1,
  'ic_config': 'fourier;5;true;true',
  'num_warmup_steps': 500,
  'num_train_samples': 50,
  'train_temporal_horizon': 50,
  'train_seed': 0,
  'num_test_samples': 30,
  'test_temporal_horizon': 200,
  'test_seed': 773,
  'optim_config': 'adam;10_000;warmup_cosine;0.0;1e-3;2_000',
  'batch_size': 20,
  'num_trjs_returned': 1,
  'record_loss_every': 100,
  'vlim': (-6.5, 6.5),
  'report_metrics': 'mean_nRMSE,mean_correlation',
  'callbacks': '',
  'gammas': (0.0, 0.0, -1.2, 0.0, -15.0),
  'deltas': (0.0, 0.0, -6.0),
  'num_substeps': 1,
  'coarse_proportion': 0.5,
  'order': 2,
  'dealiasing_fraction': 0.6666666666666666,
  'num_circle_points': 16,
  'circle_radius': 1.0,
  'gradient_norm_delta': -6.0,
  'diffusion_gamma': -1.2,
  'hyp_diffusion_gamma': -15.0}}

You can provide any keyword argument that matches the attributes of the scenario to modify the produced data. Let's decrease the resolution.

apebench.scraper.scrape_data_and_metadata(scenario="diff_ks", num_points=64)[0].shape

(50, 51, 1, 64)

Having the scraper write to disk

If you provide a folder name, the scraper does not return the data but writes it to disk as .npy files and dumps the metadata as a JSON file.

# apebench.scraper.scrape_data_and_metadata(".", scenario="diff_ks")
# Creates the following files:
# 1d_diff_ks_train.npy
# 1d_diff_ks_test.npy
# 1d_diff_ks.json
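To read such an export back in, for instance in a project that does not depend on APEBench, a loader along the following lines could be used. load_scraped is a hypothetical helper written for this sketch. It relies on the file naming pattern listed above, and the demo writes small stand-in files instead of running the scraper. The assumption that the JSON file contains the same "name" and "info" keys as the returned metadata dictionary is not confirmed here.

```python
import json
import os
import tempfile

import numpy as np


def load_scraped(folder, name):
    """Hypothetical helper: load <name>_train.npy, <name>_test.npy, <name>.json."""
    train = np.load(os.path.join(folder, f"{name}_train.npy"))
    test = np.load(os.path.join(folder, f"{name}_test.npy"))
    with open(os.path.join(folder, f"{name}.json")) as f:
        meta = json.load(f)
    return train, test, meta


# Demo with tiny stand-in files that mimic the naming pattern.
with tempfile.TemporaryDirectory() as folder:
    np.save(os.path.join(folder, "1d_diff_ks_train.npy"),
            np.zeros((2, 3, 1, 8), dtype=np.float32))
    np.save(os.path.join(folder, "1d_diff_ks_test.npy"),
            np.zeros((1, 5, 1, 8), dtype=np.float32))
    with open(os.path.join(folder, "1d_diff_ks.json"), "w") as f:
        json.dump({"name": "1d_diff_ks", "info": {"num_points": 8}}, f)

    train, test, meta = load_scraped(folder, "1d_diff_ks")

print(train.shape, test.shape, meta["name"])
```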

Creating a collection of datasets

You can loop over a list of dictionaries that contain scenarios and additional attributes to create a collection of datasets.

Each scenario name must match a short identifier listed in apebench.scenarios.scenario_dict.

# scenario_list = [
#     {"scenario": "diff_adv", "num_train_samples": 81},
#     {"scenario": "diff_ks", "num_points": 64},
# ]

# for scenario in scenario_list:
#     apebench.scraper.scrape_data_and_metadata(".", **scenario)

Export of curated lists

APEBench comes with curated lists of scenarios, for example the set of datasets used in the original APEBench paper.

The export for CURATION_APEBENCH_V1 should take ~3min on a modern GPU and should produce ~40GB of data.

# from tqdm import tqdm
# import os

# DATA_PATH = "data"

# os.makedirs(DATA_PATH, exist_ok=True)

# for config in tqdm(apebench.scraper.CURATION_APEBENCH_V1):
#     apebench.scraper.scrape_data_and_metadata(DATA_PATH, **config)

100%|██████████| 46/46 [02:59<00:00,  3.91s/it]
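Before starting a large export, it can help to estimate how much disk space a single dataset needs. The sketch below computes the raw array size from a shape, assuming 4 bytes per value (float32, which is an assumption about the stored dtype) and ignoring the small .npy header. array_bytes is a hypothetical helper, and the shapes are those of the default 1D scenarios shown earlier.

```python
import math


def array_bytes(shape, bytes_per_value=4):
    """Hypothetical helper: raw size in bytes of an array with this shape.

    Assumes float32 storage (4 bytes per value) and ignores the file header.
    """
    return math.prod(shape) * bytes_per_value


train_bytes = array_bytes((50, 51, 1, 160))
test_bytes = array_bytes((30, 201, 1, 160))
total_megabytes = (train_bytes + test_bytes) / 1e6

print(f"train: {train_bytes} B, test: {test_bytes} B, "
      f"total: {total_megabytes:.1f} MB")
```

Under these assumptions a default 1D dataset occupies only a few megabytes, so most of a multi-gigabyte export would come from scenarios with larger arrays.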