Utilities for Scraping Datasets from APEBench

Use these functions to procedurally scrape datasets from APEBench for use outside the APEBench ecosystem, e.g., to train or test supervised models in PyTorch, or in JAX with a deep learning framework other than Equinox.

APEBench procedurally generates its data from fixed random seeds, relying on JAX's explicit treatment of randomness.

However, this determinism can only be relied on if the code runs with the same JAX version and on the same backend (likely also with the same driver version). Beyond that, some low-level CUDA routines are non-deterministic by default for performance reasons; this non-determinism can be deactivated.
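One way to request deterministic GPU kernels is through an XLA flag that is set before JAX is imported. The sketch below assumes that `--xla_gpu_deterministic_ops=true` is the relevant switch for your JAX/XLA version; the helper name `enable_deterministic_gpu_ops` is hypothetical and not part of APEBench.

```python
import os

# Assumption: this XLA flag is the switch that disables the non-deterministic
# CUDA routines mentioned above. Check the documentation of your JAX version.
DETERMINISTIC_FLAG = "--xla_gpu_deterministic_ops=true"


def enable_deterministic_gpu_ops(env=os.environ):
    """Append the determinism flag to XLA_FLAGS unless it is already set."""
    current = env.get("XLA_FLAGS", "")
    if DETERMINISTIC_FLAG not in current.split():
        # Keep any flags that were already configured by the user.
        env["XLA_FLAGS"] = f"{current} {DETERMINISTIC_FLAG}".strip()
    return env["XLA_FLAGS"]


# Call this before the first `import jax` so the backend picks up the flag.
enable_deterministic_gpu_ops()
```

Deterministic kernels can be slower, so it may be worth enabling the flag only while scraping data and not during training.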

apebench.scraper.scrape_data_and_metadata

scrape_data_and_metadata(
    folder: str = None,
    *,
    scenario: str,
    name: str = "auto",
    **scenario_kwargs
)

Produce train data, test data, and metadata for a given scenario. Optionally write them to disk.

Arguments:

  • folder (str, optional): Folder to save the data and metadata to. If None, nothing is written to disk and the train data, test data, and metadata are returned as JAX arrays and a dictionary, respectively.
  • scenario (str): Name of the scenario to produce data for. Must be a key of apebench.scenarios.scenario_dict.
  • name (str, optional): Name used for the metadata entry and the output file names. If "auto", the name is generated from the scenario's own name and the additional keyword arguments.
  • **scenario_kwargs: Additional arguments to pass to the scenario. All attributes of a scenario can be modified by passing them as keyword arguments.
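To make the "auto" naming concrete, the following standalone sketch mirrors the naming and file-path logic visible in the source listing for this function. The helpers `auto_name` and `output_paths` and the base name `"my_scenario"` are hypothetical illustrations; in APEBench the base name comes from `scenario.get_scenario_name()`.

```python
def auto_name(base_name, **scenario_kwargs):
    """Append the keyword arguments to the base name, as name="auto" does."""
    infos = [f"{key}={value}" for key, value in scenario_kwargs.items()]
    # A "__" separator is only added when extra arguments were passed.
    suffix = "__" + ", ".join(infos) if infos else ""
    return base_name + suffix


def output_paths(folder, name):
    """Return the three files written when a folder is given."""
    return {
        "metadata": f"{folder}/{name}.json",
        "train": f"{folder}/{name}_train.npy",
        "test": f"{folder}/{name}_test.npy",
    }


# "my_scenario" is a placeholder, not a real APEBench scenario name.
name = auto_name("my_scenario", num_points=64, train_seed=1)
paths = output_paths("data", name)
```

Because the keyword arguments are joined with ", ", the generated file names contain commas and spaces; passing an explicit `name` avoids this if it is inconvenient on your file system.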
Source code in apebench/_scraper.py
def scrape_data_and_metadata(
    folder: str = None,
    *,
    scenario: str,
    name: str = "auto",
    **scenario_kwargs,
):
    """
    Produce train data, test data, and metadata for a given scenario. Optionally
    write them to disk.

    **Arguments:**

    - `folder` (str, optional): Folder to save the data and metadata to. If
        None, returns the data and metadata as jax arrays and a dictionary,
        respectively.
    - `scenario` (str): Name of the scenario to produce data for. Must be one of
        `apebench.scenarios.scenario_dict`.
    - `name` (str, optional): Name of the scenario. If "auto", the name is
        automatically generated based on the scenario and its additional
        arguments.
    - `**scenario_kwargs`: Additional arguments to pass to the scenario. All
        attributes of a scenario can be modified by passing them as keyword
        arguments.
    """
    scenario = scenario_dict[scenario](**scenario_kwargs)
    if name == "auto":
        name = scenario.get_scenario_name()

        additional_infos = []
        for key, value in scenario_kwargs.items():
            additional_infos.append(f"{key}={value}")
        if len(additional_infos) > 0:
            additional_infos = ", ".join(additional_infos)
            additional_infos = "__" + additional_infos
        else:
            additional_infos = ""

        name += additional_infos

    logging.info(f"Producing train data for {name}")
    train_data = scenario.get_train_data()
    train_num_nans = jnp.sum(jnp.isnan(train_data))
    if train_num_nans > 0:
        logging.warning(f"Train data contains {train_num_nans} NaNs")

    logging.info(f"Producing test data for {name}")
    test_data = scenario.get_test_data()
    test_num_nans = jnp.sum(jnp.isnan(test_data))
    if test_num_nans > 0:
        logging.warning(f"Test data contains {test_num_nans} NaNs")

    info = asdict(scenario)

    metadata = {
        "name": name,
        "info": info,
    }

    if folder is not None:
        with open(f"{folder}/{name}.json", "w") as f:
            json.dump(metadata, f)
        jnp.save(f"{folder}/{name}_train.npy", train_data)
        jnp.save(f"{folder}/{name}_test.npy", test_data)

        del train_data, test_data
    else:
        return train_data, test_data, metadata
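
The files written when `folder` is given can be read back without APEBench or JAX. The following is a minimal sketch, assuming the `.npy` files are in the standard NumPy format and the file layout matches the listing above; the helper name `load_scraped` is hypothetical and not part of APEBench.

```python
import json

import numpy as np


def load_scraped(folder, name):
    """Load train data, test data, and metadata written by the scraper."""
    with open(f"{folder}/{name}.json") as f:
        metadata = json.load(f)
    # The arrays are plain NumPy arrays after loading; convert them to the
    # tensor type of your framework (e.g. torch.from_numpy) as needed.
    train_data = np.load(f"{folder}/{name}_train.npy")
    test_data = np.load(f"{folder}/{name}_test.npy")
    return train_data, test_data, metadata
```

Here `name` must match the name used when scraping, including the auto-generated suffix if `name="auto"` was used.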