Key Concepts

This page explains the core abstractions of the Slices AI infrastructure.

What is a Docker container?

A Docker container is a self-contained, isolated environment that includes the operating system layer, the CUDA toolkit, Python, and all your Python libraries — packaged together as a single unit. Think of it as a lightweight virtual machine that starts in seconds.

When a job runs on the Slices AI infrastructure, your code executes inside the container, not directly on the host machine. The container guarantees that the same software environment runs on every cluster node, regardless of what other users have installed.

If you usually install packages interactively with pip install or conda install (in a notebook cell or terminal), the container model asks you to front-load that work: your dependencies are either in a pre-built image, or installed at job start via the job’s command. In return you get full reproducibility and no interference with other users’ environments.

Docker image vs. running container

  • A Docker image is a read-only snapshot — the blueprint for your environment. It lives in a container registry (think of it like a package repository for whole environments).

  • A container is a running instance of an image. When your job starts, the Slices AI infrastructure pulls the image and launches a container from it.

The Slices AI infrastructure maintains a set of ready-to-use images for common ML frameworks under GPU Docker Stacks. For most use cases you can pick one of these pre-built images and skip building your own entirely. See Choosing a Docker image for a full comparison of available image sources.

Where does your code live?

Your training scripts are not baked into the image. Instead, you place them on a storage volume that is mounted into the container at job start.

A typical layout for the CIFAR-10 tutorial is:

/project_ghent/
├── cifar10/
│   └── train.py      ← your script (lives in project storage, survives across jobs)
├── data/             ← dataset written here by the script on first run
└── checkpoints/      ← per-epoch checkpoints written here by the script

Your job definition mounts /project_ghent into the container, and the command in the job definition runs your script. The image provides the runtime; your code and data live in storage.

This separation means you can update your script without rebuilding or re-pulling the image.

How does data get in and out?

Data persists only if it is written to a mounted storage volume. Anything written to the container’s own filesystem (e.g. files written to /tmp or /root) is discarded when the job ends.

The Slices AI infrastructure provides six storage types:

| Storage | Persistent | Shared across jobs | Location | Speed | Use case |
|---|---|---|---|---|---|
| /project_ghent | Yes | Yes (Ghent clusters 0–99) | Ghent | Fast | Scripts, datasets, checkpoints, logs |
| /project_antwerp | Yes | Yes (Antwerp clusters 100–199) | Antwerp | Fast | Same as above, Antwerp side |
| /project_madrid | Yes | Yes (Madrid clusters 200–299) | Madrid | Fast | Same as above, Madrid side |
| S3 object store | Yes | Yes (all clusters, both sites) | Mirrored Ghent ↔ Antwerp | Good | Large datasets accessed from any cluster |
| /project_scratch | Yes | Yes (same slave node only) | Local SSD (RAID0) | Very fast | High-frequency I/O; not backed up |
| tmpfs | No | No | RAM | Extremely fast | Hot copy of a dataset within a single job |

S3 object storage is particularly useful for large datasets that need to be read from multiple clusters or both sites. It uses the standard S3 API (compatible with boto3, s3cmd, rclone, etc.). See the S3 storage documentation for setup instructions.
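As an illustration, fetching an object with boto3 might look like the sketch below. The endpoint URL and credential handling are placeholders, not real infrastructure values — see the S3 storage documentation for the actual setup:

```python
def download_dataset(bucket: str, key: str, dest: str,
                     endpoint_url: str = "https://s3.example.internal") -> None:
    """Fetch one object from the S3 store to a local (ideally mounted) path.

    The endpoint URL here is a placeholder; boto3 picks up credentials
    from the environment or its config files.
    """
    import boto3  # imported lazily so the helper can be defined without boto3 installed
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    s3.download_file(bucket, key, dest)
```

The same client works with s3cmd or rclone, since the store speaks the standard S3 API.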

Important

Everything written outside a mounted volume is lost when the job ends. Always write checkpoints, logs, and results to a mounted storage path.
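One way to make this habit hard to break is to route all outputs through a helper that writes under the mounted volume. This is a hedged sketch — the helper name and the JSON checkpoint format are ours, not part of the infrastructure; the default path follows the CIFAR-10 layout above:

```python
import json
import os

def save_checkpoint(state: dict, epoch: int,
                    base_dir: str = "/project_ghent/checkpoints") -> str:
    """Write a per-epoch checkpoint under a mounted storage path.

    Anything written outside a mounted volume (e.g. /tmp) would be
    discarded when the job ends, so base_dir should always point at
    mounted project storage.
    """
    os.makedirs(base_dir, exist_ok=True)
    path = os.path.join(base_dir, f"epoch_{epoch:03d}.json")
    with open(path, "w") as f:
        json.dump(state, f)
    return path
```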

Installing Python packages

For quick installs: pip install works inside a running container — both in JupyterHub notebook cells and in the command field of a job definition. For example:

"command": "pip install einops && python /project_ghent/train.py"

This is fine for one or two small packages in a personal workflow.

For reproducible or complex environments: pre-installing dependencies into a custom Docker image is the better approach. Dependencies are installed once at build time, the image is pushed to a registry, and every job that uses the image starts instantly with a complete, tested environment. See the Job Definition reference for a full walkthrough of building and pushing a custom image.

A custom image is the right choice when:

  • You have many dependencies or a long pip install step that would slow every job start.

  • Your experiment needs exact, reproducible versions pinned across collaborators.

  • You depend on compiled C/C++ extensions or non-Python system packages (apt install).
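A minimal custom image covering these cases might look like the Dockerfile sketch below. The base image name and the pinned packages are illustrative placeholders — start from a real image listed under GPU Docker Stacks:

```dockerfile
# Placeholder base image — substitute a real one from GPU Docker Stacks.
FROM registry.example.internal/gpu-stacks/pytorch:latest

# Non-Python system packages are baked in at build time, not at job start.
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

# Pin exact versions so every collaborator gets the same environment.
RUN pip install --no-cache-dir einops==0.8.0 timm==1.0.9
```

Once built and pushed to a registry, every job that references this image starts with the complete environment and skips the install step entirely.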

What is a job definition?

A job definition is a JSON file that describes everything the Slices AI infrastructure needs to run your experiment:

  • which Docker image to use,

  • how many GPUs and CPUs, and how much RAM, to allocate,

  • which storage volumes to mount,

  • and which command to execute.

You submit the job definition with slices ai submit <file.json> and the infrastructure schedules it on an appropriate cluster node. See the Job Definition reference for all available fields.
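Putting the pieces together, a job definition might look roughly like this sketch. Only the command field appears elsewhere on this page; the other field names and the image path are illustrative placeholders, so check the Job Definition reference for the real schema:

```json
{
  "image": "registry.example.internal/gpu-stacks/pytorch:latest",
  "resources": { "gpus": 1, "cpus": 8, "memory_gb": 32 },
  "mounts": ["/project_ghent"],
  "command": "python /project_ghent/cifar10/train.py"
}
```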