Key Concepts¶
This page explains the core abstractions of the Slices AI infrastructure.
What is a Docker container?¶
A Docker container is a self-contained, isolated environment that includes the operating system layer, the CUDA toolkit, Python, and all your Python libraries — packaged together as a single unit. Think of it as a lightweight virtual machine that starts in seconds.
When a job runs on the Slices AI infrastructure, your code executes inside the container, not directly on the host machine. The container guarantees that the same software environment runs on every cluster node, regardless of what other users have installed.
If you usually install packages interactively with `pip install` or `conda install` (in a notebook
cell or terminal), the container model asks you to front-load that work: your dependencies are either
baked into a pre-built image or installed at job start via the job's command. In return you get full
reproducibility and no interference with other users' environments.
Docker image vs. running container¶
A Docker image is a read-only snapshot — the blueprint for your environment. It lives in a container registry (think of it like a package repository for whole environments).
A container is a running instance of an image. When your job starts, the Slices AI infrastructure pulls the image and launches a container from it.
The Slices AI infrastructure maintains a set of ready-to-use images for common ML frameworks under GPU Docker Stacks. For most use cases you can pick one of these pre-built images and skip building your own entirely. See Choosing a Docker image for a full comparison of available image sources.
Where does your code live?¶
Your training scripts are not baked into the image. Instead, you place them on a storage volume that is mounted into the container at job start.
A typical layout for the CIFAR-10 tutorial is:
```
/project_ghent/
├── cifar10/
│   └── train.py      ← your script (lives in project storage, survives across jobs)
├── data/             ← dataset written here by the script on first run
└── checkpoints/      ← per-epoch checkpoints written here by the script
```
Your job definition mounts /project_ghent into the container, and the command in the job
definition runs your script. The image provides the runtime; your code and data live in storage.
This separation means you can update your script without rebuilding or re-pulling the image.
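A `train.py` following the layout above might plumb its paths like this. This is a minimal sketch: the argument names, the placeholder training loop, and the JSON checkpoint format are illustrative, not part of the tutorial.

```python
import argparse
import json
import os


def main() -> None:
    # Defaults point at the mounted project storage shown above; everything
    # written under /project_ghent survives the job because it is a mounted
    # volume, while the rest of the container filesystem is discarded.
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", default="/project_ghent/data")
    parser.add_argument("--checkpoint-dir", default="/project_ghent/checkpoints")
    parser.add_argument("--epochs", type=int, default=2)
    args = parser.parse_args()

    os.makedirs(args.data_dir, exist_ok=True)
    os.makedirs(args.checkpoint_dir, exist_ok=True)

    for epoch in range(args.epochs):
        # ... actual training step would go here ...
        state = {"epoch": epoch, "loss": None}  # placeholder training state
        path = os.path.join(args.checkpoint_dir, f"epoch_{epoch:03d}.json")
        with open(path, "w") as f:
            json.dump(state, f)


if __name__ == "__main__":
    main()
```

Because the paths are arguments, the same script runs unchanged against a different mount (or a local directory during development).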
How does data get in and out?¶
Data only persists if it is written to a mounted storage volume. Anything written to the
container’s own filesystem (e.g. files written to /tmp or /root) is discarded when the job
ends.
The Slices AI infrastructure provides six storage types:

| Storage | Persistent | Shared across jobs | Location | Speed | Use case |
|---|---|---|---|---|---|
| Project storage (Ghent) | Yes | Yes (Ghent clusters 0–99) | Ghent | Fast | Scripts, datasets, checkpoints, logs |
| Project storage (Antwerp) | Yes | Yes (Antwerp clusters 100–199) | Antwerp | Fast | Same as above, Antwerp side |
| Project storage (Madrid) | Yes | Yes (Madrid clusters 200–299) | Madrid | Fast | Same as above, Madrid side |
| S3 object storage | Yes | Yes (all clusters, both sites) | Mirrored Ghent ↔ Antwerp | Good | Large datasets accessed from any cluster |
| Local scratch SSD | Yes | Yes (same slave node only) | Local SSD (RAID0) | Very fast | High-frequency I/O; not backed up |
| RAM disk | No | No | RAM | Extremely fast | Hot copy of a dataset within a single job |
S3 object storage is particularly useful for large datasets that need to be read from multiple
clusters or both sites. It uses the standard S3 API (compatible with boto3, s3cmd,
rclone, etc.). See the S3 storage documentation for setup instructions.
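As a sketch of boto3 access against S3-compatible storage: the bucket name, object key, and endpoint URL below are placeholders, and the actual endpoint and credentials come from the S3 storage documentation.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI, handy for logging which object a job consumed."""
    return f"s3://{bucket}/{key}"


def download_dataset(bucket: str, key: str, dest_path: str, endpoint_url: str) -> None:
    """Fetch one object from S3-compatible storage onto local or mounted disk."""
    import boto3  # standard AWS SDK; credentials resolve via the usual boto3 config chain

    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    s3.download_file(bucket, key, dest_path)


# Example call (all names are placeholders):
# download_dataset("my-team-bucket", "cifar10/data.tar",
#                  "/project_ghent/data/data.tar",
#                  endpoint_url="https://s3.example.internal")
```

The `endpoint_url` override is what points boto3 at a non-AWS, S3-compatible service instead of AWS itself.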
Important
Everything written outside a mounted volume is lost when the job ends. Always write checkpoints, logs, and results to a mounted storage path.
Installing Python packages¶
For quick installs: `pip install` works inside a running container, both in JupyterHub
notebook cells and in the command field of a job definition. For example:
"command": "pip install einops && python /project_ghent/train.py"
This is fine for one or two small packages in a personal workflow.
For reproducible or complex environments: pre-installing dependencies into a custom Docker image is the better approach. Dependencies are installed once at build time, the image is pushed to a registry, and every job that uses the image starts instantly with a complete, tested environment. See the Job Definition reference for a full walkthrough of building and pushing a custom image.
A custom image is the right choice when:
- You have many dependencies or a long `pip install` step that would slow every job start.
- Your experiment needs exact, reproducible versions pinned across collaborators.
- You depend on compiled C/C++ extensions or non-Python system packages (`apt install`).
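A custom image for such cases might look like the following Dockerfile sketch. The base image reference and the example packages are placeholders; in practice you would start from one of the pre-built GPU Docker Stacks images.

```dockerfile
# Placeholder base image: pick one of the pre-built GPU Docker Stacks images.
FROM <registry>/<gpu-stack-image>:<tag>

# System packages install once at build time, not at every job start.
RUN apt-get update && apt-get install -y --no-install-recommends \
        libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

# Pin exact versions in requirements.txt so every collaborator and every
# job gets the same tested environment.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

Every job that references the pushed image then starts with the complete environment already in place.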
What is a job definition?¶
A job definition is a JSON file that describes everything the Slices AI infrastructure needs to run your experiment:
- which Docker image to use,
- which GPUs, CPUs, and RAM to allocate,
- which storage volumes to mount,
- and which command to execute.
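Put together, a job definition might look roughly like the JSON sketch below. Apart from `command`, which appears elsewhere on this page, the field names here are illustrative assumptions; the Job Definition reference documents the actual schema.

```json
{
    "image": "<registry>/<gpu-stack-image>:<tag>",
    "gpus": 1,
    "cpus": 8,
    "memory": "32G",
    "mounts": ["/project_ghent"],
    "command": "python /project_ghent/cifar10/train.py"
}
```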
You submit the job definition with `slices ai submit <file.json>` and the infrastructure schedules
it on an appropriate cluster node. See the Job Definition reference for all
available fields.