Getting Started

This tutorial walks you from a fresh account to a running PyTorch training job on the Slices AI infrastructure. It uses a single running example throughout: training a ResNet-18 on CIFAR-10.

By the end you will have:

  • verified GPU access interactively in JupyterHub,

  • submitted a non-interactive batch training job via the CLI,

  • and retrieved the results from shared project storage.

Note

How the system works in one paragraph. The Slices AI infrastructure runs your code inside a Docker container — a self-contained bundle that includes the OS, CUDA toolkit, Python, and all your libraries. You pick a pre-built image (for example, one that already contains PyTorch), attach persistent storage volumes for your data and checkpoints, and specify the command to run. The system schedules the job on a GPU node, runs it, and stores the output. Nothing is installed interactively at runtime; your environment is fully defined by the image you choose.

New to Docker? Containers and images are explained in more detail on the concepts page.

Step 1 — Get a project

The Slices AI infrastructure requires you to be a member of a project on the Slices Portal.

  • Request a new project (e.g. a project dedicated to your PhD research), or

  • Request to join an existing project if you are collaborating with others.

Each project gets its own storage space on the Slices AI infrastructure, shared by all project members.

Step 2 — Explore the Slices AI infrastructure interactively with JupyterHub

Before writing a batch job, get familiar with the environment interactively. JupyterHub launches a notebook session directly on the GPU hardware — the same hardware your batch jobs will use — so you can prototype, inspect data, and iterate quickly without the submit-wait-check cycle.

  1. Go to https://jupyterhub.ai.slices-ri.eu and log in.

  2. On the spawn page, select:

    • Project: your project.

    • Docker image: PyTorch Stack - CUDA12 image (gitlab.ilabt.imec.be:4567/ilabt/gpu-docker-stacks/pytorch-notebook:cuda12.6-latest)

    • Resources: 1 GPU, 2 CPUs, 4 GB RAM.

    • Storage: mount /project_ghent (Ghent-based clusters) or /project_antwerp (Antwerp-based clusters). This persistent storage is where you will save scripts and outputs.

  3. Click Start.

  4. In a notebook cell, confirm that PyTorch can see the GPU you requested:

    import torch
    print("CUDA available:", torch.cuda.is_available())
    print("GPU name:", torch.cuda.get_device_name(0))
    

    Expected output:

    CUDA available: True
    GPU name: NVIDIA GeForce GTX 1080 Ti
    

    If CUDA available is False, double-check that you selected at least 1 GPU on the spawn page and that you used a GPU-enabled image. The short diagnostic snippet after this list can help pinpoint the cause.

  5. Try a minimal training step to confirm end-to-end GPU execution:

    import torch, torchvision
    
    device = torch.device("cuda")
    model = torchvision.models.resnet18().to(device)
    x = torch.randn(4, 3, 224, 224, device=device)
    y = model(x)
    print("Forward pass shape:", y.shape)  # torch.Size([4, 1000])
    

    If this runs without error you have confirmed GPU access end-to-end. You can now close the JupyterHub session (File → Hub Control Panel → Stop My Server) and move on.
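
If the CUDA check in step 4 above printed False, the snippet below prints the details that usually pinpoint the cause. This is an illustrative sketch rather than part of the tutorial files, and it only uses standard PyTorch calls:

import torch

# Diagnostic details for a failing GPU check (illustrative sketch)
print("PyTorch version:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)        # None means a CPU-only PyTorch build
print("Visible GPUs:", torch.cuda.device_count())    # 0 means no GPU was allocated to the session

A CPU-only build points to a wrong image choice; zero visible GPUs points to the resource selection on the spawn page.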

Step 3 — Write your training script

Note

For the rest of the tutorial, we will use the cheap 1080Ti GPUs, which are available in cluster 4 in Ghent. This means we will use the /project_ghent persistent storage for all our files.

You can adapt the tutorial to other clusters and GPUs, but you will need to adjust the storage paths.

Save the script below as /project_ghent/cifar10/train.py from a JupyterHub terminal (File → New → Terminal), or transfer it with rsync/SFTP from your laptop (see the CLI reference for file transfer commands).

train.py — ResNet-18 on CIFAR-10
"""
ResNet-18 on CIFAR-10 — Slices AI infrastructure tutorial example
------------------------------------------------------------------
Saves checkpoints to /project_ghent/checkpoints/ after every epoch.
Resumes automatically from the latest checkpoint on restart.
Downloads CIFAR-10 to /project_ghent/data/ on first run.

Usage (inside the job container):
    python /project_ghent/cifar10/train.py
"""

import os
import glob
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as T

# ── Paths (all under the mounted project storage) ──────────────────────────────
DATA_DIR = "/project_ghent/data"
CKPT_DIR = "/project_ghent/checkpoints"
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(CKPT_DIR, exist_ok=True)

# ── Hyperparameters ────────────────────────────────────────────────────────────
EPOCHS = 10
BATCH_SIZE = 128
LR = 0.01

# ── Data ───────────────────────────────────────────────────────────────────────
transform_train = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
transform_val = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

train_set = torchvision.datasets.CIFAR10(
    root=DATA_DIR, train=True, download=True, transform=transform_train)
val_set = torchvision.datasets.CIFAR10(
    root=DATA_DIR, train=False, download=True, transform=transform_val)

train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=True)
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=256, shuffle=False, num_workers=4, pin_memory=True)

# ── Model ──────────────────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = torchvision.models.resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=LR,
                      momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

# ── Resume from checkpoint if one exists ──────────────────────────────────────
start_epoch = 0
checkpoints = sorted(glob.glob(os.path.join(CKPT_DIR, "epoch_*.pt")))
if checkpoints:
    latest = checkpoints[-1]
    print(f"Resuming from {latest}")
    state = torch.load(latest, map_location=device)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    start_epoch = state["epoch"] + 1

# ── Training loop ──────────────────────────────────────────────────────────────
for epoch in range(start_epoch, EPOCHS):
    model.train()
    total_loss = 0.0
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)

    # Validation
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.size(0)
    val_acc = correct / total

    scheduler.step()

    # Save checkpoint
    ckpt_path = os.path.join(CKPT_DIR, f"epoch_{epoch:02d}.pt")
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, ckpt_path)

    print(
        f"Epoch {epoch+1}/{EPOCHS} | loss: {avg_loss:.4f} | val_acc: {val_acc:.4f} | checkpoint saved")

print("Training complete.")

The script:

  • Downloads CIFAR-10 to /project_ghent/data/ on first run (subsequent jobs reuse the cached download).

  • Saves a checkpoint after every epoch to /project_ghent/checkpoints/.

  • Resumes automatically from the latest checkpoint if one exists, so a job restart does not lose progress.

  • Logs one line per epoch to stdout, which the Slices AI infrastructure captures and makes available via slices ai output.

Important

Everything saved outside a mounted storage path (e.g. outside /project_ghent) is lost when the job ends. Always write checkpoints, logs, and results to the mounted project storage.

Step 4 — Install and configure the CLI

The CLI is the primary tool for submitting and managing batch jobs.

Install

pip install slices-cli-ai --extra-index-url=https://doc.slices-ri.eu/pypi/

Verify the installation:

❯ slices ai --help

Log in

Authenticate with the Slices portal. This opens a browser window where you complete the OAuth flow:

❯ slices auth login
Please go to https://portal.slices-ri.eu/oauth/authorize_device?user_code=FFWK-BJTB and login to continue.
Successfully logged in as 'your_username'
💾 Saved to /home/your_username/.slices/auth.json

Your token is saved locally and reused for subsequent commands. You can verify you are logged in at any time with:

❯ slices auth show
Logged in as 'your_username'

Select your project

All CLI commands operate within the currently selected project. Set it once and it persists across sessions:

❯ slices project use my_phd_project
The current project was set to 'my_phd_project'.

To see which projects you belong to, run slices project list. To check the currently active project, run slices project show. You can switch projects at any time by running slices project use again.

See the full CLI reference for all commands.

Step 5 — Write the job definition

A job definition is a JSON file that tells the Slices AI infrastructure which image to use, which resources to allocate, which storage to mount, and what command to run.

Save the following as cifar10-job.json on your laptop:

cifar10-job.json
{
    "name": "cifar10-resnet18",
    "description": "ResNet-18 on CIFAR-10 — Slices AI infrastructure tutorial example",
    "request": {
        "resources": {
            "gpus": 1,
            "cpus": 4,
            "cpuMemoryGb": 16,
            "gpuModel": [
                "1080Ti"
            ]
        },
        "docker": {
            "image": "gitlab.ilabt.imec.be:4567/ilabt/gpu-docker-stacks/pytorch-notebook:cuda12.6-latest",
            "command": "bash -c 'python /project_ghent/cifar10/train.py 2>&1 | tee /project_ghent/logs/cifar10-${GPULAB_JOB_ID}.log'",
            "storage": [
                {
                    "hostPath": "/project_ghent",
                    "containerPath": "/project_ghent"
                }
            ]
        }
    }
}

Key fields to understand:

  • image: the same PyTorch image you used in JupyterHub, so your script runs in an identical environment.

  • command: creates the log directory if needed, then runs train.py, duplicating its output to a log file on project storage as well as to stdout (which slices ai output reads).

  • storage: mounts /project_ghent inside the container — where the script reads data and writes checkpoints and logs.

  • gpus / cpus / cpuMemoryGb: request what you need. This CIFAR-10 training run fits comfortably within 1 GPU, 4 CPUs, and 16 GB of RAM.

For a complete description of every field, see the Job Definition reference.
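
If you edit the JSON by hand, a quick local check catches syntax errors before you submit. The sketch below is only an illustration: it assumes cifar10-job.json sits in your current directory and it inspects just the fields used in this tutorial, so it is not an official validator.

import json

# Quick local sanity check for cifar10-job.json (illustrative, not an official validator)
with open("cifar10-job.json") as f:
    job = json.load(f)  # raises an error with line/column details if the JSON is malformed

docker = job["request"]["docker"]
print("Image:  ", docker["image"])
print("Command:", docker["command"])
print("Mounts: ", [s["hostPath"] for s in docker["storage"]])
print("GPUs:   ", job["request"]["resources"]["gpus"])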

Fair use — a quick reminder before you submit

The Slices AI infrastructure is a shared resource. Please keep these ground rules in mind:

  • Only request what you need. Over-requesting GPUs or memory delays other users’ jobs.

  • Start computation automatically. Your job command must start the actual computation without any manual interaction — the infrastructure runs it unattended the moment resources are available.

  • Let your job exit when it is done. Write your job command so that it terminates naturally once computation finishes. Do not leave jobs running idle — this blocks resources for others.

  • Use checkpoints for long runs. Hardware failures happen. The tutorial script already does this; keep the pattern in your own code.

  • Split large experiments into smaller jobs where possible: shorter jobs queue and complete faster.

Full details: Fair Use Policy.

Step 6 — Submit the job and monitor it

Submit

❯ slices ai submit cifar10-job.json
✨ Created Job a1b2c3d4-e5f6-7890-abcd-ef1234567890

The returned UUID is your job ID. You only need to type enough characters to uniquely identify it (e.g. a1b2c3d4).

Check status

Via the CLI:

❯ slices ai show a1b2c3d4
      Job ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
        Name: cifar10-resnet18
      Status: RUNNING
Docker image: gitlab.ilabt.imec.be:4567/ilabt/gpu-docker-stacks/pytorch-notebook:cuda12.6-latest

Via the website: open https://ai.slices-ri.eu and find your job tile. The current status is shown directly on the tile.

A job passes through the following states: QUEUED → ASSIGNED → STARTING → RUNNING → FINISHED. If resources are busy, the job waits in QUEUED until a slot is available.

Watch live output

Via the CLI:

❯ slices ai output a1b2c3d4
Epoch 1/10 | loss: 1.8432 | val_acc: 0.3521 | checkpoint saved
Epoch 2/10 | loss: 1.5201 | val_acc: 0.4318 | checkpoint saved
...

Via the website: click the job tile to open the job detail view. The Output tab shows the captured stdout of your job, updated live as the job runs.

Monitor resource usage

Via the website: click the job tile to open the job detail view. The Usage Graphs tab shows GPU, CPU, and memory utilisation over time. Aim for GPU utilisation above 80% for efficient use of shared resources. After the job finishes, the General Info tab shows aggregated CPU, GPU, and memory statistics for the full run.

Console access

Via the website: click the job tile to open the job detail view. The Console tab gives you a terminal directly inside the running container. This is useful for debugging, inspecting logs, or manually running commands.

Via the CLI: you can SSH into the running container with slices ai ssh:

❯ slices ai ssh a1b2c3d4
root@267721d7f116:~# nvidia-smi
Fri May  8 11:55:22 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     On  |   00000000:86:00.0 Off |                  N/A |
|  0%   27C    P8              8W /  250W |       3MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Retrieve results

After the job finishes, your checkpoints and log file are already in /project_ghent. Open JupyterHub, spawn a session with /project_ghent mounted, and copy the files you need from /project_ghent/checkpoints/ and /project_ghent/logs/ via the file browser.
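
If you prefer to check the result programmatically before copying anything, the checkpoints can be loaded with the same keys the training script saved ("epoch", "model", "optimizer", "scheduler"). A minimal sketch, to be run in a session with /project_ghent mounted:

import glob
import torch
import torchvision

# Load the most recent checkpoint written by train.py (run with /project_ghent mounted)
latest = sorted(glob.glob("/project_ghent/checkpoints/epoch_*.pt"))[-1]
state = torch.load(latest, map_location="cpu")
print("Loaded", latest, "from epoch", state["epoch"] + 1)

model = torchvision.models.resnet18(num_classes=10)
model.load_state_dict(state["model"])
model.eval()  # ready for evaluation or export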

Tip

Once you have copied what you need, delete the data/ and checkpoints/ folders to free up space on the shared /project_ghent storage:

rm -rf /project_ghent/data /project_ghent/checkpoints

Step 7 — When something goes wrong

Job stays QUEUED — the requested resources are not currently available. Check cluster availability at https://ai.slices-ri.eu/live/cluster. You can relax the clusterId or gpuModel constraints, or wait. If you are on a deadline, consider requesting a reservation.

Job goes to FAILED — check the output first:

❯ slices ai output a1b2c3d4

Then check the internal Slices AI infrastructure event log for infrastructure-level errors:

❯ slices ai debug a1b2c3d4

Next steps