Get a large dataset into a job

The recommended approach for large datasets (anything from a few GB upwards) is S3 object storage: upload once from your laptop, pull inside every job that needs it. S3 is mirrored across all SLICES sites so the same bucket is accessible from Ghent, Antwerp, and Madrid clusters without any extra steps.

This page covers:

  1. Getting access to S3

  2. Uploading from your laptop

  3. Downloading inside a job

For the full S3 reference (quotas, versioning, optimal locations), see the S3 storage documentation.

Step 1 — Create a bucket and access keys

  1. Find your bucket name on the SLICES Portal project page. It follows the pattern ilabt.imec.be-project-<your-project-name>.

  2. Go to the S3 web console and log in with your SLICES account (choose Use Slices-RI account - iam).

  3. If your bucket is not listed under Object Browser, click Buckets → Create Bucket and enter the exact name from step 1.

  4. Click Access Keys → Create access key. Save the Access Key and Secret Key that are shown — the secret cannot be retrieved later.

Warning

Store your access key and secret key in environment variables or a secrets manager, not hard-coded in scripts. Any script committed to a public repository with embedded credentials should be treated as compromised.
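For local use, one option is to export the keys in your shell session or profile. The values below are placeholders; substitute the keys shown in the S3 console:

```shell
# Placeholder values; substitute the keys from the S3 console
export S3_ACCESS_KEY='EXAMPLEACCESSKEY'
export S3_SECRET_KEY='examplesecretkey'
```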

Step 2 — Upload from your laptop

Choose the method that fits your workflow.

Web console (small uploads, exploration)

Use the S3 web console Object Browser to drag-and-drop files or create folders interactively. Practical for files up to a few hundred MB; slow for large datasets.

rclone (alternative with sync semantics)

rclone is popular for keeping a local directory in sync with remote storage. Install from https://rclone.org/install/, then configure:

❯ rclone config

Choose New remote → S3 → Other, set the endpoint to s3.slices-be.eu, and enter your access key and secret key. Name the remote slices.
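After the wizard finishes, the file typically found at ~/.config/rclone/rclone.conf should contain an entry along these lines (a sketch; field names may vary slightly between rclone versions):

```
[slices]
type = s3
provider = Other
access_key_id = <your-access-key>
secret_access_key = <your-secret-key>
endpoint = s3.slices-be.eu
```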

Upload:

❯ rclone copy ./imagenet/ slices:<bucket-name>/datasets/imagenet/ --progress

Use rclone sync instead of rclone copy to delete remote files that no longer exist locally.

Step 3 — Access the dataset inside a job

Within a SLICES AI job, use the internal S3 endpoint https://s3.sliceslocal:9060. It routes within the infrastructure and gives lower latency and higher throughput than the public endpoint.

Note

The internal endpoint uses a self-signed TLS certificate. See the examples below for how to handle this. The internal endpoint only works from within a running job on the SLICES AI infrastructure — use https://s3.slices-be.eu from your laptop.

Recommended approach: If a framework-specific S3 library exists for your stack (e.g. s3torchconnector for PyTorch), use it; such libraries handle S3 connectivity automatically and integrate seamlessly with your training loop. Otherwise, pick one of the approaches below.

Streaming approach: Use s3fs to read data directly from S3 without a local copy (Approach 1).

Framework-agnostic approach: Use boto3 to download the dataset to fast local storage at job start, then train from the local copy (Approach 2).

Alternative: Use webdataset for streaming sharded tar archives (Approach 3); see the pros/cons below.

Approach 1 — s3fs

s3fs lets you stream data directly from S3 without downloading to local storage first.

Add to your Dockerfile or job command:

pip install --no-cache-dir s3fs

In your training script:

import os
import s3fs

BUCKET       = "ilabt.imec.be-project-<your-project-name>"
DATASET_KEY  = "datasets/imagenet/"           # prefix (folder) inside the bucket
ENDPOINT     = "https://s3.sliceslocal:9060"

# Initialize the s3fs filesystem
# Note: s3fs forwards 'client_kwargs' and 'config_kwargs' to the underlying botocore client
fs = s3fs.S3FileSystem(
    key=os.environ["S3_ACCESS_KEY"],
    secret=os.environ["S3_SECRET_KEY"],
    endpoint_url=ENDPOINT,
    use_ssl=True,
    client_kwargs={"verify": False},          # allow the self-signed TLS cert on s3.sliceslocal
    config_kwargs={"signature_version": "s3v4"},
)

remote_path = f"{BUCKET}/{DATASET_KEY}"

# To iterate over the contents of the bucket (e.g. list files, check existence), use the s3fs API:
for file_path in fs.ls(remote_path, detail=False):
    print(file_path)
    with fs.open(file_path, "rb") as f:
        # do something with the file-like object `f`
        print(f.read(100))  # read the first 100 bytes

Advantages:

  - No initial download delay; training starts immediately.
  - Saves local storage; useful for datasets larger than /project_scratch.

Disadvantages:

  - Network I/O contention with training (can slow training if bandwidth is limited).
  - Not suitable for datasets with many small files; per-object S3 latency adds up. Consider using webdataset.
  - Less predictable iteration time (network jitter affects batch loading).

Approach 2 — Download to local storage first

Pull the entire dataset into fast local storage at job start, then train from the local copy. Works with any framework.

Install boto3 in your Docker image (add to your Dockerfile, see Build and use a custom Docker image), or install at job start via the job’s command:

pip install boto3 --quiet && python train.py

In your training script:

import os
import boto3
from botocore.client import Config
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

BUCKET       = "ilabt.imec.be-project-<your-project-name>"
DATASET_KEY  = "datasets/imagenet/"          # prefix (folder) inside the bucket
LOCAL_DIR    = "/project_scratch/imagenet/"  # fast local SSD; or /project_ghent/
ENDPOINT     = "https://s3.sliceslocal:9060" # internal endpoint — fastest from a job

s3 = boto3.resource(
    "s3",
    endpoint_url=ENDPOINT,
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    verify=False,  # self-signed cert on the internal endpoint
    config=Config(signature_version="s3v4"),
)

bucket = s3.Bucket(BUCKET)
os.makedirs(LOCAL_DIR, exist_ok=True)

print(f"Downloading dataset from s3://{BUCKET}/{DATASET_KEY} ...")
for obj in bucket.objects.filter(Prefix=DATASET_KEY):
    if obj.key.endswith("/"):  # skip zero-byte "folder" placeholder objects
        continue
    dest = os.path.join(LOCAL_DIR, os.path.relpath(obj.key, DATASET_KEY))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    bucket.download_file(obj.key, dest)

print("Download complete. Starting training...")
# ... your training code here with LOCAL_DIR
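The key-to-local-path mapping in the download loop can be illustrated in isolation (the object key below is a hypothetical example):

```python
import os

DATASET_KEY = "datasets/imagenet/"
LOCAL_DIR   = "/project_scratch/imagenet/"

# A hypothetical object key under the dataset prefix:
key  = "datasets/imagenet/train/n01440764/img_0001.JPEG"

# relpath strips the prefix, join re-roots the remainder under LOCAL_DIR
dest = os.path.join(LOCAL_DIR, os.path.relpath(key, DATASET_KEY))
print(dest)  # /project_scratch/imagenet/train/n01440764/img_0001.JPEG
```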

Advantages:

  - Predictable iteration time; all data on fast local SSD.
  - Works with any ML framework.
  - Simple to debug (data visible on disk).

Disadvantages:

  - Initial download delay (minutes to hours for very large datasets).
  - Requires enough local storage (/project_scratch capacity).

Tip

For very large datasets, downloading once to /project_scratch and keeping it there across jobs avoids re-downloading every run. /project_scratch is persistent between jobs on the same slave node — see Storage reference.

For datasets that change often, or when you cannot guarantee landing on the same node, download to /project_scratch at the start of each job: the local SSD is fast enough that even a download of a few hundred GB completes in minutes.

Approach 3 — webdataset for streaming tar archives

webdataset streams data from S3 tar archives, avoiding both the download delay and the need for local storage. Popular for computer vision datasets stored as sharded tarballs.

By bundling many samples into a single archive, webdataset optimises for sequential reads and high throughput. Ideal for large datasets that don’t fit on local storage and can be read sequentially.

Prerequisites

Your dataset must be packed into sharded tar archives. See the webdataset documentation on creating datasets for how to convert a regular dataset (e.g., ImageNet folder structure) into the webdataset tar format. Once prepared, upload the tar files to your S3 bucket at a location like s3://ilabt.imec.be-project-<your-project-name>/datasets/imagenet-sharded/.
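webdataset ships its own tooling for this (e.g. its ShardWriter helper), but the format itself is plain tar: all files belonging to one sample share a key and differ only in extension. A minimal stdlib sketch of writing one shard (write_shard and the sample names are illustrative, not part of the webdataset API):

```python
import io
import tarfile

def write_shard(shard_path, samples):
    """Pack (member_name, payload_bytes) pairs into one tar shard.

    In the webdataset convention, members of one sample share a key,
    e.g. sample000000.jpg and sample000000.cls.
    """
    with tarfile.open(shard_path, "w") as tar:
        for name, payload in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

write_shard("000000.tar", [
    ("sample000000.jpg", b"<jpeg bytes>"),
    ("sample000000.cls", b"0"),
])
```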

Add to your Dockerfile or job command:

pip install fsspec s3fs webdataset --quiet && python train.py

Example:

import os

import fsspec
import webdataset as wds
from torch.utils.data import DataLoader

BUCKET       = "ilabt.imec.be-project-<your-project-name>"
DATASET_KEY  = "datasets/imagenet-sharded"
ENDPOINT     = "https://s3.sliceslocal:9060"

# Brace notation: webdataset expands this to shards 000000.tar through 000099.tar
urls = f"s3://{BUCKET}/{DATASET_KEY}/imagenet-train-{{000000..000099}}.tar"

def fsspec_opener(data):
    """Open each shard URL with fsspec and attach the stream, as tarfile_to_samples expects."""
    for sample in data:
        stream = fsspec.open(
            sample["url"],
            mode="rb",
            key=os.environ["S3_ACCESS_KEY"],
            secret=os.environ["S3_SECRET_KEY"],
            client_kwargs={
                "verify": False,            # allow the self-signed cert on s3.sliceslocal
                "endpoint_url": ENDPOINT,
            },
            config_kwargs={"signature_version": "s3v4"},
        ).open()
        yield {"url": sample["url"], "stream": stream}

dataset = wds.DataPipeline(
    wds.SimpleShardList(urls),
    wds.split_by_node,        # required for DistributedDataParallel (multi-GPU)
    fsspec_opener,
    wds.tarfile_to_samples(),
    wds.decode(...),          # fill in a decoder for your data, e.g. "pil" for images
    wds.to_tuple(...),        # fill in your sample keys, e.g. "jpg", "cls"
)

dataloader = DataLoader(dataset, batch_size=4, num_workers=2)

# Use the PyTorch dataloader in your training loop as usual
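The {000000..000099} brace pattern in the shard URL is expanded by webdataset into an explicit list of 100 shard names, equivalent to this stdlib expression:

```python
# Equivalent expansion of imagenet-train-{000000..000099}.tar
shards = [f"imagenet-train-{i:06d}.tar" for i in range(100)]
print(len(shards))  # 100
print(shards[0])    # imagenet-train-000000.tar
print(shards[-1])   # imagenet-train-000099.tar
```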

Advantages:

  - No initial download or local storage needed.
  - Fast iteration for sharded datasets; sequential streaming is optimized.
  - Minimal latency once streaming starts.

Disadvantages:

  - Requires data to be pre-packaged as tar archives (webdataset format).
  - Network I/O contention with training (similar to s3fs).
  - Iteration time less predictable (network jitter).
  - No random access; samples must be iterated sequentially.

Pass credentials via environment variables

Set S3_ACCESS_KEY and S3_SECRET_KEY in the environment field of your job definition so they reach the running container without being stored in your script:

{
  "request": {
    "docker": {
      "image": "...",
      "command": "python /project_ghent/train.py",
      "environment": {
        "S3_ACCESS_KEY": "<your-access-key>",
        "S3_SECRET_KEY": "<your-secret-key>"
      },
      "storage": [
        { "containerPath": "/project_ghent" },
        { "containerPath": "/project_scratch" }
      ]
    }
  }
}

See Environment variables in the Job Definition reference for details on the environment field.
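Inside the container, the training script can then read the credentials and fail fast with an actionable message when one is missing. A small sketch (require_env is a hypothetical helper, not part of any SLICES library):

```python
import os

def require_env(name):
    """Return a required environment variable, or fail with an actionable error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the 'environment' field of your job definition"
        )
    return value

# In train.py:
# access_key = require_env("S3_ACCESS_KEY")
# secret_key = require_env("S3_SECRET_KEY")
```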