OSL JSON Format

This page describes the OSL-style JSON files loaded, edited, and written by the Video Annotation Tool.

An OSL JSON file is a single JSON object with dataset metadata, a label schema, and a data array of samples. Each sample points to one or more media inputs and can carry task-specific annotations.

Minimal Valid File

This is the smallest practical shape for a dataset with one video sample:

{
  "version": "2.0",
  "date": "2026-05-19",
  "dataset_name": "minimal-demo",
  "description": "",
  "modalities": ["video"],
  "metadata": {},
  "labels": {},
  "data": [
    {
      "id": "clip_0001",
      "inputs": [
        {
          "type": "video",
          "path": "clips/clip_0001.mp4"
        }
      ]
    }
  ]
}

Relative paths

Relative inputs[].path values are resolved from the folder that contains the JSON file. If you move the JSON without moving its media folders, playback can fail.
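The resolution rule above can be sketched in a few lines of Python. This is a minimal illustration, not the app's own code; the function name `resolve_input_path` and the example file names are hypothetical.

```python
from pathlib import Path


def resolve_input_path(json_path: str, input_path: str) -> Path:
    """Resolve an inputs[].path value against the folder that holds the JSON file."""
    candidate = Path(input_path)
    if candidate.is_absolute():
        # Absolute paths are used as-is.
        return candidate
    # Relative paths are interpreted from the JSON file's parent folder.
    return Path(json_path).parent / candidate


resolved = resolve_input_path("datasets/demo/annotations.json", "clips/clip_0001.mp4")
print(resolved.as_posix())  # datasets/demo/clips/clip_0001.mp4
```

Moving `annotations.json` to another folder changes the resolved location, which is why the media folders need to move with it.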

Common Mistakes

| Mistake | Result | Fix |
| --- | --- | --- |
| Root JSON is an array | The app rejects the file. | Use one root object with a data array. |
| data is missing or not a list | The app rejects the file. | Set data to [] or a list of sample objects. |
| Using top-level questions for Q/A | Legacy question banks are dropped on save. | Store Q/A in each sample's grouped answers[]. |
| Dense captions use start_ms/end_ms only | The current dense editor expects point timestamps. | Use dense_captions[].position_ms. |
| Annotation head names do not match root labels | Controls may not show the expected labels. | Keep data[].labels keys and events[].head values aligned with root labels. |
| Relative media paths no longer point to files | Samples load but playback cannot find media. | Keep media beside the JSON or resave after correcting paths. |

Top-Level Object

The smallest useful file is a JSON object with data as a list. When loading, the app fills missing standard fields with defaults. When saving, it writes the standard project fields back out.

| Field | Type | Notes |
| --- | --- | --- |
| version | string | Current app default is "2.0". |
| date | string | Usually an ISO date such as "2026-05-19". |
| dataset_name | string | Human-readable project name. |
| description | string | Free-text dataset description. Empty string is allowed. |
| modalities | array | Input types present in the dataset, for example ["video"]. The app recomputes this from sample inputs on save. |
| metadata | object | Dataset-level custom metadata. |
| labels | object | Label schema shared by classification and localization heads. |
| data | array | Sample list. This must be a list. |

Unknown root keys are preserved, except retired legacy keys documented below.
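A pre-flight check for the two structural rules (root must be an object, data must be a list) might look like the sketch below. The function name `load_osl` is hypothetical, and only the "2.0" version default comes from this page; the other default values are placeholders chosen for illustration and may differ from what the app actually fills in.

```python
import copy
import json

# Only "version": "2.0" is documented; every other default here is an assumption.
ASSUMED_DEFAULTS = {
    "version": "2.0",
    "date": "",
    "dataset_name": "",
    "description": "",
    "modalities": [],
    "metadata": {},
    "labels": {},
}


def load_osl(text: str) -> dict:
    """Parse OSL JSON text, reject invalid roots, and fill missing standard fields."""
    root = json.loads(text)
    if not isinstance(root, dict):
        raise ValueError("root JSON must be an object")
    if not isinstance(root.get("data"), list):
        raise ValueError('"data" must be a list')
    for key, default in ASSUMED_DEFAULTS.items():
        # Unknown root keys are left untouched; only missing standard keys are added.
        root.setdefault(key, copy.deepcopy(default))
    return root


project = load_osl('{"data": [], "custom_key": 1}')
print(project["version"])  # 2.0
```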

Label Schema

The root labels object defines annotation heads. Each head name is a key, and each definition should include:

  • type: single_label or multi_label.
  • labels: list of allowed label strings.

{
  "labels": {
    "action": {
      "type": "single_label",
      "labels": ["pass", "shot", "foul"]
    },
    "attributes": {
      "type": "multi_label",
      "labels": ["left_foot", "header", "set_piece"]
    }
  }
}

Classification and localization annotations should reference these same head names. For example, data[].labels.action and data[].events[].head == "action" both point at the root labels.action schema.
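To catch misaligned head names before opening a file in the tool, a check along these lines can help. It is an illustrative sketch; `find_unknown_heads` is a hypothetical helper, not part of the app.

```python
def find_unknown_heads(root: dict) -> list:
    """Return (sample_id, field, head) tuples for heads missing from root labels."""
    known = set(root.get("labels", {}))
    problems = []
    for sample in root.get("data", []):
        sample_id = sample.get("id", "<no id>")
        # Classification heads are the keys of data[].labels.
        for head in sample.get("labels", {}):
            if head not in known:
                problems.append((sample_id, "labels", head))
        # Localization heads are stored in data[].events[].head.
        for event in sample.get("events", []):
            if event.get("head") not in known:
                problems.append((sample_id, "events", event.get("head")))
    return problems


demo = {
    "labels": {"action": {"type": "single_label", "labels": ["pass", "shot"]}},
    "data": [
        {
            "id": "clip_0001",
            "labels": {"action": {"label": "shot"}},
            "events": [{"head": "actions", "label": "pass", "position_ms": 1240}],
        }
    ],
}
print(find_unknown_heads(demo))  # [('clip_0001', 'events', 'actions')]
```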

Sample Objects

Each entry in data is one sample.

| Field | Type | Notes |
| --- | --- | --- |
| id | string | Stable sample ID. Missing or duplicate IDs are normalized on load/save. Duplicates receive suffixes such as __2. |
| inputs | array | Media or feature files for this sample. Multi-view samples use multiple input entries. |
| metadata | object | Optional sample-level metadata. Empty metadata is removed on save. |
| labels | object | Classification payload for this sample. |
| events | array | Timestamped localization events. |
| captions | array | Clip-level description captions. |
| dense_captions | array | Timestamped dense descriptions. |
| answers | array | Grouped question/answer annotations. |

Unknown sample keys are preserved.
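The ID normalization described for the id field can be approximated as follows. This is a sketch under assumptions: the `__2` suffix style comes from this page, but the fallback name for missing IDs (`sample_N`) and the function name `ensure_unique_ids` are hypothetical.

```python
def ensure_unique_ids(samples: list) -> list:
    """Assign every sample a unique id, suffixing duplicates with __2, __3, ..."""
    seen = set()
    for index, sample in enumerate(samples):
        # Hypothetical fallback for missing IDs; the app's real scheme may differ.
        base = str(sample.get("id") or f"sample_{index + 1}")
        candidate, counter = base, 1
        while candidate in seen:
            counter += 1
            candidate = f"{base}__{counter}"
        seen.add(candidate)
        sample["id"] = candidate
    return samples


samples = ensure_unique_ids([{"id": "clip"}, {"id": "clip"}, {}])
print([s["id"] for s in samples])  # ['clip', 'clip__2', 'sample_3']
```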

Input Objects

Each sample should include inputs, even if the sample has only one media file.

{
  "inputs": [
    {
      "type": "video",
      "path": "clips/clip_0001.mp4",
      "fps": 25.0
    }
  ]
}

Supported input types:

| Type | Typical path | Notes |
| --- | --- | --- |
| video | clips/clip_0001.mp4 | Default when type is missing and the extension is not special. |
| frames_npy | frames/clip_0001.npy | Uses fps for playback timing. The legacy alias frame_npy is normalized to frames_npy. |
| tracking_parquet | tracking/clip_0001.parquet | Uses parquet timestamps when available. Optional fps is a fallback. |

Input paths can be relative or absolute when loading. On save, input paths are rewritten relative to the saved JSON file location when possible.
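The type defaults and the save-time path rewriting can be sketched as below. The alias frame_npy comes from this page; the extension-to-type mapping (.npy, .parquet) is an assumption about what "special" extensions means, and both function names are hypothetical.

```python
import os
from pathlib import PurePosixPath

TYPE_ALIASES = {"frame_npy": "frames_npy"}
# Assumed mapping for inputs that omit "type"; not confirmed by this page.
EXTENSION_TYPES = {".npy": "frames_npy", ".parquet": "tracking_parquet"}


def normalize_input_type(entry: dict) -> str:
    """Return the canonical input type, falling back to "video"."""
    declared = entry.get("type")
    if declared:
        return TYPE_ALIASES.get(declared, declared)
    suffix = PurePosixPath(entry.get("path", "")).suffix.lower()
    return EXTENSION_TYPES.get(suffix, "video")


def to_saved_path(media_path: str, json_path: str) -> str:
    """Rewrite an absolute media path relative to the saved JSON file when possible."""
    if not os.path.isabs(media_path):
        return media_path  # already relative; left unchanged in this sketch
    try:
        return os.path.relpath(media_path, start=os.path.dirname(json_path))
    except ValueError:
        # For example, different drives on Windows: keep the absolute path.
        return media_path


print(normalize_input_type({"type": "frame_npy", "path": "frames/clip_0001.npy"}))  # frames_npy
print(normalize_input_type({"path": "clips/clip_0001.mp4"}))  # video
```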

Multi-view samples use more than one input:

{
  "id": "play_0001",
  "inputs": [
    {"type": "video", "path": "wide/play_0001.mp4", "fps": 25.0},
    {"type": "video", "path": "close/play_0001.mp4", "fps": 25.0}
  ]
}

Task Payloads

Classification

The sample-level labels object uses the same head names defined at the root. A single_label head stores one label string; a multi_label head stores a labels list.

{
  "labels": {
    "action": {
      "label": "shot"
    },
    "attributes": {
      "labels": ["left_foot", "set_piece"]
    }
  }
}

For smart predictions, a head payload may include confidence_score as a float from 0.0 to 1.0:

{
  "labels": {
    "action": {
      "label": "shot",
      "confidence_score": 0.91
    }
  }
}

Confirming a smart prediction removes only confidence_score; the chosen label stays as the manual annotation.
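In data terms, confirmation is just the removal of one key. A minimal sketch, with `confirm_prediction` as a hypothetical helper name:

```python
def confirm_prediction(head_payload: dict) -> dict:
    """Return a copy of a head payload with confidence_score removed."""
    confirmed = dict(head_payload)
    # The label itself is kept and becomes the manual annotation.
    confirmed.pop("confidence_score", None)
    return confirmed


predicted = {"label": "shot", "confidence_score": 0.91}
print(confirm_prediction(predicted))  # {'label': 'shot'}
```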

Localization

Localization annotations live in events. Each event is a point timestamp in milliseconds.

{
  "events": [
    {
      "head": "action",
      "label": "pass",
      "position_ms": 1240
    },
    {
      "head": "action",
      "label": "shot",
      "position_ms": 4320,
      "confidence_score": 0.84
    }
  ]
}

head should match a root label head. Smart localization predictions use the same optional confidence_score convention as classification.
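When a downstream script needs frame indices rather than milliseconds, position_ms can be combined with the input's fps. The sketch below floors to the frame that contains the timestamp; that rounding choice and the function name `event_frame_indices` are assumptions, not documented app behavior.

```python
import math


def event_frame_indices(events: list, fps: float) -> list:
    """Return (label, frame_index) pairs ordered by position_ms."""
    ordered = sorted(events, key=lambda event: event["position_ms"])
    # frame index = floor(milliseconds * frames-per-second / 1000)
    return [
        (event["label"], math.floor(event["position_ms"] * fps / 1000))
        for event in ordered
    ]


events = [
    {"head": "action", "label": "shot", "position_ms": 4320, "confidence_score": 0.84},
    {"head": "action", "label": "pass", "position_ms": 1240},
]
print(event_frame_indices(events, fps=25.0))  # [('pass', 31), ('shot', 108)]
```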

Description

Description annotations live in captions. The app writes one English caption for manual description edits, but additional caption fields are preserved.

{
  "captions": [
    {
      "lang": "en",
      "text": "A player receives the pass and shoots from the edge of the box."
    }
  ]
}

Dense Description

Dense description annotations live in dense_captions. The current dense editor uses point timestamps in milliseconds.

{
  "dense_captions": [
    {
      "position_ms": 1200,
      "lang": "en",
      "text": "The midfielder receives the ball."
    },
    {
      "position_ms": 4300,
      "lang": "en",
      "text": "The forward takes a shot."
    }
  ]
}
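Files that store dense captions as start_ms/end_ms ranges need a point timestamp before the dense editor can use them. One possible conversion is sketched below; using start_ms as the point and dropping the range keys are assumptions made for this example, and `range_to_point_captions` is a hypothetical helper.

```python
def range_to_point_captions(dense_captions: list) -> list:
    """Convert range-style dense captions to point-style entries with position_ms."""
    converted = []
    for caption in dense_captions:
        if "position_ms" in caption or "start_ms" not in caption:
            # Already point-style (or nothing to convert): keep a copy unchanged.
            converted.append(dict(caption))
            continue
        item = {k: v for k, v in caption.items() if k not in ("start_ms", "end_ms")}
        # Assumption: the start of the range is the most useful point timestamp.
        item["position_ms"] = int(caption["start_ms"])
        converted.append(item)
    return converted


legacy = [{"start_ms": 1200, "end_ms": 2500, "lang": "en", "text": "The midfielder receives the ball."}]
print(range_to_point_captions(legacy))
```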

Question/Answer

Q/A annotations live in grouped per-sample answers. Each group stores the question text and one or more non-empty answers.

{
  "answers": [
    {
      "question": "What happens after the pass?",
      "answers": ["The receiving player shoots."]
    }
  ]
}

Legacy top-level questions and per-answer question_id entries are not persisted. Convert old VQA files with tools/convert_legacy_vqa_to_grouped.py.
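The grouped shape can be enforced with a small normalizer like the one below. It is a sketch, not the logic of tools/convert_legacy_vqa_to_grouped.py: the helper name `normalize_answers` is hypothetical, and dropping groups that have no question text is an assumption.

```python
def normalize_answers(groups: list) -> list:
    """Keep only grouped Q/A entries that have a question and non-empty answers."""
    cleaned = []
    for group in groups:
        if not isinstance(group, dict) or "question_id" in group:
            continue  # legacy per-answer question_id entries are not persisted
        question = str(group.get("question", "")).strip()
        answers = [str(a).strip() for a in group.get("answers", []) if str(a).strip()]
        if question and answers:
            cleaned.append({"question": question, "answers": answers})
    return cleaned


raw = [
    {"question": "What happens after the pass?", "answers": ["The receiving player shoots.", "  "]},
    {"question_id": "q1", "answer": "legacy entry"},
    {"question": "Unanswered?", "answers": []},
]
print(normalize_answers(raw))
```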

Complete Examples

Classification JSON

{
  "version": "2.0",
  "date": "2026-05-19",
  "dataset_name": "soccer-classification-demo",
  "description": "Clip-level action labels.",
  "modalities": ["video"],
  "metadata": {
    "sport": "soccer",
    "split": "train"
  },
  "labels": {
    "action": {
      "type": "single_label",
      "labels": ["pass", "shot", "foul"]
    },
    "attributes": {
      "type": "multi_label",
      "labels": ["left_foot", "header", "set_piece"]
    }
  },
  "data": [
    {
      "id": "clip_0001",
      "inputs": [
        {
          "type": "video",
          "path": "clips/clip_0001.mp4",
          "fps": 25.0
        }
      ],
      "labels": {
        "action": {
          "label": "shot"
        },
        "attributes": {
          "labels": ["left_foot"]
        }
      },
      "metadata": {
        "match_id": "match_01"
      }
    }
  ]
}

Localization and Dense Description JSON

{
  "version": "2.0",
  "date": "2026-05-19",
  "dataset_name": "soccer-timeline-demo",
  "description": "Timestamped events and dense captions.",
  "modalities": ["video"],
  "metadata": {},
  "labels": {
    "action": {
      "type": "single_label",
      "labels": ["pass", "shot", "save"]
    }
  },
  "data": [
    {
      "id": "attack_0001",
      "inputs": [
        {
          "type": "video",
          "path": "clips/attack_0001.mp4",
          "fps": 25.0
        }
      ],
      "events": [
        {
          "head": "action",
          "label": "pass",
          "position_ms": 1100
        },
        {
          "head": "action",
          "label": "shot",
          "position_ms": 3650
        }
      ],
      "captions": [
        {
          "lang": "en",
          "text": "A quick attack ends with a shot on goal."
        }
      ],
      "dense_captions": [
        {
          "position_ms": 1100,
          "lang": "en",
          "text": "The midfielder plays a forward pass."
        },
        {
          "position_ms": 3650,
          "lang": "en",
          "text": "The striker shoots from inside the area."
        }
      ]
    }
  ]
}

Multi-Input Q/A JSON

{
  "version": "2.0",
  "date": "2026-05-19",
  "dataset_name": "multi-view-qa-demo",
  "description": "Two synchronized views with question/answer labels.",
  "modalities": ["video"],
  "metadata": {
    "sport": "basketball"
  },
  "labels": {},
  "data": [
    {
      "id": "possession_0001",
      "inputs": [
        {
          "type": "video",
          "path": "broadcast/possession_0001.mp4",
          "fps": 30.0
        },
        {
          "type": "video",
          "path": "baseline/possession_0001.mp4",
          "fps": 30.0
        }
      ],
      "answers": [
        {
          "question": "Which team ends the possession?",
          "answers": ["The home team."]
        },
        {
          "question": "How does the possession end?",
          "answers": ["A made three-point shot."]
        }
      ]
    }
  ]
}

Save-Time Behavior

On save/export, the app:

  • Ensures unique sample IDs.
  • Normalizes input types, including frame_npy to frames_npy.
  • Rewrites input paths relative to the output JSON location when possible.
  • Recomputes modalities from data[].inputs[].
  • Removes empty optional sample fields such as labels, events, captions, dense_captions, answers, and metadata.
  • Normalizes Q/A answers to grouped {"question": ..., "answers": [...]} entries with non-empty text.
  • Drops legacy top-level questions and question_id answer entries.
  • Drops retired sample smart keys such as smart_labels and smart_events.
  • Does not persist localization label_colors; label colors live in app settings.
  • Preserves unknown root and sample fields where possible.
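Several of these steps can be approximated outside the app, for example to preview what a saved file will contain. The sketch below covers only the dropping of retired keys, the removal of empty optional fields, and the recomputation of modalities; `prepare_for_save` is a hypothetical name, and the first-seen ordering of modalities is an assumption.

```python
OPTIONAL_SAMPLE_FIELDS = ("labels", "events", "captions", "dense_captions", "answers", "metadata")
RETIRED_SAMPLE_KEYS = ("smart_labels", "smart_events")


def prepare_for_save(root: dict) -> dict:
    """Return a cleaned copy of the project, leaving the input untouched."""
    # Drop the legacy top-level question bank; keep other root keys, known or not.
    out = {key: value for key, value in root.items() if key != "questions"}
    samples, modalities = [], []
    for sample in root.get("data", []):
        cleaned = {k: v for k, v in sample.items() if k not in RETIRED_SAMPLE_KEYS}
        for field in OPTIONAL_SAMPLE_FIELDS:
            if field in cleaned and not cleaned[field]:
                del cleaned[field]  # empty optional fields are not written
        for entry in cleaned.get("inputs", []):
            input_type = entry.get("type", "video")
            if input_type not in modalities:
                modalities.append(input_type)
        samples.append(cleaned)
    out["data"] = samples
    out["modalities"] = modalities
    return out


project = {
    "modalities": [],
    "questions": ["legacy"],
    "data": [{"id": "clip_0001", "inputs": [{"type": "video", "path": "a.mp4"}],
              "events": [], "smart_labels": {}, "custom": 1}],
}
print(prepare_for_save(project))
```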