IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline

National University of Science and Technology Politehnica Bucharest, Romania

Representative raw frames from all four IndoorCrowd scenes, illustrating variation in viewpoint, crowd density, and occlusion patterns.

Abstract

Understanding human behaviour in crowded indoor environments is a core challenge for surveillance, smart-building applications, and human-robot interaction — all of which depend on the ability to reliably detect, segment, and track individuals within complex, dynamic social contexts. Yet existing datasets rarely capture the diversity and difficulty of real-world indoor scenes at scale.

We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four distinct indoor public locations within a campus (ACS-EC, ACS-EG, IE-Central, R-Central). The dataset comprises 31 videos sampled at 5 fps, totalling 9,913 frames with human-verified, manually corrected per-instance segmentation masks.

To characterise the quality of foundation-model-based auto-labelling in crowded indoor scenes, we establish a control subset of 620 frames annotated entirely by humans and benchmark three automatic annotators — SAM3, GroundingSAM, and EfficientGroundingSAM — against this ground truth using Cohen's κ, AP@0.5, AP@0.75, precision, recall, and mask IoU per scene. An additional 2,552 frames form a multi-object tracking subset with human-verified continuous identity tracks in MOTChallenge format. We further establish detection, segmentation, and tracking baselines by pairing YOLOv8n, YOLOv26n, and RT-DETR-L with ByteTrack, BoT-SORT, and OC-SORT across all four scenes.

Per-scene analysis reveals substantial variation in difficulty driven by crowd density, instance scale, and occlusion: ACS-EC, with 79.3% dense frames and a mean instance scale of 60.8 px, is the most challenging scene in the dataset.

Dataset

View the dataset on Hugging Face

9,913 annotated frames · 620 human-annotated control frames · 2,552 MOT-subset frames · 4 indoor scenes

Per-Scene Crowd Statistics

Scene Instances Persons/frame (mean ± std) Sparse (%) Medium (%) Dense (%) Occlusion rate
ACS-EC 2,128 12.23 ± 3.80 2.9% 17.8% 79.3% 27.5%
ACS-EG 1,260 5.41 ± 1.94 18.5% 81.5% 0.0% 38.3%
IE-Central 1,083 7.96 ± 3.63 0.0% 76.5% 23.5% 31.9%
R-Central 391 6.86 ± 1.52 1.8% 98.2% 0.0% 26.9%
Overall 4,862 8.10 ± 4.18 5.8% 62.0% 32.2% 30.3%

Density bins: sparse ≤ 3, medium 4–10, dense > 10 persons/frame. Occlusion estimated via bounding-box overlap (IoU > 0.1).
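The density binning and IoU-based occlusion flagging described above can be sketched as follows (a minimal illustration of the stated thresholds, not the release tooling):

```python
from itertools import combinations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def density_bin(n_persons):
    """Sparse <= 3, medium 4-10, dense > 10 persons per frame."""
    if n_persons <= 3:
        return "sparse"
    return "medium" if n_persons <= 10 else "dense"

def occluded_flags(boxes, thr=0.1):
    """A box counts as occluded if it overlaps any other box with IoU > thr."""
    flags = [False] * len(boxes)
    for i, j in combinations(range(len(boxes)), 2):
        if iou(boxes[i], boxes[j]) > thr:
            flags[i] = flags[j] = True
    return flags
```

The per-scene occlusion rate is then simply the fraction of flagged instances across all frames of a scene.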

Per-Scene Instance Scale

Scene Rel. scale (mean ± std) Abs. scale px (mean ± std) AR (mean ± std) Small (%) Medium (%) Large (%)
ACS-EC 0.063 ± 0.039 60.8 ± 37.7 2.02 ± 0.89 17.2% 73.5% 9.4%
ACS-EG 0.141 ± 0.081 135.6 ± 78.2 3.28 ± 1.03 1.0% 40.6% 58.4%
IE-Central 0.095 ± 0.040 91.2 ± 38.6 2.48 ± 0.66 3.3% 49.8% 46.9%
R-Central 0.062 ± 0.017 59.1 ± 16.4 2.26 ± 0.73 4.9% 93.6% 1.5%

COCO size bins: small < 32² px², medium 32²–96² px², large > 96² px².
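The size binning reduces to a threshold check on instance area in pixels²; a minimal sketch of the COCO convention:

```python
def coco_size_bin(area_px2):
    """COCO convention: small < 32^2 px^2, medium 32^2..96^2 px^2, large > 96^2 px^2."""
    if area_px2 < 32 ** 2:
        return "small"
    return "medium" if area_px2 <= 96 ** 2 else "large"
```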

Spatial Density Heatmaps


ACS-EC — dense multi-level atrium with the highest proportion of dense frames.


ACS-EG — corridor scene with strong near-to-distal scale variation.


IE-Central — entrance hall with wide variation in per-frame crowd counts.


R-Central — overhead atrium where structural columns create regular partial occlusions.

Annotation Pipeline

We employ a human-in-the-loop pipeline that combines high-recall foundation-model auto-labelling with targeted human correction, enabling efficient scaling across 9,913 frames while preserving ground-truth fidelity.

1
Video Recording
Fixed webcam at 1280 × 720 px, 25 fps. Recordings span afternoon and evening hours on multiple days to capture natural variation in crowd density and lighting.
2
Frame Sampling at 5 fps
Downsampling to 5 fps ensures consecutive frames are sufficiently distinct while maintaining compatibility with multi-sensor platforms. 31 videos yield 9,913 frames.
3
Foundation-Model Auto-Labelling
Three auto-annotators generate candidate masks and bounding boxes: SAM3 (high recall, text-prompted), GroundingSAM (language-grounded, higher precision), and EfficientGroundingSAM (comparable quality at lower inference cost).
4
Human Review & Correction via SAM 2.1
Every frame undergoes manual verification. Annotators add missing masks, correct imprecise boundaries via direct polygon labelling, and remove false positives using the interactive Gradio review UI.
5
Human-Annotated Control Subset (620 frames)
20 frames per video are annotated entirely from scratch — without any auto-label priors — to serve as ground truth for benchmarking the auto-labellers.
6
MOT Subset Curation (2,552 frames)
Initial tracklets from SAM3 detections are reviewed by humans: identity switches are corrected, fragmented tracklets merged, ghost tracks removed, and missing detections linearly interpolated. Output follows the MOTChallenge format.
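The gap-filling in step 6 amounts to linear interpolation between a tracklet's nearest annotated frames, emitted as MOTChallenge rows (frame, id, x, y, w, h, conf, -1, -1, -1). A minimal sketch, with illustrative function names rather than the curation tool's actual API:

```python
def interpolate_track(dets):
    """dets: {frame: (x, y, w, h)} for one identity. Fill missing frames
    between the first and last annotated frame by linear interpolation."""
    frames = sorted(dets)
    filled = dict(dets)
    for f0, f1 in zip(frames, frames[1:]):
        span = f1 - f0
        for f in range(f0 + 1, f1):
            t = (f - f0) / span
            filled[f] = tuple(a + t * (b - a)
                              for a, b in zip(dets[f0], dets[f1]))
    return filled

def to_mot_lines(track_id, dets, conf=1.0):
    """One MOTChallenge row per frame: frame, id, x, y, w, h, conf, -1, -1, -1."""
    return [f"{f},{track_id},{x:.1f},{y:.1f},{w:.1f},{h:.1f},{conf},-1,-1,-1"
            for f, (x, y, w, h) in sorted(dets.items())]
```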

Auto-Labelling Quality

Evaluated against the 620-frame human ground truth. Metrics: AP@0.5, AP@0.75, Precision, Recall, Mask IoU, Cohen's κ.
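Precision and recall at a fixed IoU threshold follow from a one-to-one matching of predicted to ground-truth instances. A minimal greedy-matching sketch (illustrative only; the actual evaluation follows standard COCO-style AP):

```python
def match_precision_recall(pred, gt, thr=0.5):
    """Greedily match predicted boxes to GT boxes at an IoU threshold;
    return (precision, recall). Boxes are (x1, y1, x2, y2)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0
    # Sort all candidate pairs by IoU, best first, and match greedily.
    pairs = sorted(((iou(p, g), i, j) for i, p in enumerate(pred)
                    for j, g in enumerate(gt)), reverse=True)
    used_p, used_g, tp = set(), set(), 0
    for v, i, j in pairs:
        if v < thr:
            break
        if i not in used_p and j not in used_g:
            used_p.add(i); used_g.add(j); tp += 1
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gt) if gt else 0.0
    return prec, rec
```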

Qualitative comparison (columns: raw image | SAM3 | GroundingSAM | human ground truth). SAM3 over-predicts in dense scenes (row 1); GroundingSAM misses occluded persons.

Scene Method AP@0.5 ↑ AP@0.75 ↑ Prec. ↑ Rec. ↑ Mask IoU ↑ Cohen's κ ↑
ACS-EC SAM3 0.783 0.599 0.522 0.882 0.802 0.849
ACS-EC GroundingSAM 0.435 0.324 0.831 0.474 0.831 0.762
ACS-EC Efficient G-SAM 0.433 0.323 0.829 0.472 0.803 0.758
ACS-EG SAM3 0.963 0.862 0.749 0.976 0.859 0.926
ACS-EG GroundingSAM 0.922 0.824 0.870 0.939 0.869 0.931
ACS-EG Efficient G-SAM 0.924 0.826 0.872 0.941 0.850 0.934
IE-Central SAM3 0.928 0.788 0.829 0.946 0.836 0.923
IE-Central GroundingSAM 0.803 0.670 0.929 0.817 0.835 0.902
IE-Central Efficient G-SAM 0.804 0.672 0.929 0.818 0.809 0.904
R-Central SAM3 0.946 0.832 0.652 0.982 0.847 0.864
R-Central GroundingSAM 0.901 0.774 0.670 0.944 0.853 0.859
R-Central Efficient G-SAM 0.900 0.774 0.669 0.941 0.830 0.849
Key finding: SAM3 achieves the highest recall (0.88–0.98) at the cost of low precision in dense scenes (0.52 on ACS-EC), making it the optimal starting point for human correction. GroundingSAM and EfficientGroundingSAM offer complementary higher precision. All methods degrade most on ACS-EC due to its denser frames, smaller instances (mean 60.8 px), and higher occlusion.

AP@0.5 per method by crowd density bin (sparse, medium, dense).


Distribution of annotations per image for SAM3, GroundingSAM, and EfficientGroundingSAM.

Baselines

Detection & Segmentation

All models trained on ACS-EC + ACS-EG (COCO-pretrained, 30 epochs, batch 16, 640 × 640 input) and evaluated on the held-out scenes IE-Central + R-Central. Latency measured on an NVIDIA RTX 4060 Ti (16 GB, batch 1).

Model Task Box mAP@0.5 ↑ Box mAP@0.50:95 ↑ Mask mAP@0.5 ↑ Mask mAP@0.50:95 ↑ Latency (ms) ↓ Size (MB) ↓
YOLOv8n detect 0.864 0.616 – – 2.28 6.22
YOLOv26n detect 0.796 0.560 – – 2.81 5.36
RT-DETR-L detect 0.911 0.704 – – 27.36 66.21
YOLOv8n-seg segment 0.864 0.620 0.833 0.541 1.89 6.76
YOLOv26n-seg segment 0.808 0.572 0.787 0.512 2.61 6.51

Multi-Object Tracking (Overall)

Six detector–tracker combinations evaluated across all four scenes in MOTChallenge format.

Detector Tracker MOTA ↑ IDF1 ↑ MT% ↑ ML% ↓ IDS ↓ FPS ↑
YOLOv8n ByteTrack 48.5 66.9 49.0 35.4 161 108.2
YOLOv8n BoT-SORT 51.5 68.6 52.9 34.6 143 118.5
YOLOv8n OC-SORT 51.4 68.6 45.3 35.8 138 119.5
RT-DETR-L ByteTrack 51.4 69.3 55.5 27.2 172 31.4
RT-DETR-L BoT-SORT 55.5 71.6 60.6 26.2 161 32.0
RT-DETR-L OC-SORT 56.2 71.8 54.7 27.4 166 32.7
Key findings: RT-DETR-L + OC-SORT achieves the best overall MOTA (56.2). BoT-SORT yields the fewest identity switches, indicating better identity preservation. YOLOv8n + ByteTrack exceeds 108 FPS while remaining competitive on MOTA — the recommended choice for real-time deployment. ACS-EC is the hardest tracking scene (MOTA peaks at 40.2 even with RT-DETR-L).
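The headline tracking metrics reduce to simple ratios over error counts accumulated across the sequence; a minimal sketch following the MOTChallenge definitions:

```python
def mota(fn, fp, ids, num_gt):
    """MOTA = 1 - (FN + FP + IDS) / GT: penalises missed detections,
    false positives, and identity switches against total GT objects."""
    return 1.0 - (fn + fp + ids) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN): F1 over identity-consistent
    true positives, so it rewards keeping the same ID over time."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

MOTA can thus be dragged down by detection errors alone, while IDF1 isolates identity preservation, which is why BoT-SORT's low identity-switch count shows up more strongly in IDF1 than in MOTA.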

Related Datasets

Existing pedestrian detection benchmarks — CrowdHuman, WiderPerson, CityPersons — are predominantly outdoor. The MOTChallenge series (MOT17, MOT20) has been the primary driver of tracking progress but lacks instance masks and focuses on unconstrained or event-based environments rather than fixed indoor surveillance.

JTA provides large-scale synthetic indoor/outdoor tracking, while JRDB-PanoTrack targets robot-centric panoramic views. IndoorCrowd fills the niche of real-world, fixed-camera indoor scenes with all three annotation types (bounding boxes, instance masks, MOT tracks) across diverse crowd conditions.

BibTeX

@inproceedings{indoorcrowd2026,
  author    = {Nae, Sebastian-Ion and Moldoveanu, Radu and Ghita, Alexandra Stefania and Florea, Adina Magda},
  title     = {IndoorCrowd: A Multi-Scene Dataset for Human Detection,
               Segmentation, and Tracking with an Automated Annotation Pipeline},
  booktitle = {5th DataCV Workshop and Challenge, in conjunction with
               IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}