IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline

National University of Science and Technology Politehnica Bucharest, Romania

Representative raw frames from all four IndoorCrowd scenes, illustrating variation in viewpoint, crowd density, and occlusion patterns.

Abstract

Understanding human behaviour in crowded indoor environments is a core challenge for surveillance, smart-building applications, and human-robot interaction — all of which depend on the ability to reliably detect, segment, and track individuals within complex, dynamic social contexts. Yet existing datasets rarely capture the diversity and difficulty of real-world indoor scenes at scale.

We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four distinct indoor public locations within a campus (ACS-EC, ACS-EG, IE-Central, R-Central). The dataset comprises 31 videos sampled at 5 fps, totalling 9,913 frames with human-verified, manually corrected per-instance segmentation masks.

To characterise the quality of foundation-model-based auto-labelling in crowded indoor scenes, we establish a control subset of 620 frames annotated entirely by humans and benchmark three automatic annotators — SAM3, GroundingSAM, and EfficientGroundingSAM — against this ground truth using Cohen's κ, AP@0.5, AP@0.75, precision, recall, and mask IoU per scene. An additional 2,552 frames form a multi-object tracking subset with human-verified continuous identity tracks in MOTChallenge format. We further establish detection, segmentation, and tracking baselines by pairing YOLOv8n, YOLOv26n, and RT-DETR-L with ByteTrack, BoT-SORT, and OC-SORT across all four scenes.

Per-scene analysis reveals substantial variation in difficulty driven by crowd density, instance scale, and occlusion: ACS-EC, with 79.3% dense frames and a mean instance scale of 60.8 px, is the most challenging scene in the dataset.

Dataset

View the dataset on Hugging Face

9,913 annotated frames · 620 human-annotated control frames · 2,552 MOT-subset frames · 4 indoor scenes

Per-Scene Crowd Statistics

Scene Instances Persons/frame (mean ± std) Sparse (%) Medium (%) Dense (%) Occlusion rate
ACS-EC 2,128 12.23 ± 3.80 2.9% 17.8% 79.3% 27.5%
ACS-EG 1,260 5.41 ± 1.94 18.5% 81.5% 0.0% 38.3%
IE-Central 1,083 7.96 ± 3.63 0.0% 76.5% 23.5% 31.9%
R-Central 391 6.86 ± 1.52 1.8% 98.2% 0.0% 26.9%
Overall 4,862 8.10 ± 4.18 5.8% 62.0% 32.2% 30.3%

Density bins: sparse ≤ 3, medium 4–10, dense > 10 persons/frame. Occlusion estimated via bounding-box overlap (IoU > 0.1).
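The density binning and IoU-based occlusion flagging described above can be sketched as follows (a minimal illustration of the stated thresholds, not the release tooling):

```python
from itertools import combinations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def density_bin(n_persons):
    """Sparse <= 3, medium 4-10, dense > 10 persons per frame."""
    if n_persons <= 3:
        return "sparse"
    return "medium" if n_persons <= 10 else "dense"

def occluded_flags(boxes, thr=0.1):
    """A box counts as occluded if it overlaps any other box with IoU > thr."""
    flags = [False] * len(boxes)
    for i, j in combinations(range(len(boxes)), 2):
        if iou(boxes[i], boxes[j]) > thr:
            flags[i] = flags[j] = True
    return flags
```

The per-scene occlusion rate is then simply the fraction of flagged instances across all frames of a scene.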

Per-Scene Instance Scale

Scene Rel. scale (mean ± std) Abs. scale px (mean ± std) AR (mean ± std) Small (%) Medium (%) Large (%)
ACS-EC 0.063 ± 0.039 60.8 ± 37.7 2.02 ± 0.89 17.2% 73.5% 9.4%
ACS-EG 0.141 ± 0.081 135.6 ± 78.2 3.28 ± 1.03 1.0% 40.6% 58.4%
IE-Central 0.095 ± 0.040 91.2 ± 38.6 2.48 ± 0.66 3.3% 49.8% 46.9%
R-Central 0.062 ± 0.017 59.1 ± 16.4 2.26 ± 0.73 4.9% 93.6% 1.5%

COCO size bins: small < 32² px², medium 32²–96² px², large > 96² px².
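The size binning reduces to a threshold check on instance area in pixels²; a minimal sketch of the COCO convention:

```python
def coco_size_bin(area_px2):
    """COCO convention: small < 32^2 px^2, medium 32^2..96^2 px^2, large > 96^2 px^2."""
    if area_px2 < 32 ** 2:
        return "small"
    return "medium" if area_px2 <= 96 ** 2 else "large"
```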

Spatial Density Heatmaps


ACS-EC — dense multi-level atrium with the highest proportion of dense frames.


ACS-EG — corridor scene with strong near-to-distal scale variation.


IE-Central — entrance hall with wide variation in per-frame crowd counts.


R-Central — overhead atrium where structural columns create regular partial occlusions.

Annotation Pipeline

We employ a human-in-the-loop pipeline that combines high-recall foundation-model auto-labelling with targeted human correction, enabling efficient scaling across 9,913 frames while preserving ground-truth fidelity.

1
Video Recording
Fixed webcam at 1280 × 720 px, 25 fps. Recordings span afternoon and evening hours on multiple days to capture natural variation in crowd density and lighting.
2
Frame Sampling at 5 fps
Downsampling to 5 fps ensures consecutive frames are sufficiently distinct while maintaining compatibility with multi-sensor platforms. 31 videos yield 9,913 frames.
3
Foundation-Model Auto-Labelling
Three auto-annotators generate candidate masks and bounding boxes: SAM3 (high recall, text-prompted), GroundingSAM (language-grounded, higher precision), and EfficientGroundingSAM (comparable quality at lower inference cost).
4
Human Review & Correction via SAM 2.1
Every frame undergoes manual verification. Annotators add missing masks, correct imprecise boundaries via direct polygon labelling, and remove false positives using the interactive Gradio review UI.
5
Human-Annotated Control Subset (620 frames)
20 frames per video are annotated entirely from scratch — without any auto-label priors — to serve as ground truth for benchmarking the auto-labellers.
6
MOT Subset Curation (2,552 frames)
Initial tracklets from SAM3 detections are reviewed by humans: identity switches are corrected, fragmented tracklets merged, ghost tracks removed, and missing detections linearly interpolated. Output follows the MOTChallenge format.
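The gap-filling in step 6 amounts to linear interpolation between a tracklet's nearest annotated frames, emitted as MOTChallenge rows (frame, id, x, y, w, h, conf, -1, -1, -1). A minimal sketch, with illustrative function names rather than the curation tool's actual API:

```python
def interpolate_track(dets):
    """dets: {frame: (x, y, w, h)} for one identity. Fill missing frames
    between the first and last annotated frame by linear interpolation."""
    frames = sorted(dets)
    filled = dict(dets)
    for f0, f1 in zip(frames, frames[1:]):
        span = f1 - f0
        for f in range(f0 + 1, f1):
            t = (f - f0) / span
            filled[f] = tuple(a + t * (b - a)
                              for a, b in zip(dets[f0], dets[f1]))
    return filled

def to_mot_lines(track_id, dets, conf=1.0):
    """One MOTChallenge row per frame: frame, id, x, y, w, h, conf, -1, -1, -1."""
    return [f"{f},{track_id},{x:.1f},{y:.1f},{w:.1f},{h:.1f},{conf},-1,-1,-1"
            for f, (x, y, w, h) in sorted(dets.items())]
```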

Auto-Labelling Quality

Evaluated against the 620-frame human ground truth. Metrics: AP@0.5, AP@0.75, Precision, Recall, Mask IoU, Cohen's κ.
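Precision and recall at a fixed IoU threshold follow from a one-to-one matching of predicted to ground-truth instances. A minimal greedy-matching sketch (illustrative only; the actual evaluation follows standard COCO-style AP):

```python
def match_precision_recall(pred, gt, thr=0.5):
    """Greedily match predicted boxes to GT boxes at an IoU threshold;
    return (precision, recall). Boxes are (x1, y1, x2, y2)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0
    # Sort all candidate pairs by IoU, best first, and match greedily.
    pairs = sorted(((iou(p, g), i, j) for i, p in enumerate(pred)
                    for j, g in enumerate(gt)), reverse=True)
    used_p, used_g, tp = set(), set(), 0
    for v, i, j in pairs:
        if v < thr:
            break
        if i not in used_p and j not in used_g:
            used_p.add(i); used_g.add(j); tp += 1
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gt) if gt else 0.0
    return prec, rec
```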

Qualitative comparison (columns: raw image | SAM3 | GroundingSAM | human ground truth). SAM3 over-predicts in dense scenes (row 1); GroundingSAM misses occluded persons.

Scene Method AP@0.5 ↑ AP@0.75 ↑ Prec. ↑ Rec. ↑ Mask IoU ↑ Cohen's κ ↑
ACS-EC SAM3 0.783 0.599 0.522 0.882 0.802 0.849
ACS-EC GroundingSAM 0.435 0.324 0.831 0.474 0.831 0.762
ACS-EC Efficient G-SAM 0.433 0.323 0.829 0.472 0.803 0.758
ACS-EG SAM3 0.963 0.862 0.749 0.976 0.859 0.926
ACS-EG GroundingSAM 0.922 0.824 0.870 0.939 0.869 0.931
ACS-EG Efficient G-SAM 0.924 0.826 0.872 0.941 0.850 0.934
IE-Central SAM3 0.928 0.788 0.829 0.946 0.836 0.923
IE-Central GroundingSAM 0.803 0.670 0.929 0.817 0.835 0.902
IE-Central Efficient G-SAM 0.804 0.672 0.929 0.818 0.809 0.904
R-Central SAM3 0.946 0.832 0.652 0.982 0.847 0.864
R-Central GroundingSAM 0.901 0.774 0.670 0.944 0.853 0.859
R-Central Efficient G-SAM 0.900 0.774 0.669 0.941 0.830 0.849
Key finding: SAM3 achieves the highest recall (0.88–0.98) at the cost of low precision in dense scenes (0.52 on ACS-EC), making it the optimal starting point for human correction. GroundingSAM and EfficientGroundingSAM offer complementary higher precision. All methods degrade most on ACS-EC due to its denser frames, smaller instances (mean 60.8 px), and higher occlusion.

AP@0.5 per method by crowd density bin (sparse, medium, dense).


Distribution of annotations per image for SAM3, GroundingSAM, and EfficientGroundingSAM.

Baselines

Detection & Segmentation

All models trained on ACS-EC + ACS-EG (COCO-pretrained, 30 epochs, batch 16, 640 × 640 input) and evaluated on the held-out scenes IE-Central + R-Central. Latency measured on an NVIDIA RTX 4060 Ti (16 GB, batch 1).

Model Task Box mAP@0.5 ↑ Box mAP@0.50:95 ↑ Mask mAP@0.5 ↑ Mask mAP@0.50:95 ↑ Latency (ms) ↓ Size (MB) ↓
YOLOv8n detect 0.864 0.616 – – 2.28 6.22
YOLOv26n detect 0.796 0.560 – – 2.81 5.36
RT-DETR-L detect 0.911 0.704 – – 27.36 66.21
YOLOv8n-seg segment 0.864 0.620 0.833 0.541 1.89 6.76
YOLOv26n-seg segment 0.808 0.572 0.787 0.512 2.61 6.51

Multi-Object Tracking (Overall)

Six detector–tracker combinations evaluated across all four scenes in MOTChallenge format.

Detector Tracker MOTA ↑ IDF1 ↑ MT% ↑ ML% ↓ IDS ↓ FPS ↑
YOLOv8n ByteTrack 48.5 66.9 49.0 35.4 161 108.2
YOLOv8n BoT-SORT 51.5 68.6 52.9 34.6 143 118.5
YOLOv8n OC-SORT 51.4 68.6 45.3 35.8 138 119.5
RT-DETR-L ByteTrack 51.4 69.3 55.5 27.2 172 31.4
RT-DETR-L BoT-SORT 55.5 71.6 60.6 26.2 161 32.0
RT-DETR-L OC-SORT 56.2 71.8 54.7 27.4 166 32.7
Key findings: RT-DETR-L + OC-SORT achieves the best overall MOTA (56.2). BoT-SORT yields the fewest identity switches, indicating better identity preservation. YOLOv8n + ByteTrack exceeds 108 FPS while remaining competitive on MOTA — the recommended choice for real-time deployment. ACS-EC is the hardest tracking scene (MOTA peaks at 40.2 even with RT-DETR-L).
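The headline tracking metrics reduce to simple ratios over error counts accumulated across the sequence; a minimal sketch following the MOTChallenge definitions:

```python
def mota(fn, fp, ids, num_gt):
    """MOTA = 1 - (FN + FP + IDS) / GT: penalises missed detections,
    false positives, and identity switches against total GT objects."""
    return 1.0 - (fn + fp + ids) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN): F1 over identity-consistent
    true positives, so it rewards keeping the same ID over time."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

MOTA can thus be dragged down by detection errors alone, while IDF1 isolates identity preservation, which is why BoT-SORT's low identity-switch count shows up more strongly in IDF1 than in MOTA.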

Related Datasets

Existing pedestrian detection benchmarks — CrowdHuman, WiderPerson, CityPersons — are predominantly outdoor. The MOTChallenge series (MOT17, MOT20) has been the primary driver of tracking progress but lacks instance masks and focuses on unconstrained or event-based environments rather than fixed indoor surveillance.

JTA provides large-scale synthetic indoor/outdoor tracking, while JRDB-PanoTrack targets robot-centric panoramic views. IndoorCrowd fills the niche of real-world, fixed-camera indoor scenes with all three annotation types (bounding boxes, instance masks, MOT tracks) across diverse crowd conditions.

BibTeX

@inproceedings{indoorcrowd2026,
  author    = {Nae, Sebastian-Ion and Moldoveanu, Radu and Ghita, Alexandra Stefania and Florea, Adina Magda},
  title     = {IndoorCrowd: A Multi-Scene Dataset for Human Detection,
               Segmentation, and Tracking with an Automated Annotation Pipeline},
  booktitle = {5th DataCV Workshop and Challenge, in conjunction with
               IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}