Building a CLI-Based People Tracking and Dwell Time Analytics System Using YOLOv8 and DeepSORT

 Introduction

Tracking people across video frames and analyzing their behavior (like dwell time) is a crucial task for many real-world applications: retail analytics, security surveillance, smart cities, etc.

In this article, we will build a full command-line interface (CLI) application that:

  • Detects people in a video using YOLOv8.
  • Tracks each detected person across frames using DeepSORT.
  • Calculates each person’s dwell time (time spent visible).
  • Generates an annotated output video with:
      • Bounding boxes
      • Unique IDs
      • Live people count

We’ll optimize for speed and simplicity, while maintaining accuracy and extensibility.

Let’s dive deep.

Git Repo: faheemkhaskheli9/PeopleTracking

Project Setup

First, install the following dependencies:

pip install ultralytics deep_sort_realtime opencv-python

Libraries used:

  • ultralytics: Official YOLOv8 implementation.
  • deep_sort_realtime: Real-time DeepSORT tracking.
  • opencv-python: For reading, writing, and processing videos.

Step-by-Step Code Breakdown

Here’s the full explanation of the project code:

1. CLI Argument Parsing

We start by creating a flexible CLI interface:

import argparse

parser = argparse.ArgumentParser(description="Person dwell time tracker with output video")
parser.add_argument("--video", required=True, help="Path to input video file")
parser.add_argument("--conf", type=float, default=0.3, help="YOLO confidence threshold")
parser.add_argument("--max-age", type=int, default=5, help="DeepSORT max_age")
args = parser.parse_args()

This allows users to pass:

  • --video: Input video file.
  • --conf: Detection confidence threshold (default 0.3).
  • --max-age: number of frames DeepSORT keeps a track alive without a matching detection (default 5).

✅ Good CLI = easy to reuse and configure later.
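A nice side effect of `argparse` is that the parser is easy to sanity-check programmatically: `parse_args` accepts an explicit argument list, so you can verify defaults and overrides without touching the command line. A quick check of the parser above:

```python
import argparse

# Same parser as in the article, rebuilt here so the snippet is self-contained
parser = argparse.ArgumentParser(description="Person dwell time tracker with output video")
parser.add_argument("--video", required=True, help="Path to input video file")
parser.add_argument("--conf", type=float, default=0.3, help="YOLO confidence threshold")
parser.add_argument("--max-age", type=int, default=5, help="DeepSORT max_age")

# parse_args takes an explicit argv list, which is handy for testing
args = parser.parse_args(["--video", "my_video.mp4", "--conf", "0.5"])
print(args.video)    # my_video.mp4
print(args.conf)     # 0.5
print(args.max_age)  # 5 (default; note the dash becomes an underscore)
```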

2. Load YOLOv8 and DeepSORT

from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

model = YOLO("yolov8n.pt")
# Restrict detection to the 'person' class by passing classes=[0] when calling the model;
# assigning model.classes has no filtering effect in current ultralytics versions.

tracker = DeepSort(max_age=args.max_age, n_init=3, max_iou_distance=0.7)
  • We load YOLOv8n (Nano model) — fastest for real-time.
  • We configure DeepSORT tracker for robust ID assignment.

✅ YOLO = detection. DeepSORT = tracking and re-identification.

3. Prepare Input and Output Videos

import cv2
import os

cap = cv2.VideoCapture(args.video)
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

out_path = os.path.splitext(args.video)[0] + "_output.mp4"
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out_vid = cv2.VideoWriter(out_path, fourcc, fps, (width, height))
  • Open the input video.
  • Read its properties (FPS, width, height).
  • Set up an output video writer to save the processed frames.

✅ Output video will show real-time detection and tracking results.
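The output-path derivation is worth checking in isolation, since `os.path.splitext` keeps the directory part intact. Here it is factored into a small helper (the helper name is ours, not from the project code):

```python
import os

def make_output_path(video_path: str) -> str:
    # Replace the extension with "_output.mp4", preserving any directory prefix
    return os.path.splitext(video_path)[0] + "_output.mp4"

print(make_output_path("my_video.mp4"))    # my_video_output.mp4
print(make_output_path("clips/mall.avi"))  # clips/mall_output.mp4
```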

4. Frame-by-Frame Processing

We loop over every frame:

dwell_frames = {}
frame_idx = 0

while True:
    ret, frame = cap.read()
    if not ret:
        break
    frame_idx += 1

We maintain a dictionary, dwell_frames, that counts for how many frames each ID has been visible.
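The bookkeeping is just a per-ID frame counter, so it can be exercised standalone with simulated track IDs (no video or tracker needed):

```python
# Simulated confirmed track IDs visible in each frame
frames = [
    [1, 2],     # frame 1: IDs 1 and 2 visible
    [1, 2, 3],  # frame 2: ID 3 appears
    [1, 3],     # frame 3: ID 2 has left
]

dwell_frames = {}
for visible_ids in frames:
    for track_id in visible_ids:
        # Same .get(..., 0) + 1 pattern as in the tracking loop
        dwell_frames[track_id] = dwell_frames.get(track_id, 0) + 1

print(dwell_frames)  # {1: 3, 2: 2, 3: 2}
```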

5. Detect People

results = model(frame, classes=[0])[0]  # detect only the 'person' class (ID 0)
detections = []

for det in results.boxes.data.tolist():
    x1, y1, x2, y2, conf, cls = det
    if conf < args.conf:
        continue
    xmin, ymin = int(x1), int(y1)
    w, h = int(x2 - x1), int(y2 - y1)
    detections.append(([xmin, ymin, w, h], float(conf), int(cls)))
  • Run YOLO on the frame.
  • Extract bounding boxes.
  • Filter detections based on confidence.
  • Format detection for DeepSORT.

✅ Only confident person detections are passed to the tracker.
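The coordinate conversion is the easiest place to slip up: YOLO returns corner coordinates (x1, y1, x2, y2), while deep_sort_realtime expects (left, top, width, height). The same math as above, isolated in a helper (hypothetical name) so it can be verified:

```python
def xyxy_to_ltwh(x1: float, y1: float, x2: float, y2: float) -> list:
    # YOLO gives opposite corners; DeepSORT wants top-left corner plus box size
    return [int(x1), int(y1), int(x2 - x1), int(y2 - y1)]

print(xyxy_to_ltwh(100.0, 50.0, 220.0, 310.0))  # [100, 50, 120, 260]
```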

6. Track People Across Frames

tracks = tracker.update_tracks(detections, frame=frame)
  • DeepSORT assigns a persistent ID to each detected person.
  • Even if someone is temporarily occluded, DeepSORT tries to maintain their ID.

✅ Smooth tracking without frequent ID-switching.

7. Draw Bounding Boxes and Calculate Dwell Time

live_ids = []
for track in tracks:
    if not track.is_confirmed():
        continue
    track_id = track.track_id
    bbox = track.to_ltrb()
    x1, y1, x2, y2 = map(int, bbox)

    dwell_frames[track_id] = dwell_frames.get(track_id, 0) + 1
    live_ids.append(track_id)

    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(frame, f"ID {track_id}", (x1, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

For each confirmed track:

  • Draw green bounding box.
  • Label with ID.
  • Update the frame count for that ID.

✅ Each person gets their own persistent ID across frames.

8. Show Live People Count

live_people = len(set(live_ids))
cv2.putText(frame, f"Live People: {live_people}", (20, 40),
            cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 0, 0), 3)

We count unique IDs detected in the current frame and show it.

✅ Live people counter shows how many people are actively visible.

9. Save Annotated Frame

out_vid.write(frame)

Write the processed frame to the output video file.

✅ Output video = proof of working detection, tracking, analytics.

10. Cleanup

cap.release()
out_vid.release()

Always release video objects to avoid resource leaks.

11. Print Final Dwell Time Stats

print("\n--- Dwell Times ---")
for track_id, frames in dwell_frames.items():
    dwell_seconds = frames / fps
    print(f"ID {track_id}: {dwell_seconds:.2f} seconds")
  • After processing, we calculate how many seconds each ID spent in view.
  • Frames → Seconds conversion using video’s FPS.

✅ Each person’s dwell time is printed cleanly.
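The frames-to-seconds conversion is trivial, but it breaks if the container reports an FPS of 0 (some files do). A guarded helper is worth sketching; the helper name and the 30 fps fallback are our assumptions, not part of the project code:

```python
def frames_to_seconds(frames: int, fps: float, fallback_fps: float = 30.0) -> float:
    # Some video containers report FPS as 0; fall back to a sensible default
    effective_fps = fps if fps and fps > 0 else fallback_fps
    return frames / effective_fps

print(frames_to_seconds(90, 30.0))  # 3.0
print(frames_to_seconds(90, 0.0))   # 3.0 (uses the 30 fps fallback)
print(frames_to_seconds(50, 25.0))  # 2.0
```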

Example CLI Usage

python dwell_time_tracker_with_video.py --video my_video.mp4

Outputs:

  • my_video_output.mp4 with boxes, IDs, people count.
  • Console printout of dwell time per person.

Function Calling and Tool Use in LLMs

Introduction

Large Language Models (LLMs) were initially used for text generation: answering questions, generating summaries, and translating text. But they have since evolved into agents that use external tools (APIs, functions, databases, calculators) to perform complex reasoning and take actions.

What is Function Calling?

LLMs generate structured outputs that call external tools or APIs. The LLM returns a JSON object with arguments that can be passed to a real function.

This JSON object contains the name of the function and the parameters the function needs to execute.

Since I am using Python, the examples below use Python functions.
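To make the shape concrete, here is an illustrative tool-call payload as a Python dict. The function name and arguments are hypothetical, and the exact field names vary by provider; this only shows the general structure:

```python
import json

# Illustrative only: field names follow a common convention, not one fixed schema
tool_call = {
    "name": "get_weather",      # which function to run (hypothetical tool)
    "arguments": {              # parameters the function needs to execute
        "city": "Karachi",
        "unit": "celsius",
    },
}

# The LLM emits this as text; the client parses it and dispatches the real call
payload = json.dumps(tool_call)
parsed = json.loads(payload)
print(parsed["name"])               # get_weather
print(parsed["arguments"]["city"])  # Karachi
```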

Example Code

Setup

pip install "langchain[groq]"

Environment Variable

Set up your Groq API key:

export GROQ_API_KEY="your-key"

Then define the tools:

from langchain.chat_models import init_chat_model
from langchain.tools import tool
from langchain.tools import tool


# Define a tool the LLM can call
@tool
def get_date(_: str) -> str:
    """Returns current Date"""
    from datetime import datetime
    return datetime.now().strftime("%Y-%m-%d")


@tool
def get_time(_: str = "") -> str:
    """Returns current Time"""
    from datetime import datetime
    return datetime.now().strftime("%H:%M:%S")


# Set up the LLM
llm = init_chat_model("gemma2-9b-it", model_provider="groq")
llm_tools = llm.bind_tools([get_date, get_time])

# Invoke the tool-aware model
response = llm_tools.invoke([{"role": "user", "content": "What is the current time?"}])
# The response contains the tool call(s), not the final answer
print(response)

messages = []

tools_dict = {"get_date": get_date, "get_time": get_time}

# Execute the function(s) requested by the LLM's tool calls
for tool_call in response.tool_calls:
    selected_tool = tools_dict[tool_call["name"].lower()]
    tool_msg = selected_tool.invoke(tool_call)
    messages.append(tool_msg)
    print(tool_msg)

What’s Happening?

  • LLM gets the prompt.
  • It decides whether get_date or get_time should be called. The raw response contains the tool call:

content='' additional_kwargs={} response_metadata={'model': 'llama3.1', 'created_at': '2025-04-25T16:52:08.0778404Z', 'done': True, 'done_reason': 'stop', 'total_duration': 4427471300, 'load_duration': 81652000, 'prompt_eval_count': 182, 'prompt_eval_duration': 295000000, 'eval_count': 13, 'eval_duration': 4049000000, 'model_name': 'llama3.1'} id='run-ba491667-285d-4c11-885b-e89ef0f02b31-0' tool_calls=[{'name': 'get_time', 'args': {}, 'id': '03b3bdb0-729b-4568-ba00-0539c3272f81', 'type': 'tool_call'}] usage_metadata={'input_tokens': 182, 'output_tokens': 13, 'total_tokens': 195}
  • It generates a JSON that LangChain converts to a function call.
  • The result of the function is returned as a tool message and shown to the user:

content='21:52:08' name='get_time' tool_call_id='03b3bdb0-729b-4568-ba00-0539c3272f81'

Let’s break it down clearly:

🔧 Tool 1: get_date

@tool
def get_date(_: str) -> str:
    """Returns current Date"""
    from datetime import datetime
    return datetime.now().strftime("%Y-%m-%d")
  • Tool name: get_date
  • Docstring: "Returns current Date"
  • This description is used by the LLM to decide: “If the user asks something related to today’s date, I should call this tool.”
  • Even if the tool doesn’t need input, it must accept one argument (_: str) for compatibility with LangChain’s tool signature.

⏰ Tool 2: get_time

@tool
def get_time(_: str = "") -> str:
    """Returns current Time"""
    from datetime import datetime
    return datetime.now().strftime("%H:%M:%S")
  • Tool name: get_time
  • Docstring: "Returns current Time"
  • So, when the user asks “What’s the time now?”, the LLM reads the tool’s description and decides: “Sounds like I should call get_time.”

🧠 Why Are These Descriptions So Crucial?

LangChain sends a structured schema of tools to the LLM like:

{
  "name": "get_time",
  "description": "Returns current Time",
  "parameters": {...}
}

So your tool’s docstring is essentially the only information the LLM has about the tool’s purpose.

If the docstring is too vague (e.g., "Returns a value"), the LLM won’t know when to use it. If it's too specific or inaccurate, it might mislead the LLM.

✅ Best Practice for Writing Tool Docstrings

  1. Describe clearly what the tool does.
  2. Include input/output expectations if needed.
  3. Use natural language — LLMs read this like any text.
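To see why the docstring carries so much weight, here is a minimal sketch of how a framework could derive a tool schema from a plain Python function using only the standard library. This is an illustration of the idea, not LangChain’s actual implementation:

```python
import inspect

def build_tool_schema(func):
    """Derive a tool schema from a function's name, docstring, and signature."""
    sig = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": (func.__doc__ or "").strip(),  # this is what the LLM reads
        "parameters": list(sig.parameters),
    }

def get_time(_: str = "") -> str:
    """Returns current Time"""
    from datetime import datetime
    return datetime.now().strftime("%H:%M:%S")

schema = build_tool_schema(get_time)
print(schema["name"])         # get_time
print(schema["description"])  # Returns current Time
```

If `get_time`'s docstring were empty or vague, the "description" field — the model's only clue — would be useless, and the tool would rarely be chosen.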

🛠 Use Cases

  • Medical AI: Use tools for diagnosis and explanation.
  • Finance Bots: Access real-time stock data and generate reports.
  • Customer Service: Query product APIs and take actions.
  • DevOps Copilot: Execute shell commands or query Grafana dashboards.

🧠 Summary

  • LLMs are no longer just language models — they’re tool-using reasoning machines.
  • LangChain orchestrates tools, memory, and LLMs.
