A voice-controlled camera system with heads-up display (HUD) functionality, designed to work with Ray-Ban smart glasses. The system uses offline speech recognition to process commands and integrates real-time computer vision capabilities for face detection, hand tracking, and object recognition.
The project demonstrates advanced integration of audio processing, computer vision, and real-time overlay rendering, all optimized for wearable technology constraints.
Speech Recognition: Implemented using Vosk's offline model, processing audio in real time on a dedicated thread. A custom command parser applies confidence thresholds and debouncing logic to ensure accurate command detection.
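A minimal sketch of this listener thread, assuming a Vosk model unpacked into a local `model/` directory and a 16 kHz mono microphone read via the sounddevice library; the `COMMANDS` set, threshold values, and function names are illustrative, not the project's actual identifiers.

```python
import json, queue, threading, time
import sounddevice as sd
from vosk import Model, KaldiRecognizer

COMMANDS = {"take photo", "start recording", "stop recording"}  # hypothetical
MIN_CONF = 0.8      # per-word confidence threshold
DEBOUNCE_S = 1.5    # ignore repeats of the same command within this window

audio_q = queue.Queue()

def _mic_callback(indata, frames, time_info, status):
    # runs on the audio driver's thread; just hand the raw bytes off
    audio_q.put(bytes(indata))

def listen(on_command):
    model = Model("model")                      # offline model directory
    rec = KaldiRecognizer(model, 16000)
    rec.SetWords(True)                          # enable per-word confidences
    last_accepted = {}                          # command -> last accepted time
    with sd.RawInputStream(samplerate=16000, blocksize=8000,
                           dtype="int16", channels=1,
                           callback=_mic_callback):
        while True:
            data = audio_q.get()
            if not rec.AcceptWaveform(data):
                continue                        # utterance not finished yet
            result = json.loads(rec.Result())
            words = result.get("result", [])
            text = result.get("text", "")
            if not words or text not in COMMANDS:
                continue
            # reject low-confidence recognitions
            if min(w["conf"] for w in words) < MIN_CONF:
                continue
            # debounce: drop repeats arriving too soon after the last accept
            now = time.monotonic()
            if now - last_accepted.get(text, 0) < DEBOUNCE_S:
                continue
            last_accepted[text] = now
            on_command(text)

# run recognition on its own thread so the video loop is never blocked
threading.Thread(target=listen, args=(print,), daemon=True).start()
```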
Computer Vision Pipeline: OpenCV-based processing chain handling multiple detection tasks simultaneously. Face detection uses Haar cascades for efficiency, while hand tracking leverages MediaPipe for accuracy. Frame processing optimized to maintain 30+ FPS.
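A minimal sketch of one pass through such a pipeline, assuming OpenCV's bundled Haar cascade and MediaPipe's Hands solution; the function and variable names are illustrative rather than the project's actual code.

```python
import cv2
import mediapipe as mp

# Haar cascade for faces (shipped with opencv-python)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# MediaPipe hand tracker
hands = mp.solutions.hands.Hands(max_num_hands=2,
                                 min_detection_confidence=0.6)

def process_frame(frame):
    # Haar cascades operate on grayscale images
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
    # MediaPipe expects RGB input
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    hand_results = hands.process(rgb)
    return faces, hand_results.multi_hand_landmarks
```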
Overlay System: Custom rendering engine that draws information directly onto video frames. Dynamic positioning based on detected objects, with fade-in/fade-out animations for smooth user experience.
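A minimal sketch of how a fading HUD element could be blended onto a frame with cv2.addWeighted; the `HudElement` class and its timing fields are hypothetical, not the project's actual API.

```python
import time
import cv2

class HudElement:
    def __init__(self, text, pos, fade_s=0.3, lifetime_s=3.0):
        self.text, self.pos = text, pos
        self.fade_s, self.lifetime_s = fade_s, lifetime_s
        self.born = time.monotonic()

    def alpha(self):
        """0..1 opacity: ramp up, hold, then ramp down."""
        age = time.monotonic() - self.born
        if age < self.fade_s:                       # fade in
            return age / self.fade_s
        remaining = self.lifetime_s - age
        if remaining < self.fade_s:                 # fade out
            return max(remaining / self.fade_s, 0.0)
        return 1.0

def draw(frame, element):
    # render the element onto a copy, then alpha-blend it into the frame
    overlay = frame.copy()
    cv2.putText(overlay, element.text, element.pos,
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
    a = element.alpha()
    cv2.addWeighted(overlay, a, frame, 1 - a, 0, dst=frame)
    return frame
```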
Audio-Video Synchronization: Managing separate threads for audio recording and video capture while keeping the two streams in sync. Solved using timestamp-based alignment and buffer management.
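A minimal sketch of such timestamp-based alignment, assuming both threads stamp their data with time.monotonic(); the buffer names and the 20 ms tolerance are illustrative.

```python
import time
from collections import deque

video_buf = deque()   # (timestamp, frame)
audio_buf = deque()   # (timestamp, chunk)
SYNC_TOL = 0.02       # accept pairs within 20 ms of each other

def push_frame(frame):
    video_buf.append((time.monotonic(), frame))

def push_audio(chunk):
    audio_buf.append((time.monotonic(), chunk))

def pop_synced_pair():
    """Return the oldest frame/chunk pair whose timestamps agree, else None."""
    while video_buf and audio_buf:
        vt, frame = video_buf[0]
        at, chunk = audio_buf[0]
        if abs(vt - at) <= SYNC_TOL:
            video_buf.popleft()
            audio_buf.popleft()
            return frame, chunk
        # drop whichever side is lagging behind
        if vt < at:
            video_buf.popleft()
        else:
            audio_buf.popleft()
    return None
```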
Performance Optimization: Balancing multiple CV tasks without frame drops. Implemented selective frame processing and adaptive quality settings based on system load.
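A minimal sketch of selective frame processing, where the heavy detectors run only every Nth frame and N adapts to the measured frame time; the thresholds and names are illustrative.

```python
TARGET_FRAME_S = 1 / 30   # 30 FPS budget
detect_every = 2          # run heavy CV on every 2nd frame initially
frame_idx = 0

def should_run_detectors():
    """Return True only on frames selected for full detection."""
    global frame_idx
    frame_idx += 1
    return frame_idx % detect_every == 0

def adapt(frame_time_s):
    """Widen or narrow the detection interval based on recent load."""
    global detect_every
    if frame_time_s > TARGET_FRAME_S and detect_every < 8:
        detect_every += 1          # under load: skip more frames
    elif frame_time_s < TARGET_FRAME_S * 0.6 and detect_every > 1:
        detect_every -= 1          # headroom: detect more often
```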
Command Accuracy: Preventing false positives in noisy environments. Added confidence thresholds, command debouncing, and context-aware filtering.
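A minimal sketch of context-aware filtering, where a recognized command is accepted only if it makes sense in the current capture state; the state and command names are hypothetical.

```python
VALID_IN_STATE = {
    "idle":      {"take photo", "start recording"},
    "recording": {"stop recording", "take photo"},
}

def accept(command, state):
    """Reject commands that are meaningless in the current mode."""
    return command in VALID_IN_STATE.get(state, set())
```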