Real-Time Sign Language Translation System
Translates sign language gestures captured via flex sensors into speech in real time — achieving 25ms inference latency and 120ms end-to-end response.
Motivation
Over 70 million people worldwide use sign language as their primary means of communication. Yet the gap between sign language users and those who don't understand it remains wide — most real-time translation systems are either too slow, too expensive, or too dependent on cameras and controlled lighting.
This project explores a different approach: flex sensors on a glove, capturing finger bend angles directly, combined with an on-device LSTM model and a streaming inference pipeline that translates gestures into speech in real time.
System Architecture
The pipeline flows in one direction with minimal latency at each stage:
Flex Sensors (Arduino)
↓
Serial Stream
↓
Sliding Window Buffer
↓
LSTM Inference
↓
Smoothing + Confidence Filter
↓
Gesture → Sentence Mapping
↓
TTS Output + Streamlit UI
Hardware Layer — Arduino + Flex Sensors
Five flex sensors are mounted on a glove — one per finger. As the hand forms a gesture, each sensor's resistance changes in proportion to its bend angle. The Arduino reads these analog values at high frequency and streams them over serial to the host machine.
Key challenge: Raw sensor data is noisy. Small vibrations, finger tremors, and cable flex introduce jitter that — left unfiltered — causes spurious predictions.
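On the host side, the first defensive step is frame validation. The exact wire format isn't specified above, so the sketch below assumes a common convention: one comma-separated line of five 10-bit ADC readings per sample. Partial or corrupted lines — frequent when a serial read lands mid-frame — are rejected rather than passed downstream.

```python
def parse_frame(line: str, n_sensors: int = 5):
    """Parse one serial line of comma-separated ADC readings.

    Returns a list of n_sensors ints, or None for malformed frames
    (partial lines are common when a read lands mid-frame).
    """
    parts = line.strip().split(",")
    if len(parts) != n_sensors:
        return None
    try:
        values = [int(p) for p in parts]
    except ValueError:
        return None
    # Arduino's 10-bit ADC yields values in [0, 1023]; anything else is garbage.
    if any(not 0 <= v <= 1023 for v in values):
        return None
    return values
```

Dropping bad frames here is cheap; letting them reach the model produces exactly the spurious predictions described above.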
Signal Processing — Sliding Window
Instead of classifying individual sensor readings, the system buffers the last N timesteps into a sliding window. This captures the temporal shape of a gesture — the motion arc matters as much as the final position.
Window parameters were tuned empirically to balance responsiveness and noise rejection.
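The buffering step can be sketched with a fixed-length deque; the window size of 30 frames here is illustrative, since the actual parameters were tuned empirically as noted above.

```python
from collections import deque


class SlidingWindow:
    """Fixed-length buffer of the most recent sensor frames."""

    def __init__(self, window_size: int = 30, n_sensors: int = 5):
        self.window_size = window_size
        self.n_sensors = n_sensors
        # deque with maxlen silently evicts the oldest frame on overflow.
        self.buffer = deque(maxlen=window_size)

    def push(self, frame):
        self.buffer.append(frame)

    def ready(self) -> bool:
        """True once enough frames have arrived to classify."""
        return len(self.buffer) == self.window_size

    def as_sequence(self):
        # Shape (window_size, n_sensors) — the sequence the LSTM consumes.
        return list(self.buffer)
```

Because the deque evicts the oldest frame automatically, the window slides one timestep per incoming frame, so inference can run on every sample without re-copying history.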
Model — LSTM Classifier
A multi-layer LSTM is trained on windowed sensor sequences for each gesture class. LSTMs are well-suited here because:
- They model temporal dependencies across the gesture arc
- They handle variable-speed executions of the same gesture
- They generalize across slight sensor drift
Training data was collected across multiple sessions to capture natural variation in gesture speed and hand size.
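When windows come from multi-session recordings, adjacent windows overlap heavily, so a naive shuffle leaks near-duplicates into validation. One way to avoid that — a sketch, not necessarily the split used in this project — is to hold out whole sessions:

```python
def split_by_session(windows, session_ids, holdout_sessions):
    """Split windowed sequences into train/validation by recording session.

    Splitting at the session level (rather than shuffling windows)
    keeps overlapping windows from the same recording on one side
    of the train/validation boundary, giving an honest estimate of
    generalization to unseen sessions.
    """
    train, val = [], []
    for window, sid in zip(windows, session_ids):
        (val if sid in holdout_sessions else train).append(window)
    return train, val
```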
Post-Processing — Smoothing & Confidence Thresholds
Raw LSTM outputs are passed through a smoothing layer (running average over recent predictions) and a confidence threshold filter. Only predictions above a set confidence score trigger output — this eliminates low-confidence flickering between gesture classes.
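The smoothing-plus-gating stage can be sketched as a small stateful filter. The history depth of 5 and threshold of 0.8 are illustrative placeholders — the source only says "a set confidence score":

```python
from collections import deque


class PredictionFilter:
    """Running average over recent class probabilities plus a confidence gate.

    Emits a class index only when the averaged probability of the top
    class clears `threshold`; otherwise returns None, suppressing the
    low-confidence flicker between gesture classes.
    """

    def __init__(self, n_classes: int, history: int = 5, threshold: float = 0.8):
        self.n_classes = n_classes
        self.threshold = threshold
        self.history = deque(maxlen=history)

    def update(self, probs):
        self.history.append(probs)
        n = len(self.history)
        # Element-wise running average across the buffered predictions.
        avg = [sum(p[i] for p in self.history) / n for i in range(self.n_classes)]
        best = max(range(self.n_classes), key=avg.__getitem__)
        return best if avg[best] >= self.threshold else None
```

Averaging before thresholding means a single noisy frame cannot flip the output; the class must stay dominant across several consecutive windows.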
Output — TTS + Streamlit UI
Confirmed gesture predictions are mapped to words or phrases and passed to a text-to-speech engine. A Streamlit dashboard displays the live prediction stream, confidence scores, and assembled sentences.
A FastAPI backend exposes real-time prediction endpoints — making the system extensible to mobile clients or web interfaces.
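The gesture-to-sentence step can be sketched as a simple lookup plus accumulator. The vocabulary entries below are invented examples; the real phrase table is project-specific:

```python
class SentenceBuilder:
    """Maps confirmed gesture labels to phrases and assembles sentences."""

    # Illustrative vocabulary — the actual gesture/phrase mapping
    # depends on the trained gesture classes.
    PHRASES = {
        "hello": "Hello",
        "thanks": "Thank you",
        "water": "I would like some water",
    }

    def __init__(self):
        self.words = []

    def commit(self, gesture: str):
        """Append the phrase for a confirmed gesture; returns it for TTS."""
        phrase = self.PHRASES.get(gesture)
        if phrase is not None and (not self.words or self.words[-1] != phrase):
            # Suppress immediate repeats of the same confirmed gesture.
            self.words.append(phrase)
        return phrase

    def sentence(self) -> str:
        return ". ".join(self.words)
```

The returned phrase is what would be handed to the TTS engine, while the accumulated sentence feeds the dashboard's "assembled sentences" view.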
Challenges & Solutions
Sensor drift over time: Flex sensors change resistance slightly as they warm up. Solved with a per-session calibration step that normalizes raw readings to a baseline.
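One common way to implement such a calibration step — assumed here, since the source only says readings are normalized to a baseline — is to record a flat-hand and a fist pose at session start, then min-max rescale each finger between them:

```python
class SessionCalibrator:
    """Per-session normalization against user-recorded reference poses.

    flat_readings / fist_readings are per-finger ADC values captured
    during a short calibration phase at the start of the session.
    Rescaling to [0, 1] per finger cancels warm-up resistance drift
    between sessions.
    """

    def __init__(self, flat_readings, fist_readings):
        self.lo = flat_readings
        self.hi = fist_readings

    def normalize(self, frame):
        out = []
        for v, lo, hi in zip(frame, self.lo, self.hi):
            span = hi - lo
            # Guard against a dead sensor reporting a zero span.
            out.append((v - lo) / span if span else 0.0)
        return out
```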
Gesture boundary detection: Knowing when one gesture ends and another begins is non-trivial. Implemented a silence detection threshold — when all sensor readings return to a neutral range, the current gesture is committed.
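The neutral-range commit logic can be sketched as follows; the band and the number of consecutive quiet frames are illustrative values standing in for the empirically tuned threshold:

```python
class GestureSegmenter:
    """Commits the current gesture once all sensors sit in a neutral band.

    Assumes calibrated readings in [0, 1]. Requiring several consecutive
    quiet frames (debouncing) prevents a single noisy sample from
    triggering a premature commit.
    """

    def __init__(self, neutral_band=(0.0, 0.15), min_quiet_frames=5):
        self.band = neutral_band
        self.min_quiet = min_quiet_frames
        self.quiet_count = 0

    def is_neutral(self, frame) -> bool:
        lo, hi = self.band
        return all(lo <= v <= hi for v in frame)

    def update(self, frame) -> bool:
        """Returns True exactly once, when the gesture should be committed."""
        if self.is_neutral(frame):
            self.quiet_count += 1
            if self.quiet_count == self.min_quiet:
                return True
        else:
            self.quiet_count = 0
        return False
```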
Latency budget: Keeping end-to-end latency under 150ms required profiling every stage. The bottleneck was TTS synthesis — addressed by using a lightweight local TTS engine rather than a cloud API.
What's Next
- Expanding the gesture vocabulary with more signs
- Training a user-adaptive model that fine-tunes to individual hand geometry
- Exploring camera-based keypoint detection (MediaPipe) as a sensor-free alternative
- Packaging as a standalone Raspberry Pi edge device