This project demonstrates a monocular depth estimation model built with TensorFlow and Keras. It uses a U-Net-like architecture to predict depth maps from single RGB images, leveraging the DIODE dataset for training and validation.
Features
The implementation combines four components:
- U-Net Architecture: Utilizes skip connections between downscaling and upscaling blocks to preserve spatial information.
- Multi-Component Loss: Combines SSIM, L1, and Edge Smoothness losses for high-quality depth map generation.
- Custom Data Pipeline: Efficiently handles the DIODE dataset with a custom Keras DataGenerator.
- Optimized Training: Uses the Adam optimizer with a specific learning rate schedule for stable convergence.
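The skip connections described in the list above can be illustrated with a minimal sketch. This is not the project's actual model: the helper names (`down_block`, `up_block`, `build_tiny_unet`), the filter counts, the 64x64 input size, and the sigmoid output (which assumes depth values normalized to [0, 1]) are all assumptions chosen to keep the example small.

```python
import tensorflow as tf
from tensorflow.keras import layers


def down_block(x, filters):
    """Encoder stage: convolve, then halve the spatial resolution.

    Returns the pre-pooling features (kept for the skip connection)
    and the pooled features (passed to the next stage).
    """
    skip = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    pooled = layers.MaxPool2D(2)(skip)
    return skip, pooled


def up_block(x, skip, filters):
    """Decoder stage: upsample, then concatenate the matching encoder features."""
    x = layers.UpSampling2D(2)(x)
    x = layers.Concatenate()([x, skip])  # the skip connection
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)


def build_tiny_unet(height=64, width=64):
    """Hypothetical two-level U-Net mapping an RGB image to a 1-channel depth map."""
    inputs = tf.keras.Input(shape=(height, width, 3))
    s1, p1 = down_block(inputs, 16)
    s2, p2 = down_block(p1, 32)
    bottleneck = layers.Conv2D(64, 3, padding="same", activation="relu")(p2)
    u2 = up_block(bottleneck, s2, 32)
    u1 = up_block(u2, s1, 16)
    # Assumption: depth is normalized to [0, 1], hence the sigmoid
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(u1)
    return tf.keras.Model(inputs, outputs)


model = build_tiny_unet()
```

Each `up_block` receives features from the encoder stage with the same resolution, so detail lost by pooling is available again when the decoder reconstructs the depth map.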
Code Example
The core of the project is the custom loss function, which combines structural similarity, pixel-wise accuracy, and edge smoothness:
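import tensorflow as tf

def calculate_loss(self, target, pred):
    # Image gradients of the target and the prediction, used by the smoothness term
    dy_true, dx_true = tf.image.image_gradients(target)
    dy_pred, dx_pred = tf.image.image_gradients(pred)
    weights_x = tf.exp(tf.reduce_mean(tf.abs(dx_true)))
    weights_y = tf.exp(tf.reduce_mean(tf.abs(dy_true)))
    # Edge Smoothness Loss: prediction gradients scaled by the target-derived weights
    smoothness_x = dx_pred * weights_x
    smoothness_y = dy_pred * weights_y
    depth_smoothness_loss = (tf.reduce_mean(tf.abs(smoothness_x))
                             + tf.reduce_mean(tf.abs(smoothness_y)))
    # Structural Similarity (SSIM) Loss
    ssim_loss = tf.reduce_mean(1 - tf.image.ssim(target, pred, max_val=1.0, filter_size=7))
    # L1 Pixel-wise Loss
    l1_loss = tf.reduce_mean(tf.abs(target - pred))
    # Total weighted loss
    return 0.85 * ssim_loss + 0.1 * l1_loss + 0.9 * depth_smoothness_loss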
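To get a feel for the building blocks of the loss above, they can be evaluated in isolation on toy inputs. The sketch below is illustrative only: the 8x8 `ramp` and `flat` tensors are made up and stand in for normalized depth maps.

```python
import tensorflow as tf

# A 1x8x8x1 horizontal ramp: every row goes from 0.0 to 1.0 in steps of 1/7
row = tf.linspace(0.0, 1.0, 8)
ramp = tf.reshape(tf.tile(row[tf.newaxis, :], [8, 1]), [1, 8, 8, 1])
# A constant image with the same shape
flat = tf.zeros_like(ramp) + 0.5

# image_gradients returns forward differences (dy, dx); the last row of dy
# and the last column of dx are zero-padded
dy, dx = tf.image.image_gradients(ramp)
mean_abs_dx = tf.reduce_mean(tf.abs(dx))  # 7 of 8 columns hold 1/7, so about 0.125
mean_abs_dy = tf.reduce_mean(tf.abs(dy))  # rows are identical, so 0

# SSIM is approximately 1 for identical images and lower for differing ones,
# so (1 - SSIM) acts as a loss
ssim_same = tf.image.ssim(ramp, ramp, max_val=1.0, filter_size=7)
ssim_diff = tf.image.ssim(ramp, flat, max_val=1.0, filter_size=7)

# Pixel-wise L1 distance between the two images
l1 = tf.reduce_mean(tf.abs(ramp - flat))

print(float(mean_abs_dx), float(mean_abs_dy))
print(float(ssim_same[0]), float(ssim_diff[0]), float(l1))
```

Note that `tf.image.ssim` needs inputs at least as large as `filter_size` in both spatial dimensions, which is why the toy images are 8x8 rather than smaller.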
Images
Visualizing the predicted depth maps against the ground truth:
<Image src="/depth-prediction-sample.png" alt="Depth Estimation Results" width={800} height={400} />
Video
Watch the model perform real-time depth estimation on a video sequence:
Links and Buttons
Explore the project further or try out the interactive demo:
Visit Demo
View Source Code
Tables
The following table summarizes the hyperparameters used during the training process:
Conclusion
This implementation provides a solid foundation for monocular depth estimation tasks. By combining structural, pixel-wise, and edge smoothness losses, the model can capture both the overall geometry and the fine details of the scene.