Monocular Depth Estimation with Keras

This project demonstrates a monocular depth estimation model built with TensorFlow and Keras. It uses a U-Net-like architecture to predict depth maps from single RGB images, leveraging the DIODE dataset for training and validation.

Features

The implementation combines four components:

U-Net Architecture: Utilizes skip connections between downscaling and upscaling blocks to preserve spatial information.
Multi-Component Loss: Combines SSIM, L1, and Edge Smoothness losses for high-quality depth map generation.
Custom Data Pipeline: Efficiently handles the DIODE dataset with a custom Keras DataGenerator.
Optimized Training: Uses the Adam optimizer with a specific learning rate schedule for stable convergence.
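
To make the skip-connection idea from the first feature concrete, here is a framework-free toy sketch in plain Python. It is only shape bookkeeping on a single-channel grid with even height and width. The names `downscale`, `upscale`, and `unet_level` are hypothetical, and the average pooling and nearest-neighbour upsampling are stand-ins for learned convolutional blocks, not the project's actual layers.

```python
def downscale(grid):
    """2x2 average pooling; a stand-in for a learned downscaling block."""
    h, w = len(grid), len(grid[0])
    return [
        [(grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def upscale(grid):
    """Nearest-neighbour 2x upsampling; a stand-in for a learned upscaling block."""
    return [[v for v in row for _ in range(2)] for row in grid for _ in range(2)]


def unet_level(grid):
    """One encoder/decoder level with a skip connection (toy version)."""
    skip = grid                  # full-resolution features, kept aside
    coarse = downscale(grid)     # half resolution: fine detail is averaged away
    restored = upscale(coarse)   # back to full size, but blocky
    # Concatenate along a "channel" axis: each pixel becomes (decoder value, skip value)
    return [[(u, s) for u, s in zip(u_row, s_row)]
            for u_row, s_row in zip(restored, skip)]


features = [[float(4 * i + j) for j in range(4)] for i in range(4)]
merged = unet_level(features)
print(merged[0][1])  # (2.5, 1.0): the 2x2 block average next to the original pixel
```

The second entry of each pair is the untouched input pixel, which is the spatial information a decoder without skip connections would have lost to pooling.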

Code Example

The core of the project is the custom loss function, which combines structural similarity, pixel-wise accuracy, and depth smoothness in a single weighted sum:

import tensorflow as tf

def calculate_loss(self, target, pred):
    # Image gradients of the ground truth and the prediction
    dy_true, dx_true = tf.image.image_gradients(target)
    dy_pred, dx_pred = tf.image.image_gradients(pred)

    # Scalar weights from the mean ground-truth gradient magnitude
    weights_x = tf.exp(tf.reduce_mean(tf.abs(dx_true)))
    weights_y = tf.exp(tf.reduce_mean(tf.abs(dy_true)))

    # Depth smoothness loss: prediction gradients scaled by the weights above
    depth_smoothness_loss = (
        tf.reduce_mean(tf.abs(dx_pred * weights_x))
        + tf.reduce_mean(tf.abs(dy_pred * weights_y))
    )

    # Structural Similarity (SSIM) loss
    ssim_loss = tf.reduce_mean(1 - tf.image.ssim(target, pred, max_val=1.0, filter_size=7))

    # L1 pixel-wise loss
    l1_loss = tf.reduce_mean(tf.abs(target - pred))

    # Total weighted loss
    return 0.85 * ssim_loss + 0.1 * l1_loss + 0.9 * depth_smoothness_loss
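
The arithmetic behind the L1 and smoothness terms can be checked by hand on a tiny example. The sketch below is plain Python rather than TensorFlow and makes two assumptions: the smoothness term is the mean absolute gradient of the prediction scaled by the exponential weights, and the SSIM term is passed in as a number instead of being computed. The helper names (`image_gradients`, `mean_abs`, `smoothness_and_l1`, `total_loss`) are illustrative only.

```python
import math


def image_gradients(img):
    """Forward differences with zeros in the last row (dy) and last column (dx),
    the same convention tf.image.image_gradients uses."""
    h, w = len(img), len(img[0])
    dy = [[img[i + 1][j] - img[i][j] if i + 1 < h else 0.0 for j in range(w)]
          for i in range(h)]
    dx = [[img[i][j + 1] - img[i][j] if j + 1 < w else 0.0 for j in range(w)]
          for i in range(h)]
    return dy, dx


def mean_abs(grid):
    """Mean of absolute values over a 2D list."""
    values = [abs(v) for row in grid for v in row]
    return sum(values) / len(values)


def smoothness_and_l1(target, pred):
    dy_true, dx_true = image_gradients(target)
    dy_pred, dx_pred = image_gradients(pred)
    # Scalar weights from the ground-truth gradients
    weights_x = math.exp(mean_abs(dx_true))
    weights_y = math.exp(mean_abs(dy_true))
    # Assumed smoothness term: weighted mean absolute gradient of the prediction
    smoothness = weights_x * mean_abs(dx_pred) + weights_y * mean_abs(dy_pred)
    # Pixel-wise L1 term
    l1 = mean_abs([[t - p for t, p in zip(t_row, p_row)]
                   for t_row, p_row in zip(target, pred)])
    return smoothness, l1


def total_loss(ssim_loss, l1_loss, smoothness_loss):
    # Same weights as the return statement above
    return 0.85 * ssim_loss + 0.1 * l1_loss + 0.9 * smoothness_loss


target = [[0.0, 0.0], [1.0, 1.0]]  # one horizontal depth edge
pred = [[0.0, 0.0], [0.5, 0.5]]    # same edge, half the height
smoothness, l1 = smoothness_and_l1(target, pred)
ssim_placeholder = 0.2             # made-up value; SSIM is not computed in this sketch
print(l1, smoothness, total_loss(ssim_placeholder, l1, smoothness))
```

Here the L1 term is 0.25 and the smoothness term is 0.25 · e^0.5, since only the vertical gradients are non-zero. With a weight of 0.9, the smoothness term can outweigh the L1 term even on a small example like this one.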

Images

Visualizing the predicted depth maps against the ground truth:

<Image src="/depth-prediction-sample.png" alt="Depth Estimation Results" width={800} height={400} />

Video

Watch the model perform real-time depth estimation on a video sequence:

Depth Estimation Demo

Links and Buttons

Explore the project further or try out the interactive demo:

Visit Demo

View Source Code

Tables

The following table summarizes the hyperparameters used during the training process:

Conclusion

This implementation provides a solid foundation for monocular depth estimation tasks. By combining structural, pixel-wise, and edge smoothness losses, the model can capture both the overall geometry and the fine details of the scene.