Pre-trained weights: R50+ViT-B/16
Input shape: (224, 224, 3)
Encoder trainable: False
Batch size: 24
Epochs: 1,290
Input Image (224, 224, 3) → Vision Transformer (224, 224, 32) → Conv1x1 (224, 224, 1)
Total params: 100,840,640
Trainable params: 7,389,312
Non-trainable params: 93,451,328
- Remove malformed images from the entire dataset.
- Fill in polygons missing from the label data using CVAT. (Thanks to @kdh93)
- Export the ROIs, binary masks, and transformed original images.
Used tf.data.Dataset to boost training performance
- Scaling (1.0 / 255)
- Random Flip Left-Right
- Random Flip Up-Down
- Random Crop
- Random Brightness (-0.2 ~ +0.2)
- Gaussian Noise (mean = 0, stddev = 0.05)
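
The augmentation steps above could be wired into a `tf.data.Dataset` pipeline roughly as follows. This is a minimal sketch, not the project's actual code: the function names, the crop size of 224 (chosen here to match the input shape), and the choice to apply flips and crops jointly to image and mask are all assumptions.

```python
import tensorflow as tf

CROP_SIZE = 224  # assumption: crop to the model's input size


def augment(image, mask):
    """image: uint8 (H, W, 3); mask: uint8 (H, W, 1) with values 0 or 1."""
    image = tf.cast(image, tf.float32) / 255.0  # Scaling (1.0 / 255)
    mask = tf.cast(mask, tf.float32)

    # Stack image and mask so geometric transforms stay aligned.
    stacked = tf.concat([image, mask], axis=-1)  # (H, W, 4)
    stacked = tf.image.random_flip_left_right(stacked)
    stacked = tf.image.random_flip_up_down(stacked)
    stacked = tf.image.random_crop(stacked, size=(CROP_SIZE, CROP_SIZE, 4))
    image, mask = stacked[..., :3], stacked[..., 3:]

    # Photometric transforms touch the image only.
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = image + tf.random.normal(tf.shape(image), mean=0.0, stddev=0.05)
    return image, mask


def make_dataset(images, masks, batch_size=24):
    """Hypothetical helper: builds a shuffled, augmented, batched dataset."""
    ds = tf.data.Dataset.from_tensor_slices((images, masks))
    ds = ds.shuffle(len(images))
    ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Brightness and noise can push pixel values slightly outside [0, 1]; whether to clip afterwards is not stated in the list above, so the sketch leaves the values as they are.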
- BCE * 0.5 + Dice * 0.5
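
For reference, the equally weighted BCE + Dice loss can be written out in NumPy as below. It is only an illustration of the arithmetic: the clipping constant `eps` and the Dice smoothing term `smooth` are assumptions, since the loss line does not specify them.

```python
import numpy as np


def bce_dice_loss(y_true, y_pred, eps=1e-7, smooth=1.0):
    """0.5 * binary cross-entropy + 0.5 * Dice loss over all pixels."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)

    bce = -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

    intersection = np.sum(y_true * y_pred)
    dice_coeff = (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
    dice = 1.0 - dice_coeff

    return 0.5 * bce + 0.5 * dice
```

BCE penalizes each pixel independently, while the Dice term depends on overlap with the whole mask, which helps when foreground pixels are rare.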
- Binary IoU
- Number of classes = 2 (default)
- Threshold = 0.5
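
A NumPy sketch of the metric: predictions are binarized at the threshold, IoU is computed separately for the background (0) and foreground (1) classes, and the two values are averaged. Treating a score exactly equal to the threshold as foreground, and skipping a class that appears in neither the labels nor the predictions, are assumptions of this sketch.

```python
import numpy as np


def binary_iou(y_true, y_pred, threshold=0.5):
    """Mean IoU over classes 0 and 1 after thresholding y_pred."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    y_hat = np.asarray(y_pred).ravel() >= threshold

    ious = []
    for cls in (False, True):
        t = y_true == cls
        p = y_hat == cls
        union = np.logical_or(t, p).sum()
        if union == 0:
            continue  # class absent from both labels and predictions
        ious.append(np.logical_and(t, p).sum() / union)
    return float(np.mean(ious))
```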
- SGD
- Momentum = 0.9
- Learning Rate Scheduler
- Cosine Annealing Warmup Restarts
- First cycle steps = 100
- Initial learning rate = 1e-3
- First decay steps = 300
- t_mul = 1.0
- m_mul = 1.0 (default)
- alpha = 0.0 (default)
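
To make the schedule parameters concrete, here is a plain-Python sketch of cosine decay with restarts using the initial learning rate, first decay steps, `t_mul`, `m_mul` and `alpha` values listed above. The function name is hypothetical, and the warmup phase is not modeled, because the list does not say how the "First cycle steps = 100" value interacts with the decay period.

```python
import math


def cosine_restarts_lr(step, initial_lr=1e-3, first_decay_steps=300,
                       t_mul=1.0, m_mul=1.0, alpha=0.0):
    """Learning rate at `step` under cosine decay with restarts (no warmup)."""
    fraction = step / first_decay_steps
    if t_mul == 1.0:
        # Every cycle has the same length.
        i_restart = math.floor(fraction)
        fraction -= i_restart
    else:
        # Cycle lengths grow geometrically by t_mul.
        i_restart = math.floor(math.log(1.0 - fraction * (1.0 - t_mul)) / math.log(t_mul))
        completed = (1.0 - t_mul ** i_restart) / (1.0 - t_mul)
        fraction = (fraction - completed) / t_mul ** i_restart

    peak = m_mul ** i_restart  # peak scaling for the current cycle
    cosine = 0.5 * peak * (1.0 + math.cos(math.pi * fraction))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)
```

With `t_mul = 1.0` and `m_mul = 1.0`, every cycle is 300 steps long and restarts at the full 1e-3; with `alpha = 0.0`, the rate decays toward zero at the end of each cycle.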



