
Makeup Removal (Reference-Free) Not Working with Provided MT Masks: Almost No Change; Custom Masks Work on MT but Not on Real Images; 9+ min per Image on RTX 3080 Ti #7

@Muaz65

Description


Hi @Justin900429 and team,
Thank you for the amazing work on MAD!
We are trying to use MAD for reference-free makeup removal (--source-label 0 --target-label 1) on real-world images (high-resolution selfies), but despite extensive testing we are running into several critical issues.

1. Provided MT dataset masks produce almost no makeup removal

When running the official example:

python generate_translation.py \
  --config configs/model_256_256.yaml \
  --save-folder removal_results \
  --source-root data/mtdataset/images \
  --source-list assets/mt_makeup.txt \
  --source-label 0 --target-label 1 \
  --num-process 1 \
  --opts MODEL.PRETRAINED Justin900/MAD

→ The output is nearly identical to the input (almost zero visible makeup removal), even though the progress bars run for ~9-10 minutes for a single image.
We confirmed:

The parsing masks exist in data/mtdataset/parsing/makeup/
The code loads them (no "mask not found" warnings)
But the blending appears to be ignored or ineffective
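
To sanity-check this, a quick inspection like the one below can show which label IDs an MT parsing mask actually contains (the mask filename is a placeholder; the expected IDs 1/6 for eyes, 9/13 for lips, 4 for skin are our reading of the code, see question 3 below):

import numpy as np
from PIL import Image

# Placeholder path: point this at any mask under data/mtdataset/parsing/makeup/
mask_path = "data/mtdataset/parsing/makeup/EXAMPLE.png"

# Print every label ID present in the mask and its pixel count
mask = np.array(Image.open(mask_path))
labels, counts = np.unique(mask, return_counts=True)
for label, count in zip(labels, counts):
    print(f"label {label}: {count} px")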

2. Custom-generated masks (Segformer + remap) work perfectly on the MT dataset

When we generate masks using jonathandinu/face-parsing plus the CONVERT_DICT from misc/convert_beauty_face.py, makeup removal works excellently on the same MT images (clean, natural results).
Here is our mask generation script (tested and working):
import numpy as np
import torch
from PIL import Image
from torch import nn
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

# Official remap for jonathandinu/face-parsing (Segformer) to MAD labels (from code CONVERT_DICT + analysis)
CONVERT_DICT = {
    1: 4,   # skin → face skin
    2: 8,   # nose → nose
    4: 6,   # l_eye → left eye
    5: 1,   # r_eye → right eye
    6: 7,   # l_brow → left eyebrow
    7: 2,   # r_brow → right eyebrow
    8: 3,   # l_ear → left ear
    9: 5,   # r_ear → right ear
    10: 11, # mouth → teeth
    11: 9,  # u_lip → upper lip
    12: 13, # l_lip → lower lip
    13: 12, # hair → hair
    15: 0,  # ear_r (earring) → bg
    16: 0,  # neck_l (necklace) → bg
    17: 10, # neck → neck
    18: 0,  # cloth → bg
    3: 0,   # eye_g (glasses) → bg
    14: 0,  # hat → bg
    0: 0,   # bg
}

def generate_mask(img_path: str, save_path: str):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    image_processor = SegformerImageProcessor.from_pretrained("jonathandinu/face-parsing")
    model = SegformerForSemanticSegmentation.from_pretrained("jonathandinu/face-parsing")
    model.to(device)
    model.eval()  # inference only

    image = Image.open(img_path).convert("RGB")
    inputs = image_processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():  # no gradients needed for mask generation
        outputs = model(**inputs)
    logits = outputs.logits
    # upsample the low-resolution logits back to the original image size
    upsampled_logits = nn.functional.interpolate(
        logits,
        size=image.size[::-1],  # PIL size is (width, height); interpolate expects (height, width)
        mode="bilinear",
        align_corners=False,
    )
    labels = upsampled_logits.argmax(dim=1)[0]
    labels = labels.cpu().numpy()
    # remap the face-parsing label IDs to the MAD label space
    new_labels = np.copy(labels)
    for key, value in CONVERT_DICT.items():
        new_labels[labels == key] = value
    new_labels = new_labels.astype("uint8")
    new_labels = Image.fromarray(new_labels)
    new_labels.save(save_path)

generate_mask('/home/muaz/Desktop/Upwork/makeup-removal/MAD/data/custom/images/3147_aligned_256.jpg', '/home/muaz/Desktop/Upwork/makeup-removal/MAD/data/custom/parsing/3147_aligned_256.png')

→ These masks give excellent removal on MT images
→ But give poor results (incomplete removal, artifacts) on real client images.
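
For the real-image failures, we suspect the inputs need MT-style face alignment before generation (this is also question 1 below). Here is a rough sketch of the eye-based alignment we have been assuming; the dlib 68-landmark model is the standard one, and the 0.32 inter-eye ratio / 0.42 vertical placement are our own guesses rather than values from MAD or the MT dataset:

import cv2
import dlib
import numpy as np

# Standard dlib 68-landmark model, assumed to be downloaded separately
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

def align_face(img_path: str, out_path: str, out_size: int = 256):
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        raise RuntimeError(f"no face found in {img_path}")
    pts = np.array([(p.x, p.y) for p in predictor(gray, faces[0]).parts()], dtype=np.float64)
    # 68-point convention: indices 36-41 are the eye on the image left, 42-47 on the image right
    left_eye = pts[36:42].mean(axis=0)
    right_eye = pts[42:48].mean(axis=0)
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = float(np.degrees(np.arctan2(dy, dx)))          # rotate so the eyes become horizontal
    scale = (0.32 * out_size) / float(np.hypot(dx, dy))    # guessed inter-eye distance ratio
    eyes_center = (float((left_eye[0] + right_eye[0]) / 2),
                   float((left_eye[1] + right_eye[1]) / 2))
    M = cv2.getRotationMatrix2D(eyes_center, angle, scale)
    # shift so the eye midpoint lands at a fixed position in the output crop
    M[0, 2] += out_size / 2 - eyes_center[0]
    M[1, 2] += 0.42 * out_size - eyes_center[1]            # guessed vertical eye placement
    aligned = cv2.warpAffine(img, M, (out_size, out_size), flags=cv2.INTER_LINEAR)
    cv2.imwrite(out_path, aligned)

align_face("client_selfie.jpg", "client_selfie_aligned_256.jpg")  # placeholder paths

If you can share the exact alignment used for the MT images, we will match it instead of guessing these ratios.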

3. Questions for the authors

Could you please clarify:

What exact preprocessing / alignment is required for reference-free makeup removal to work reliably on real-world images?
(e.g., face centering, eye alignment, crop/zoom level matching the MT dataset style?)
Why do the official MT parsing masks result in almost no change?
Are the MT parsing masks already in MAD label space, or do they require conversion (like BeautyFace)?
The labels in MT masks appear different from the expected 1/6 (eyes), 9/13 (lips), 4 (skin).
Performance: Is 9+ minutes per 256×256 image normal on an RTX 3080 Ti?
We are seeing:

encode:  699/699 [04:50<...]
generate: 700/700 [04:52<...]
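
Related to this: the config shows EVAL.SAMPLE_STEPS: 1000 with the 'ddpm' scheduler, which lines up with the long encode/generate loops above. Assuming --opts can override EVAL.* keys the same way it overrides MODEL.PRETRAINED (we have not verified this, and we do not know how far the step count can be reduced before quality degrades), this is what we would try:

python generate_translation.py \
  --config configs/model_256_256.yaml \
  --save-folder removal_results_fast \
  --source-root data/mtdataset/images \
  --source-list assets/mt_makeup.txt \
  --source-label 0 --target-label 1 \
  --num-process 1 \
  --opts MODEL.PRETRAINED Justin900/MAD EVAL.SAMPLE_STEPS 250

Is reducing EVAL.SAMPLE_STEPS (or switching EVAL.SCHEDULER to something faster than DDPM, if supported) the intended way to speed up inference?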

Environment:

python=3.10.19
torch=2.7.1+cu118
transformers=4.52.3
diffusers=0.33.1
accelerate=1.7.0
dlib=20.0.0
opencv-python=4.11.0.86
GPU: RTX 3080 Ti (laptop)

Here is a Google Doc link with the results and comparisons:

https://docs.google.com/document/d/1cYzsazBdc7zHmwKC1uZ1DZ-5ESwN86uXDX6IkxNP6Ik/edit?usp=sharing

Overall output logs:

python generate_translation.py     --config configs/model_256_256.yaml     --save-folder removal_results     --source-root data/mtdataset/images     --source-list assets/mt_makeup.txt     --source-label 0     --target-label 1     --num-process 1     --opts MODEL.PRETRAINED Justin900/MAD
╒════════════════════════════════════════════════╕
│ Configuration                                  │
╞════════════════════════════════════════════════╡
│ EVAL:                                          │
│   BATCH_SIZE: 4                                │
│   ETA: 0.01                                    │
│   REFINE_ITERATIONS: 0                         │
│   REFINE_STEPS: 0                              │
│   SAMPLE_STEPS: 1000                           │
│   SCHEDULER: 'ddpm'                            │
│ MODEL:                                         │
│   BASE_DIM: 128                                │
│   DOWN_BLOCK_TYPE: ['CrossAttnDownBlock2D',    │
│      'CrossAttnDownBlock2D',                   │
│      'CrossAttnDownBlock2D',                   │
│      'CrossAttnDownBlock2D',                   │
│      'CrossAttnDownBlock2D',                   │
│      'DownBlock2D']                            │
│   IN_CHANNELS: 3                               │
│   LABEL_DIM: 2                                 │
│   LAYERS_PER_BLOCK: 2                          │
│   LAYER_SCALE: [1, 1, 2, 2, 4, 4]              │
│   OUT_CHANNELS: 3                              │
│   PRETRAINED: 'Justin900/MAD'                  │
│   UP_BLOCK_TYPE: ['CrossAttnUpBlock2D',        │
│      'CrossAttnUpBlock2D',                     │
│      'CrossAttnUpBlock2D',                     │
│      'CrossAttnUpBlock2D',                     │
│      'CrossAttnUpBlock2D',                     │
│      'UpBlock2D']                              │
│ PROJECT_DIR: 'runs/mixup_256_256'              │
│ PROJECT_NAME: 'Makeup Transfer with Diffusion' │
│ TRAIN:                                         │
│   BATCH_SIZE: 4                                │
│   EMA_INV_GAMMA: 1.0                           │
│   EMA_MAX_DECAY: 0.9999                        │
│   EMA_POWER: 0.75                              │
│   GRADIENT_ACCUMULATION_STEPS: 8               │
│   GRAD_NORM: 1.0                               │
│   IMAGE_SIZE: 256                              │
│   LOG_INTERVAL: 20                             │
│   LR: 0.0001                                   │
│   LR_WARMUP: 1000                              │
│   MAKEUP: None                                 │
│   MAX_ITER: 700000                             │
│   MIXED_PRECISION: 'fp16'                      │
│   NOISE_SCHEDULER:                             │
│     BETA_END: 0.02                             │
│     BETA_START: 0.0001                         │
│     PRED_TYPE: 'epsilon'                       │
│     TYPE: 'linear'                             │
│   NUM_WORKERS: 4                               │
│   RESUME: None                                 │
│   ROOT: ['data/mtdataset/images']              │
│   SAMPLE_INTERVAL: 10000                       │
│   SAMPLE_STEPS: 1000                           │
│   SAVE_INTERVAL: 10000                         │
│   TEXT_LABEL_PATH: 'data/mt_text_anno.json'    │
│   TIME_STEPS: 1000                             │
│ _BASE_: 'base.yaml'                            │
╘════════════════════════════════════════════════╛
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1764872594.935304 3193079 gl_context_egl.cc:85] Successfully initialized EGL. Major : 1 Minor: 5
I0000 00:00:1764872595.112303 3193394 gl_context.cc:344] GL version: 3.2 (OpenGL ES 3.2 NVIDIA 580.95.05), renderer: NVIDIA GeForce RTX 3080 Ti Laptop GPU/PCIe/SSE2
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
encode: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 699/699 [04:50<00:00,  2.40it/s]
generate: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 700/700 [04:52<00:00,  2.39it/s]
encode: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 699/699 [04:51<00:00,  2.40it/s]
generate: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 700/700 [04:51<00:00,  2.40it/s]
  0%|                                                                                                                            | 0/2 [19:25<?, ?it/s]

We love the model and want to use it in production, but currently it's not reliable on real images despite working perfectly with our own masks on the training data.
Any guidance on correct preprocessing, mask expectations, or known issues would be immensely appreciated!
Thank you so much!
