
Makeup Removal (Reference-Free) Not Working with Provided MT Masks: Almost No Change; Custom Masks Work on MT but Not on Real Images; 9+ min per Image on RTX 3080 Ti #7

@Muaz65

Description


Hi @Justin900429 and team,
Thank you for the amazing work on MAD!
We are trying to use MAD for reference-free makeup removal (--source-label 0 --target-label 1) on real-world images (high-resolution selfies), but despite extensive testing we are running into several critical issues.

1. Provided MT dataset masks produce almost no makeup removal

When running the official example:

python generate_translation.py \
  --config configs/model_256_256.yaml \
  --save-folder removal_results \
  --source-root data/mtdataset/images \
  --source-list assets/mt_makeup.txt \
  --source-label 0 --target-label 1 \
  --num-process 1 \
  --opts MODEL.PRETRAINED Justin900/MAD

→ The output is nearly identical to the input (almost zero visible makeup removal), even though the progress bars run for ~9-10 minutes for a single image.
We confirmed:

The parsing masks exist in data/mtdataset/parsing/makeup/
The code loads them (no "mask not found" warnings)
But the blending appears to be ignored or ineffective
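
To sanity-check this, a quick inspection like the one below can show which label IDs an MT parsing mask actually contains (the mask filename is a placeholder; the expected IDs 1/6 for eyes, 9/13 for lips, 4 for skin are our reading of the code, see question 3 below):

import numpy as np
from PIL import Image

# Placeholder path: point this at any mask under data/mtdataset/parsing/makeup/
mask_path = "data/mtdataset/parsing/makeup/EXAMPLE.png"

# Print every label ID present in the mask and its pixel count
mask = np.array(Image.open(mask_path))
labels, counts = np.unique(mask, return_counts=True)
for label, count in zip(labels, counts):
    print(f"label {label}: {count} px")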

2. Custom-generated masks (Segformer + remap) work perfectly on the MT dataset

When we generate masks using jonathandinu/face-parsing plus the CONVERT_DICT from misc/convert_beauty_face.py, makeup removal works excellently on the same MT images (clean, natural results).
Here is our mask generation script (tested and working):
import numpy as np
import torch
from PIL import Image
from torch import nn
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

# Official remap for jonathandinu/face-parsing (Segformer) to MAD labels (from code CONVERT_DICT + analysis)
CONVERT_DICT = {
    1: 4,   # skin → face skin
    2: 8,   # nose → nose
    4: 6,   # l_eye → left eye
    5: 1,   # r_eye → right eye
    6: 7,   # l_brow → left eyebrow
    7: 2,   # r_brow → right eyebrow
    8: 3,   # l_ear → left ear
    9: 5,   # r_ear → right ear
    10: 11, # mouth → teeth
    11: 9,  # u_lip → upper lip
    12: 13, # l_lip → lower lip
    13: 12, # hair → hair
    15: 0,  # ear_r (earring) → bg
    16: 0,  # neck_l (necklace) → bg
    17: 10, # neck → neck
    18: 0,  # cloth → bg
    3: 0,   # eye_g (glasses) → bg
    14: 0,  # hat → bg
    0: 0,   # bg
}

def generate_mask(img_path: str, save_path: str):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    image_processor = SegformerImageProcessor.from_pretrained("jonathandinu/face-parsing")
    model = SegformerForSemanticSegmentation.from_pretrained("jonathandinu/face-parsing")
    model.to(device)
    model.eval()  # inference only

    image = Image.open(img_path).convert("RGB")
    inputs = image_processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():  # no gradients needed for mask generation
        outputs = model(**inputs)
    logits = outputs.logits
    # upsample the low-resolution logits back to the original image size
    upsampled_logits = nn.functional.interpolate(
        logits,
        size=image.size[::-1],  # PIL size is (width, height); interpolate expects (height, width)
        mode="bilinear",
        align_corners=False,
    )
    labels = upsampled_logits.argmax(dim=1)[0]
    labels = labels.cpu().numpy()
    # remap the face-parsing label IDs to the MAD label space
    new_labels = np.copy(labels)
    for key, value in CONVERT_DICT.items():
        new_labels[labels == key] = value
    new_labels = new_labels.astype("uint8")
    new_labels = Image.fromarray(new_labels)
    new_labels.save(save_path)

generate_mask('/home/muaz/Desktop/Upwork/makeup-removal/MAD/data/custom/images/3147_aligned_256.jpg', '/home/muaz/Desktop/Upwork/makeup-removal/MAD/data/custom/parsing/3147_aligned_256.png')

→ These masks give excellent removal on MT images
→ But give poor results (incomplete removal, artifacts) on real client images.
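
For the real-image failures, we suspect the inputs need MT-style face alignment before generation (this is also question 1 below). Here is a rough sketch of the eye-based alignment we have been assuming; the dlib 68-landmark model is the standard one, and the 0.32 inter-eye ratio / 0.42 vertical placement are our own guesses rather than values from MAD or the MT dataset:

import cv2
import dlib
import numpy as np

# Standard dlib 68-landmark model, assumed to be downloaded separately
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

def align_face(img_path: str, out_path: str, out_size: int = 256):
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        raise RuntimeError(f"no face found in {img_path}")
    pts = np.array([(p.x, p.y) for p in predictor(gray, faces[0]).parts()], dtype=np.float64)
    # 68-point convention: indices 36-41 are the eye on the image left, 42-47 on the image right
    left_eye = pts[36:42].mean(axis=0)
    right_eye = pts[42:48].mean(axis=0)
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = float(np.degrees(np.arctan2(dy, dx)))          # rotate so the eyes become horizontal
    scale = (0.32 * out_size) / float(np.hypot(dx, dy))    # guessed inter-eye distance ratio
    eyes_center = (float((left_eye[0] + right_eye[0]) / 2),
                   float((left_eye[1] + right_eye[1]) / 2))
    M = cv2.getRotationMatrix2D(eyes_center, angle, scale)
    # shift so the eye midpoint lands at a fixed position in the output crop
    M[0, 2] += out_size / 2 - eyes_center[0]
    M[1, 2] += 0.42 * out_size - eyes_center[1]            # guessed vertical eye placement
    aligned = cv2.warpAffine(img, M, (out_size, out_size), flags=cv2.INTER_LINEAR)
    cv2.imwrite(out_path, aligned)

align_face("client_selfie.jpg", "client_selfie_aligned_256.jpg")  # placeholder paths

If you can share the exact alignment used for the MT images, we will match it instead of guessing these ratios.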

3. Questions for the authors

Could you please clarify:

What exact preprocessing / alignment is required for reference-free makeup removal to work reliably on real-world images?
(e.g., face centering, eye alignment, crop/zoom level matching the MT dataset style?)
Why do the official MT parsing masks result in almost no change?
Are the MT parsing masks already in MAD label space, or do they require conversion (like BeautyFace)?
The labels in MT masks appear different from the expected 1/6 (eyes), 9/13 (lips), 4 (skin).
Performance: Is 9+ minutes per 256×256 image normal on an RTX 3080 Ti?
We are seeing:

encode:  699/699 [04:50<...]
generate: 700/700 [04:52<...]
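
Related to this: the config shows EVAL.SAMPLE_STEPS: 1000 with the 'ddpm' scheduler, which lines up with the long encode/generate loops above. Assuming --opts can override EVAL.* keys the same way it overrides MODEL.PRETRAINED (we have not verified this, and we do not know how far the step count can be reduced before quality degrades), this is what we would try:

python generate_translation.py \
  --config configs/model_256_256.yaml \
  --save-folder removal_results_fast \
  --source-root data/mtdataset/images \
  --source-list assets/mt_makeup.txt \
  --source-label 0 --target-label 1 \
  --num-process 1 \
  --opts MODEL.PRETRAINED Justin900/MAD EVAL.SAMPLE_STEPS 250

Is reducing EVAL.SAMPLE_STEPS (or switching EVAL.SCHEDULER to something faster than DDPM, if supported) the intended way to speed up inference?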

Environment:

python=3.10.19
torch=2.7.1+cu118
transformers=4.52.3
diffusers=0.33.1
accelerate=1.7.0
dlib=20.0.0
opencv-python=4.11.0.86
GPU: RTX 3080 Ti (laptop)

Here is a Google Doc link with the results and comparisons:

https://docs.google.com/document/d/1cYzsazBdc7zHmwKC1uZ1DZ-5ESwN86uXDX6IkxNP6Ik/edit?usp=sharing

Overall output logs:

python generate_translation.py     --config configs/model_256_256.yaml     --save-folder removal_results     --source-root data/mtdataset/images     --source-list assets/mt_makeup.txt     --source-label 0     --target-label 1     --num-process 1     --opts MODEL.PRETRAINED Justin900/MAD
╒════════════════════════════════════════════════╕
│ Configuration                                  │
╞════════════════════════════════════════════════╡
│ EVAL:                                          │
│   BATCH_SIZE: 4                                │
│   ETA: 0.01                                    │
│   REFINE_ITERATIONS: 0                         │
│   REFINE_STEPS: 0                              │
│   SAMPLE_STEPS: 1000                           │
│   SCHEDULER: 'ddpm'                            │
│ MODEL:                                         │
│   BASE_DIM: 128                                │
│   DOWN_BLOCK_TYPE: ['CrossAttnDownBlock2D',    │
│      'CrossAttnDownBlock2D',                   │
│      'CrossAttnDownBlock2D',                   │
│      'CrossAttnDownBlock2D',                   │
│      'CrossAttnDownBlock2D',                   │
│      'DownBlock2D']                            │
│   IN_CHANNELS: 3                               │
│   LABEL_DIM: 2                                 │
│   LAYERS_PER_BLOCK: 2                          │
│   LAYER_SCALE: [1, 1, 2, 2, 4, 4]              │
│   OUT_CHANNELS: 3                              │
│   PRETRAINED: 'Justin900/MAD'                  │
│   UP_BLOCK_TYPE: ['CrossAttnUpBlock2D',        │
│      'CrossAttnUpBlock2D',                     │
│      'CrossAttnUpBlock2D',                     │
│      'CrossAttnUpBlock2D',                     │
│      'CrossAttnUpBlock2D',                     │
│      'UpBlock2D']                              │
│ PROJECT_DIR: 'runs/mixup_256_256'              │
│ PROJECT_NAME: 'Makeup Transfer with Diffusion' │
│ TRAIN:                                         │
│   BATCH_SIZE: 4                                │
│   EMA_INV_GAMMA: 1.0                           │
│   EMA_MAX_DECAY: 0.9999                        │
│   EMA_POWER: 0.75                              │
│   GRADIENT_ACCUMULATION_STEPS: 8               │
│   GRAD_NORM: 1.0                               │
│   IMAGE_SIZE: 256                              │
│   LOG_INTERVAL: 20                             │
│   LR: 0.0001                                   │
│   LR_WARMUP: 1000                              │
│   MAKEUP: None                                 │
│   MAX_ITER: 700000                             │
│   MIXED_PRECISION: 'fp16'                      │
│   NOISE_SCHEDULER:                             │
│     BETA_END: 0.02                             │
│     BETA_START: 0.0001                         │
│     PRED_TYPE: 'epsilon'                       │
│     TYPE: 'linear'                             │
│   NUM_WORKERS: 4                               │
│   RESUME: None                                 │
│   ROOT: ['data/mtdataset/images']              │
│   SAMPLE_INTERVAL: 10000                       │
│   SAMPLE_STEPS: 1000                           │
│   SAVE_INTERVAL: 10000                         │
│   TEXT_LABEL_PATH: 'data/mt_text_anno.json'    │
│   TIME_STEPS: 1000                             │
│ _BASE_: 'base.yaml'                            │
╘════════════════════════════════════════════════╛
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1764872594.935304 3193079 gl_context_egl.cc:85] Successfully initialized EGL. Major : 1 Minor: 5
I0000 00:00:1764872595.112303 3193394 gl_context.cc:344] GL version: 3.2 (OpenGL ES 3.2 NVIDIA 580.95.05), renderer: NVIDIA GeForce RTX 3080 Ti Laptop GPU/PCIe/SSE2
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
encode: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 699/699 [04:50<00:00,  2.40it/s]
generate: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 700/700 [04:52<00:00,  2.39it/s]
encode: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 699/699 [04:51<00:00,  2.40it/s]
generate: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 700/700 [04:51<00:00,  2.40it/s]
  0%|                                                                                                                            | 0/2 [19:25<?, ?it/s]

We love the model and want to use it in production, but currently it's not reliable on real images despite working perfectly with our own masks on the training data.
Any guidance on correct preprocessing, mask expectations, or known issues would be immensely appreciated!
Thank you so much!
