Creating large, high-quality datasets for training AI models like Stable Diffusion often involves the tedious task of manually captioning images. Thankfully, advances in computer vision, particularly Microsoft's Florence 2, offer a compelling way to automate it. This article shows how to leverage Florence 2's image captioning capabilities to streamline caption generation for Stable Diffusion datasets.


The Power of Florence 2


Florence 2, a state-of-the-art vision-language model from Microsoft, excels at understanding and describing images with remarkable accuracy. Unlike many other models, it doesn't just produce generic captions; it generates detailed, context-aware descriptions, making it well suited to crafting effective prompts for Stable Diffusion.
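
Florence 2 selects its behavior through special task-prompt tokens, and its model card documents three captioning prompts of increasing detail: <CAPTION>, <DETAILED_CAPTION>, and <MORE_DETAILED_CAPTION>. As a quick sketch, the three levels can be compared side by side with the caption_image helper defined in the script below (the image path here is just a placeholder):

for task in ('<CAPTION>', '<DETAILED_CAPTION>', '<MORE_DETAILED_CAPTION>'):
    result = caption_image(task, './images/example.jpg')  # placeholder path
    print(task, result.get(task, ''))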


Implementation and Performance

Here's a basic Python script demonstrating how to use Florence 2 for image captioning:


import os
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

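# Florence 2 ships custom modeling code, so trust_remote_code is required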
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)

image_dir = './images'


def caption_image(task_prompt, image_path):
    # Convert to RGB so palette/alpha images (GIF, some PNGs) don't break the processor
    image = Image.open(image_path).convert('RGB')
    inputs = processor(text=task_prompt, images=image, return_tensors="pt")
    inputs = {key: value.to(device) for key, value in inputs.items()}
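    # Deterministic beam-search decoding (no sampling)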
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        early_stopping=False,
        do_sample=False,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
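    # post_process_generation strips the task token and returns a dict keyed by the task prompt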
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_answer


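# Caption every supported image in the directory and write its caption to a matching .txt file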
for filename in os.listdir(image_dir):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif')):
        image_path = os.path.join(image_dir, filename)
        result = caption_image('<MORE_DETAILED_CAPTION>', image_path)

        detailed_caption = result.get('<MORE_DETAILED_CAPTION>', '')

        print(detailed_caption)

        # Save the caption as a sidecar .txt file next to the image
        result_path = os.path.splitext(image_path)[0] + '.txt'
        with open(result_path, 'w', encoding='utf-8') as result_file:
            result_file.write(detailed_caption)
        print(f'Processed and saved: {filename}')

print('All images processed.')

This basic script reads every image from an "images" directory and saves each generated caption to a .txt file with the same base name, alongside the image.

Note: Windows users are strongly advised to download a pre-compiled wheel for flash_attn from https://github.com/Dao-AILab/flash-attention/releases. Building it from source on Windows can be difficult and is significantly slower.


Performance Observations


Initial tests on a single RTX 4090 showcased Florence 2's impressive speed: captioning images one at a time, it processed roughly 7,000 images per hour while using about 12GB of RAM. Given the resources used, increasing the batch size would very likely raise throughput further. For comparison, Florence 2 was at least 5-6x faster than InternLM-XComposer2-7B-Q4, another vision model capable of this task.
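
To illustrate what batching might look like, here is a minimal sketch, assuming the Florence 2 processor accepts lists of prompts and images the way most Hugging Face processors do; caption_batch is a hypothetical helper, and the real speedup will depend on available VRAM:

def caption_batch(task_prompt, image_paths, batch_size=8):
    # Hypothetical batched variant of caption_image; assumes the processor
    # stacks a list of identical prompts and images into one tensor batch.
    captions = []
    for start in range(0, len(image_paths), batch_size):
        images = [Image.open(p).convert('RGB') for p in image_paths[start:start + batch_size]]
        inputs = processor(text=[task_prompt] * len(images), images=images, return_tensors="pt")
        inputs = {key: value.to(device) for key, value in inputs.items()}
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            do_sample=False,
            num_beams=3,
        )
        texts = processor.batch_decode(generated_ids, skip_special_tokens=False)
        for text, image in zip(texts, images):
            parsed = processor.post_process_generation(
                text, task=task_prompt, image_size=(image.width, image.height))
            captions.append(parsed.get(task_prompt, ''))
    return captions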


Microsoft's Florence 2 presents an efficient and effective solution for automating image captioning, making it a valuable tool for anyone building datasets for Stable Diffusion or other AI models. Its speed, coupled with its ability to generate detailed captions, significantly reduces the manual effort involved in dataset preparation. This is a great sign for the future health of open source diffusion models.