Microsoft’s Azure AI team has released a new vision foundation model called Florence-2 on Hugging Face. This model, available under a permissive MIT license, can handle various vision and vision-language tasks using a prompt-based representation. It comes in two sizes – 232M and 771M parameters – and has already shown exceptional performance in tasks such as captioning, object detection, visual grounding, and segmentation.
What sets Florence-2 apart is its unified, prompt-based approach to vision applications: a single model can be steered toward captioning, detection, grounding, or segmentation simply by changing the text prompt, as shown in the sketch below. This means enterprises no longer need to invest in separate task-specific vision models that rarely generalize beyond their primary function. By adopting Florence-2, companies can cut costs and streamline their vision pipelines.
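As a minimal sketch of what that prompt-based interface looks like in practice, the snippet below follows the usage pattern published on the Florence-2 model card on Hugging Face; the image URL is a placeholder, and the exact task tokens and generation settings may vary by use case.

```python
# Minimal sketch of Florence-2's prompt-based interface via Hugging Face transformers.
# Follows the pattern on the model card; the image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # 232M variant; "microsoft/Florence-2-large" is the 771M variant
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)

# The task is selected purely by the text prompt, e.g. "<CAPTION>" or "<OD>" (object detection).
prompt = "<OD>"
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# The processor's post-processing turns the generated token sequence into
# structured output (here: object labels and bounding boxes).
result = processor.post_process_generation(generated_text, task=prompt, image_size=image.size)
print(result)
```

Because the task is chosen by the prompt rather than by the network architecture, switching from object detection to captioning is a one-line change.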
Vision tasks are more demanding than text-only natural language processing because they require comprehensive perceptual ability. To achieve a universal representation of diverse vision tasks, a model must understand spatial information at different scales, from image-level concepts down to precise object locations and pixel-level detail, along with semantic granularity ranging from coarse labels to fine-grained descriptions. Microsoft faced two challenges in developing Florence-2: the scarcity of comprehensively annotated visual datasets and the absence of a unified pretraining framework with a single network architecture that integrates spatial hierarchy and semantic granularity.
To overcome these challenges, Microsoft generated a visual dataset called FLD-5B using specialized annotation models. The dataset contains 5.4 billion annotations for 126 million images, spanning high-level image descriptions down to specific regions and objects. On this data, Florence-2 was trained with a sequence-to-sequence architecture that pairs an image encoder with a multi-modality encoder-decoder, allowing the model to handle various vision tasks without task-specific architectural modifications.
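The "everything as a token sequence" idea is easiest to see with region outputs: according to the Florence-2 paper, box coordinates are quantized into a fixed number of bins and emitted as special location tokens, so detection and grounding reduce to ordinary text generation. The sketch below illustrates that quantization step; the bin count of 1,000 and the `<loc_N>` token format follow the paper's description, while the helper function itself is hypothetical.

```python
# Illustrative sketch (hypothetical helper): serializing a bounding box as the kind of
# location tokens described in the Florence-2 paper, so region tasks become text generation.
# Assumes 1,000 quantization bins and "<loc_N>" token names per the paper's description.
def box_to_location_tokens(box, image_size, num_bins=1000):
    """Quantize an (x1, y1, x2, y2) pixel box into discrete location tokens."""
    width, height = image_size
    x1, y1, x2, y2 = box
    normalized = [x1 / width, y1 / height, x2 / width, y2 / height]
    bins = [min(num_bins - 1, int(c * num_bins)) for c in normalized]
    return "".join(f"<loc_{b}>" for b in bins)

# A detection target like "car at (120, 60, 480, 300) in a 640x480 image" becomes:
print("car" + box_to_location_tokens((120, 60, 480, 300), (640, 480)))
# -> car<loc_187><loc_125><loc_750><loc_625>
```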
Despite its small size, Florence-2 punches above its weight. In a zero-shot captioning test on the COCO dataset, both the 232M and 771M versions of Florence-2 outperformed DeepMind's 80B-parameter Flamingo visual language model. They also beat Microsoft's own visual grounding-specific Kosmos-2 model. Additionally, when fine-tuned with publicly available human-annotated data, Florence-2 competed closely with larger specialist models on tasks such as visual question answering.
Both pre-trained and fine-tuned versions of the 232M and 771M Florence-2 models are available on Hugging Face under the permissive MIT license, which allows unrestricted distribution and modification for commercial or private use. Developers can leverage Florence-2 to eliminate the need for separate vision models for different tasks, saving development time and significantly reducing compute costs; a short illustration of covering several tasks with one checkpoint follows.
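As a concrete sketch of replacing several task-specific models with a single checkpoint, the snippet below reuses the `model`, `processor`, and `image` objects loaded in the earlier example and simply swaps the task prompt; the task tokens listed are among those documented on the model card, and other tokens are available for additional tasks.

```python
# Sketch: one Florence-2 checkpoint covering several vision tasks by swapping the prompt.
# Reuses `model`, `processor`, and `image` from the earlier loading example.
tasks = ["<CAPTION>", "<DETAILED_CAPTION>", "<OD>", "<DENSE_REGION_CAPTION>"]

for task in tasks:
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Each task's raw token output is parsed into its own structured result.
    print(task, processor.post_process_generation(text, task=task, image_size=image.size))
```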
In conclusion, Microsoft’s Florence-2 model is a game-changer in the field of AI vision. Its unified approach and exceptional performance make it an attractive choice for enterprises looking to streamline their vision applications. By eliminating the need for task-specific models, Florence-2 offers cost savings and improved efficiency. It will be exciting to see how developers utilize this model in their projects and the impact it will have on the AI industry as a whole.