
Say It, Spot It: IIT Bombay's New Model Decodes Earth Images With Natural Language
Taking a photo of your living room and asking an artificial intelligence (AI) tool to spot your cat or TV remote may seem like a cool trick. And you’d be surprised how often it gets it right. But for complex satellite or drone images, most state-of-the-art models have failed to identify objects reliably from natural language prompts. Until now!
Researchers from the Indian Institute of Technology Bombay (IIT Bombay), led by Prof Biplab Banerjee, have now developed a new model for natural language object identification in remote sensing images. Named Adaptive Modality-guided Visual Grounding (AMVG), the framework not only recognises what’s in an image but also understands what the user is asking, even when the prompts are ambiguous or contextual.
Take, for example, a command like ‘find all damaged buildings near the flooded river’. Humans can execute this command fairly reliably. But to scan hundreds of cluttered images within minutes, machines must be trained to analyse remote sensing data with similar or better accuracy. This process of training a computer to understand descriptions in everyday language and match them with details in images is called visual grounding or phrase grounding.
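To make the task concrete, here is a minimal, hypothetical illustration of what a grounding query and its answer look like in practice. The file name, phrase, and coordinates are invented purely to show the data format and are not drawn from the study.

```python
# Hypothetical example of a visual grounding query and its output.
# The input pairs an image with a plain-language phrase; the output is
# the set of image regions (bounding boxes) that match the phrase.
query = {
    "image": "flood_scene.tif",  # made-up satellite image file
    "phrase": "all damaged buildings near the flooded river",
}
answer = [
    {"box": (412, 130, 498, 210)},  # (x_min, y_min, x_max, y_max) in pixels
    {"box": (507, 145, 570, 220)},
]
```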
“Remote sensing images are rich in detail but extremely challenging to interpret automatically. While visual grounding has progressed significantly, current models fail to transfer effectively to remote sensing scenarios, especially when the commands are ambiguous or context-dependent,” explains Shabnam Choudhury, the study’s lead author and a PhD researcher at IIT Bombay.
With every passing year, the volume of such remote sensing data continues to grow exponentially. Captured from large distances above Earth (think of satellites, drones, aircraft), these images are cluttered with tiny objects, atmospheric noise, and scale variations. In these images, a building may look like a runway and a runway like a river. The IIT Bombay study, published in the ISPRS Journal of Photogrammetry and Remote Sensing, demonstrates how AMVG acts like a sophisticated translation system, interpreting prompts in everyday human language and identifying objects reliably.
But how did the IIT Bombay researchers achieve this feat? Choudhury explains that most models today employ a two-step method for visual grounding: first they propose candidate regions, and then they rank them. AMVG, on the other hand, rests on four key innovations: the Multi-modal Deformable Attention layer, the Multi-stage Tokenised Encoder (MTE), the Multi-modal Conditional Decoder, and the Attention Alignment Loss (AAL).
The first layer helps AMVG smartly prioritise regions relevant to the specific query instead of analysing every pixel equally. The second, the MTE, acts like a skilled refiner and interpreter, adapting to difficult prompts step by step while aligning visual features with the text description. The third, the Multi-modal Conditional Decoder, progressively refines the model’s search rather than making a single guess; think of it as a detective narrowing down suspects by eliminating possibilities. The fourth, and perhaps the most ingenious feature of the model, is a new training objective called the Attention Alignment Loss (AAL). This training technique sounds almost philosophical, acting as a teacher guiding a student’s focus.
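To picture how such a pipeline might be wired up, here is a simplified PyTorch sketch of a query-conditioned encoder-decoder of this kind. The module names, sizes, and wiring are assumptions made for illustration only; the authors’ actual implementation is the version released on GitHub.

```python
# A simplified, hypothetical skeleton of a text-conditioned grounding pipeline,
# loosely following the four-stage design described above. Not the authors' code.
import torch
import torch.nn as nn

class AMVGSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, num_queries: int = 1):
        super().__init__()
        # 1) Query-guided attention over image features: the text decides which
        #    spatial regions get looked at (stand-in for the deformable attention layer).
        self.visual_text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 2) Multi-stage encoder: repeated layers refine the fused image-text
        #    representation step by step (stand-in for the MTE).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers=3)
        # 3) Conditional decoder: learned object queries progressively narrow
        #    down the referred region (stand-in for the Multi-modal Conditional Decoder).
        self.object_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, num_heads, batch_first=True), num_layers=3)
        # Final head regresses a normalised box (cx, cy, w, h).
        self.box_head = nn.Linear(dim, 4)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # image_tokens: (B, N_pixels, dim); text_tokens: (B, N_words, dim)
        attended, _ = self.visual_text_attn(image_tokens, text_tokens, text_tokens)
        fused = self.encoder(torch.cat([attended, text_tokens], dim=1))
        queries = self.object_queries.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        decoded = self.decoder(queries, fused)
        return self.box_head(decoded).sigmoid()  # one predicted box per query

# Shape check with random features standing in for a real backbone's output.
model = AMVGSketch()
boxes = model(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
print(boxes.shape)  # torch.Size([2, 1, 4])
```

In this toy setup, the text tokens steer the attention over image tokens, and a single learned query is decoded into one bounding box, echoing the detective-style narrowing of the search described above.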
“Think of AAL like a coach. When a human reads ‘the white truck beside the fuel tank’, their eyes know where to look in an image. AMVG, being a machine, needs help developing that intuition. AAL does exactly that: it teaches the model where to look. If the model’s ‘attention’ drifts too far, AAL gently nudges it back,” Choudhury explains. Together, these four components enable AMVG to “see with context” and “listen with nuance”, setting it miles apart from previous such attempts.
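The paper’s exact formulation of AAL is not reproduced here, but the ‘coach’ idea can be illustrated with a generic attention-supervision loss: reward the model when its attention mass falls inside the region the phrase actually refers to, and penalise it when that mass drifts away. The function below is an assumption-laden sketch of that idea, not the authors’ loss.

```python
# Illustrative attention-alignment objective (a sketch, not AMVG's actual loss):
# attention placed outside the ground-truth region is penalised.
import torch

def attention_alignment_loss(attn_weights: torch.Tensor,
                             target_mask: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """
    attn_weights: (B, N) attention the model places on each image token
                  (non-negative, summing to 1 over N).
    target_mask:  (B, N) binary mask, 1 where the referred object lies.
    Returns a scalar that shrinks as attention concentrates on the target.
    """
    on_target = (attn_weights * target_mask).sum(dim=-1)  # attention mass inside the region
    return -torch.log(on_target + eps).mean()             # push that mass towards 1

# Toy check: attention concentrated on the target region yields a small loss.
attn = torch.tensor([[0.05, 0.90, 0.05]])
mask = torch.tensor([[0.0, 1.0, 0.0]])
print(attention_alignment_loss(attn, mask))  # about 0.105, since 90% of the mass is on target
```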
This is not just technological progress. The real-world implications of this model range from disaster response and military surveillance to urban planning and agricultural productivity.
“One of the most exciting applications for us is disaster response,” Choudhury notes. During a flood, earthquake, or wildfire, responders could simply ask the model: ‘Show damaged buildings beside the highway’ and receive precise coordinates. Similarly, even for army personnel trying to ‘spot camouflaged vehicles in dense terrain near the border’ or a farmer aiming to ‘find yellowing crop patches near irrigation lines’, the model can offer real-time insights.
Importantly, the researchers have open-sourced the entire model, making AMVG’s complete implementation publicly available on GitHub. This is a rare move in remote sensing research, say the researchers.
“While open-sourcing is becoming more common in the natural image visual grounding community, it’s still relatively rare in the remote sensing space. Many state-of-the-art RS models remain closed or only partially released, which slows down collective progress,” reveals Choudhury.
“Open-sourcing AMVG was a deliberate choice, and a deeply personal one too. We believe that real scientific impact happens when your work doesn’t just sit behind a paywall. By publishing our framework end-to-end, we’re hoping to encourage transparency, reproducibility, and rapid iteration in remote sensing-visual grounding research,” she adds.
Of course, no model is perfect. AMVG still depends on the availability of high-quality, annotated datasets. Its performance may vary across sensors or regions it hasn’t seen before. And although it’s more efficient than previous models, deploying it in real time or on edge devices will need further optimisation.
But the direction is clear. The team is already working on sensor-aware versions, on compositional visual grounding (e.g., ‘the small hut behind the blue tank near the tree’), and on large vision-language models that generalise across sensors, geographies, and tasks.
“Ultimately, we want to push toward a unified RS (remote sensing) understanding system that can ground, describe, retrieve, and reason about any image, in any modality, using natural language,” says Choudhury.
IIT Bombay’s AMVG is not just technically superior; it has the potential to make remote sensing truly accessible for large-scale, real-world applications. By bridging the gap between how we speak and how machines spot, it brings complex Earth observation tools within reach of those who need them most, moving us towards a world that is better planned and better prepared.