Ghada Zamzmi is an artificial intelligence researcher and regulatory science expert, having previously worked for both the National Institutes of Health (NIH) and the Food and Drug Administration (FDA) before joining HeartFlow as Regulatory Science Principal, where she works on artificial intelligence and regulatory science for the development of cardiac AI technologies. She is also one of the organisers of today's BRIDGE workshop, which aims to connect the fields of AI, medical imaging, and regulatory science. We talked to Ghada about her interest in regulatory science, her workshop, and her advice to new researchers in medical imaging.
So, my background is actually in AI development; I did my undergrad in computer science, so I love coding, I like to build models, I like to train models. And I did my grad school in machine learning for healthcare, where we focused on building multimodal systems for estimating pain in babies and infants in the neonatal intensive care unit (NICU). We were analyzing facial expression, body movements, and voice, and combining all these signals to try to understand if the baby will have pain in the next 3-4 hours. I worked directly with nurses and physicians, and that started to shift me toward real-world challenges.
So, when we started working on this project, we used a real-world yet relatively small dataset; we built a multimodal AI system, I trained it, tested it, and the results looked good to me. Then we tested it in the NICU, and things changed. This was the first time I realized that data drift, or data shift, is a huge problem: we think the model is working, but when we start deploying the model, it starts to drift.
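To make the idea of drift concrete, here is a minimal, hypothetical sketch, not the NICU system itself, of one common way to flag it: compare each input feature's training distribution against recently collected deployment data with a two-sample Kolmogorov-Smirnov test. The function name, the synthetic data, and the 0.01 threshold are all illustrative assumptions.

```python
# Minimal sketch: flag input features whose deployment distribution has shifted
# away from the training distribution (one simple signal of data drift).
import numpy as np
from scipy.stats import ks_2samp

def flag_drifted_features(train_features, deployed_features, alpha=0.01):
    """Return (feature_index, KS statistic) for features that shifted significantly."""
    drifted = []
    for j in range(train_features.shape[1]):
        stat, p_value = ks_2samp(train_features[:, j], deployed_features[:, j])
        if p_value < alpha:
            drifted.append((j, stat))
    return drifted

# Synthetic example: feature 0 shifts after deployment (e.g. a protocol change).
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 3))
deployed = rng.normal(0.0, 1.0, size=(200, 3))
deployed[:, 0] += 0.8  # simulated acquisition/protocol change
print(flag_drifted_features(train, deployed))  # expect feature 0 to be flagged
```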
After graduating, I joined the NIH and continued working on something similar. We focused on building multimodal AI systems for different applications. One of them was predicting the severity of sickle cell disease in Black African patients. We built a multimodal system that analyzed imaging biomarkers, blood tests, and patient clinical records, and combined all of them to generate a risk score. We submitted a patent for this multimodal system, which was working very nicely.
NIH has (or at least used to have) ongoing collaborations with clinics in Africa, where we shared models or prototypes and sometimes received data in return. These partnerships were incredibly valuable in helping NIH researchers assess the generalizability of their models. At one point, we had a model that we believed was working well, so we decided to test it on data from one of these clinics. That experience made the challenges of real-world deployment crystal clear to me, and opened my eyes to the reality that we can develop the best state-of-the-art model that performs well on hundreds of patients in one hospital or country, but when tested elsewhere, its performance can begin to drop due to deployment challenges such as differences in clinical protocols, imaging acquisition methods, formatting and standardization, computational resources, and human-related factors. These issues are frequently overlooked during development.
In 2022, I made the decision that while I enjoy building multimodal AI systems, I need to shift my focus toward understanding why and when models fail, and how to ensure they work reliably, safely, and equitably when deployed across diverse, real-world settings. That realization is what led me to join the regulatory science AI group at the FDA.
I think performance is actually only one small component that we look at when we review medical AI products for safety. The most important part is to fully understand the risk profile of the technology. When it comes to building a new technology, there are different components that contribute to risk, and we need to understand the risk profile for this technology and then develop a plan to mitigate that risk. When I was reviewing a medical AI product, I had to understand the risk profile of the technology as a whole, and the risk profile of each part of the technology individually, and then try to understand how this will impact patients or the end user.
Take, for example, an AI model designed to triage patients with suspected cardiac conditions using multimodal data combining echocardiograms, EHR information, and vital signs. This system might include an image quality checker for the echo scans, an ML model that estimates cardiac function (like ejection fraction or valve abnormalities), and a triage module that prioritizes cases based on urgency. Each of these components introduces risks, and each one must be validated, its risks measured, and its potential failure modes anticipated. If the quality checker fails to flag a poor-quality echo, maybe due to technician variability, the cardiac diagnosis could be off. If the diagnostic model was trained mostly on middle-aged patients, it might not generalize to older adults or children. And if the triage algorithm prioritizes a stable patient over someone in acute distress due to missing data or edge cases, care could be dangerously delayed.
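As a rough illustration of that component-level view, here is a hypothetical sketch of such a pipeline with explicit gates for the failure modes she describes. The component names, thresholds, and rule-based triage logic are stand-ins, not any real product; each real component would be a validated model with its own risk analysis.

```python
# Hypothetical modular triage pipeline: quality gate -> missing-data gate -> urgency rule.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EchoCase:
    image_quality: float                 # 0..1 score from a quality-checker model
    ejection_fraction: Optional[float]   # % estimated by a cardiac-function model
    heart_rate: Optional[int]            # from vital signs / EHR

def triage(case: EchoCase) -> str:
    # 1) Quality gate: a missed low-quality scan can silently corrupt everything downstream.
    if case.image_quality < 0.6:
        return "human review: echo quality too low for automated reading"
    # 2) Missing-data gate: never let absent inputs quietly default a patient to "routine".
    if case.ejection_fraction is None or case.heart_rate is None:
        return "human review: incomplete inputs"
    # 3) Simple urgency rule standing in for a learned triage model.
    if case.ejection_fraction < 35 or case.heart_rate > 120:
        return "urgent"
    return "routine"

print(triage(EchoCase(image_quality=0.9, ejection_fraction=30.0, heart_rate=95)))  # urgent
print(triage(EchoCase(image_quality=0.4, ejection_fraction=55.0, heart_rate=80)))  # human review
```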
Even when it comes to assessing accuracy, we need to ask: what does 99% actually mean in a clinical context? Does that mean 1 in 100 patients is misclassified? In some applications, one missed case can mean irreversible harm. We also need to ask: what's the right metric for the task? AUC? Sensitivity? NPV? Do they reflect clinical relevance and priorities? And how do we determine acceptable thresholds of performance? Should AI match expert performance? Or align with patient outcomes?
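A quick back-of-the-envelope example shows why the metric choice matters. With made-up numbers (1,000 patients, only 10 of whom truly have the condition), a model can report 99% accuracy while missing 9 of the 10 sick patients:

```python
# Illustrative confusion-matrix counts for 1,000 patients, 10 with the condition.
true_positive = 1      # sick patients the model catches
false_negative = 9     # sick patients the model misses
true_negative = 989    # healthy patients correctly cleared
false_positive = 1     # healthy patients incorrectly flagged

accuracy = (true_positive + true_negative) / 1000                  # 0.99
sensitivity = true_positive / (true_positive + false_negative)     # 0.10
npv = true_negative / (true_negative + false_negative)             # ~0.991

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.2f}, NPV={npv:.3f}")
# 99% accuracy, yet 9 of 10 sick patients are missed; for a rule-out use case,
# sensitivity and NPV, not accuracy, are the numbers that reflect patient risk.
```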
Also, if the device includes an explainability component, it must be carefully evaluated to ensure it benefits the user rather than confusing or misleading them. We need to assess whether the explanation aligns with clinical reasoning, increases trust appropriately, and supports better decision-making. We also need to ensure the device performs well across all relevant subgroups, such as different age groups, sexes, comorbidity profiles, or imaging modalities. Every claim in the intended use statement must be supported by solid evidence. If such evidence is missing, additional data must be collected before deployment. Finally, as overall and subgroup performance is not static but evolves over time, mechanisms should be in place to monitor performance continuously and address emerging risks as they arise.
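To illustrate the subgroup point, here is a minimal sketch that breaks the same hypothetical predictions out by an illustrative age-group attribute; the data and column names are invented. Overall sensitivity can look acceptable while one subgroup is clearly underserved.

```python
# Minimal subgroup reporting sketch with hypothetical predictions.
import pandas as pd

df = pd.DataFrame({
    "age_group": ["<40", "<40", "40-65", "40-65", "65+", "65+", "65+", "65+"],
    "y_true":    [1,      0,     1,       0,       1,     1,     0,     1],
    "y_pred":    [1,      0,     1,       0,       0,     1,     0,     0],
})

def sensitivity(group: pd.DataFrame) -> float:
    positives = group[group["y_true"] == 1]
    return float((positives["y_pred"] == 1).mean()) if len(positives) else float("nan")

print("overall sensitivity:", sensitivity(df))          # 0.60 overall
for name, group in df.groupby("age_group"):
    print(name, sensitivity(group))                     # 65+ drops to ~0.33
```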
That's why we need more than accuracy; we need comprehensive evaluation that goes beyond performance metrics, robust risk assessment and mitigation strategies, and full transparency to ensure the safe deployment of AI models.
I don't think there is anything that's fully safe. But we have to be transparent because we need to communicate to the end-user, and we need to approve products with specific labeling. It's similar to drugs. There is no single drug that doesn't have side effects. But as a patient, I want to understand what I am taking. So, accuracy is one small aspect, but we have to have a full understanding of the risk profile.
I think what we need right now, especially with AI regulations still new and evolving, is to start building bridges, not just between regulators in different countries, but also across the research community and the industry leading the development of these technologies. That's exactly the goal of the BRIDGE workshop.
I don't think it's reasonable to aim for identical regulations across countries. Patient needs and healthcare systems vary, and we should respect those local differences. But at the same time, we do need a set of shared global principles to help us move forward. If we want to build AI technologies that are safe, trusted, and usable worldwide, we need some level of alignment.
Developing a medical AI product is a long and complex process. It requires careful system design, thoughtful data selection, model optimization, human-in-the-loop workflows, and strong collaboration across teams. If developers are forced to significantly alter their processes to meet vastly different regulatory requirements in each region, it makes the entire effort harder, slower, more expensive — and in many cases, not scalable. That's a real slowdown of innovation.
So how do we create AI regulations that are flexible, yet still ensure global safety, effectiveness, and trust — while respecting local needs? We need to build bridges. That starts by opening channels of dialogue, not just among regulators across countries, but also across the research community, healthcare practitioners, and industry. We need shared principles and global guidance on what makes data representative and regulatory-ready, what defines a safe system, and what determines whether a model is ready for deployment. We need harmonized terminology, evaluation criteria, and aligned goals, even if the implementation details vary across regions.
Because if we don't coordinate, we risk creating fragmented solutions that can't scale, and can't be trusted globally. Global AI development requires global conversations, and shared commitments to innovation that puts patient safety, equity, and transparency first.
Some argue that regulation slows innovation, but I honestly believe it's the opposite. The end goal of regulation is safety; any kind of regulation is built to ensure the end user's safety. Without strong AI regulations, we risk building products that don't work as intended or can't be deployed safely. If we only focus on building models without understanding deployment challenges, anticipated risks, or regulatory expectations, we'll keep creating systems that no one can actually use, which is a waste of time and money. And if we can't scale our technologies globally because of misalignment, isn't that the real barrier to innovation?
This workshop came to my mind when I was in Morocco last year. On the last day of MICCAI, it was pointed out that the majority of papers were focused on development, with very few addressing evaluation. And when it comes to reporting the performance of different AI technologies, most people report a single number, with very few including uncertainty or deviation metrics. We cannot fully understand the performance of a device from a single number.
And there is growing pressure on regulatory agencies to develop new guidelines that keep up with emerging technologies. When I was at MICCAI, many people stopped me and asked why we don't have regulations for LLMs or generative AI. And I was thinking, how can we regulate something we don't yet fully understand or know exactly how to measure?
Fully understanding how to evaluate new technology, or the risk profile of emerging AI systems is not something regulators can solve on their own. They need to work with AI researchers, collaborate with scientists, and engage with the developers building these technologies in industry. That's why I structured the workshop around those three key groups: academia, regulation, and industry. If we truly want to develop meaningful solutions, we need all of them at the table, working together.
I really want people to sit together and talk, because at the end of the day, we can publish thousands of papers, but if you ask me, only a small fraction of those papers will ever translate into real products that can actually be deployed and used safely in clinical settings.
If we truly want to deliver innovative AI solutions into clinical practice and benefit patients, we have to move beyond research for the sake of research. We need to deliver technologies that are safe, effective, and usable, and the only way to do that is by bridging the gaps between academia, industry, and regulatory bodies. The people developing the technology need to collaborate with those evaluating it, and both need to engage with regulators to ensure that AI systems can actually reach patients.
And we can't forget the end users, the clinicians and patients these technologies are meant to serve. That perspective is critical. While it's not part of BRIDGE this year, maybe next year we'll go even further and invite patients themselves to join the conversation.
Yeah, LLMs aren't all bad. Honestly, new technologies can really help regulation if we use them carefully and get them right. For example, since LLMs chat naturally, we could have an LLM break down a device's risk profile for a doctor or patient: explain side effects, compare risks, or even flag "hey, watch out for this group." That'd be super helpful.
But if we're thinking "let's use an LLM as the main engine for treatment recommendations," we have to pause. LLMs can hallucinate, they sometimes make things up. We can't trust them for core clinical decisions until we've nailed hallucination detection and verification.
Still, carefully using an LLM as a communication layer, while keeping humans in the loop, is a good way to boost transparency and understanding of AI model behavior and risks.
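As a sketch of that "communication layer" pattern, and only a sketch, the LLM below is constrained to rephrasing a structured, pre-approved risk profile, and its output passes through a human reviewer before release. The `call_llm` function, the prompt wording, and the example profile are placeholders, not a real API or product.

```python
# Hypothetical "LLM as communication layer" pattern with a human-in-the-loop gate.

def call_llm(prompt: str) -> str:
    # Placeholder only: a real implementation would call an actual LLM service.
    return "Plain-language summary of the risk profile goes here."

def draft_risk_explanation(risk_profile: dict, audience: str) -> str:
    # Constrain the model to pre-approved content: it may rephrase, not add.
    prompt = (
        f"Explain the following medical-AI risk profile to a {audience} in plain language. "
        f"Do not add information that is not listed.\n{risk_profile}"
    )
    return call_llm(prompt)

def release_explanation(draft: str, reviewer_approves) -> str | None:
    # Human-in-the-loop gate: nothing reaches the end user without clinician sign-off.
    return draft if reviewer_approves(draft) else None

profile = {"intended_use": "triage support", "known_limitation": "less data for patients over 75"}
draft = draft_risk_explanation(profile, audience="patient")
print(release_explanation(draft, reviewer_approves=lambda text: True))
```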
So I'm very excited about South Korea. It's gonna be my first time there, and I'm really excited about it. I really appreciate that they hold these conferences in a different place every year.
I'm sure we will see a lot of papers related to foundation models, because there's a lot of excitement around them, and it's actually a very interesting technology. I think it's one of the most important technologies that will help us move forward in the field. I think foundation models are similar to how we humans think.
We don't learn narrow tasks all the time. We tend to learn a specific baseline, but then we generalize. So, it's very interesting, and I know there is now a lot of interest in using foundation models for medical imaging tasks, but this is also something that I would be so interested to talk with people about: how do we regulate something like this? I'm also interested to talk with researchers working on continual learning, LLMs, and agentic AI; these represent different regulatory challenges, and I'd love to discuss potential solutions for evaluating and regulating them.
Foundation models and generative AI really exacerbate what I think is a core regulatory challenge with AI: its heterogeneity. By their very nature, these models can perform multiple functions, presenting a vast array of different risk profiles. This multifunctionality means that our usual prescriptive, one-size-fits-all regulation approaches won't work. Even when multiple product codes can be assigned to a device, our current evaluation frameworks struggle with assessing how deeply interconnected functions in AI systems interact and create emergent behaviors. The challenge is ensuring our assessment methods can actually capture the complex risk profiles these systems present.
And if we want to move forward in using this technology, which I think we should, we need to evolve our current regulatory frameworks. For continual-learning systems, the regulatory challenge becomes even more complex: how do you approve a device that's designed to change its behavior over time, which can lead to new risks? The FDA currently requires models to be locked before deployment, but continual-learning systems are designed to evolve based on new data. This creates fundamental questions: if the model learns from biased local data, could it drift toward discriminatory behavior? What is the new risk profile? How do we ensure these models maintain safety and efficacy? What happens if the learning algorithm encounters cases that cause catastrophic forgetting of previously learned, critical medical knowledge? We'd essentially be approving a device that we know will have a new risk profile over time. While the FDA does handle software updates and enables incremental changes to a model through a Predetermined Change Control Plan (PCCP), this level of continuous, autonomous change is different and introduces new regulatory-science challenges.
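One way to picture a guardrail for such a system, purely as an illustration, is an update-acceptance check: before an updated model replaces the deployed one, re-score both on a locked set of previously validated critical cases and on each subgroup, and reject updates that regress. The `evaluate` interface, the model/dataset handles, and the 0.02 margin below are assumptions, not an FDA-prescribed mechanism.

```python
# Hypothetical update-acceptance gate for a continually-learning system.

def accept_update(evaluate, old_model, new_model, critical_set, subgroups, max_drop=0.02):
    """evaluate(model, dataset) -> a score in [0, 1], e.g. sensitivity."""
    # 1) Guard against catastrophic forgetting on known critical cases.
    if evaluate(new_model, critical_set) < evaluate(old_model, critical_set) - max_drop:
        return False
    # 2) Guard against drift toward worse performance on any subgroup.
    for subset in subgroups.values():
        if evaluate(new_model, subset) < evaluate(old_model, subset) - max_drop:
            return False
    return True

# Example with stubbed scores standing in for real evaluations.
scores = {("old", "critical"): 0.95, ("new", "critical"): 0.90}
evaluate = lambda model, dataset: scores[(model, dataset)]
print(accept_update(evaluate, "old", "new", "critical", subgroups={}))  # False: forgetting detected
```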
And with agentic AI systems that can make autonomous decisions and take actions, we're looking at new categories of risk. Moving toward technologies that don't just provide recommendations but take actions, such as autonomously adjusting treatment protocols, modifying drug dosages, or even triggering emergency interventions, entails a lot of risks and complicates regulations.
So, the current approach of evaluating static, single-task devices with locked algorithms might not be able to adapt to emerging technologies that are moving toward general and autonomous models, and maybe we need to start thinking about developing new regulatory-science-driven frameworks for multi-task, adaptive, and autonomous systems. We should start to evolve our thinking, but we also need to think carefully about how we're going to regulate something like this while ensuring safety and efficacy; this is something I look forward to discussing with researchers at MICCAI in South Korea.
I think it would be really valuable for attendees to approach MICCAI with a more critical mindset. This is actually one of the most important lessons I learned during my own journey as a student. When you're early in your career and attend conferences like this, there's a natural tendency to be deferential to established researchers and prominent figures in the field. You see these well-known researchers presenting their work, and there's this assumption that everything they're doing represents the best direction for the field.
What I'm advocating for is thinking independently. Yes, these are exceptional researchers conducting important work, but we need to maintain our critical thinking abilities. The danger of accepting everything at face value is that it leads to what I'd call "follow-up research" - work that builds on existing approaches without questioning their basic assumptions or limitations.
The most significant breakthroughs have come from researchers who identified real gaps - not just technical gaps, but bigger picture ones. It's really challenging to recognize these opportunities when we're caught up in current trends or too influenced by what the most visible researchers are working on.
My advice is this: step back from the immediate excitement of new techniques and trending topics. Ask yourself: what fundamental problems in medical imaging and AI remain genuinely unsolved? What assumptions are we making that might be constraining our thinking? Where are the disconnects between what we're optimizing for in our research and what actually matters for clinical outcomes and patient care?
The field needs researchers who can think independently about where we should be going, not just those who can execute well on where we currently are. That kind of perspective requires developing confidence alongside technical skills, and of course a lot of multidisciplinary discussion and collaboration.