{"id":10045,"date":"2026-04-07T13:06:31","date_gmt":"2026-04-07T13:06:31","guid":{"rendered":"https:\/\/musictechohio.online\/site\/frontier-models-medical-advice-x-rays-cant-see\/"},"modified":"2026-04-07T13:06:31","modified_gmt":"2026-04-07T13:06:31","slug":"frontier-models-medical-advice-x-rays-cant-see","status":"publish","type":"post","link":"https:\/\/musictechohio.online\/site\/frontier-models-medical-advice-x-rays-cant-see\/","title":{"rendered":"Frontier AI Models Are Doing Something Absolutely Bizarre When Asked to Diagnose Medical X-Rays"},"content":{"rendered":"<div>\n<p class=\"article-paragraph skip\">Hallucinations have plagued OpenAI ever since it launched its blockbuster ChatGPT chatbot back in 2022.<\/p>\n<p class=\"article-paragraph skip\">The propensity of large language models to sound both plausible and confident about outputs that are totally wrong continues to represent a major thorn in the sides of execs who claim the AI boom is <a href=\"https:\/\/futurism.com\/artificial-intelligence\/blinking-new-warning-sign-ai-industry\">both bigger and faster than the industrial revolution<\/a>.<\/p>\n<p class=\"article-paragraph skip\">The issue still haunts even the most sophisticated AI models today, a persistent issue <a href=\"https:\/\/futurism.com\/ai-industry-problem-smarter-hallucinating\">unlikely to be resolved any time soon<\/a> \u2014 <a href=\"https:\/\/futurism.com\/fixing-hallucinations-destroy-chatgpt\">if ever<\/a>, experts warn.<\/p>\n<p class=\"article-paragraph skip\">It\u2019s a particularly troublesome reality in a healthcare setting, from Google\u2019s AI Overviews <a href=\"https:\/\/futurism.com\/artificial-intelligence\/google-ai-overviews-dangerous-health-advice\">feature giving out dangerous \u201chealth\u201d advice<\/a> to hospitals deploying transcription tools that <a href=\"https:\/\/futurism.com\/the-byte\/whisper-nabla-hospital-ai-details-patients\">invent nonexistent medications and more<\/a>.<\/p>\n<p class=\"article-paragraph skip\">And when it comes to analyzing radiology scans \u2014 an application for AI long <a href=\"https:\/\/futurism.com\/neoscope\/google-healthcare-ai-makes-up-body-part\">championed<\/a> by its advocates in the healthcare industry \u2014 the situation becomes even more concerning.<\/p>\n<p class=\"article-paragraph skip\">As detailed in a <a href=\"https:\/\/arxiv.org\/pdf\/2603.21687\">new, yet-to-be-peer-reviewed paper<\/a>, a team of researchers at Stanford University found that frontier AI models readily generated \u201cdetailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided.\u201d<\/p>\n<p class=\"article-paragraph skip\">In other words, the AI models happily came up with answers to questions about a supposedly accompanying image \u2014 even if the researchers never even showed it an image.<\/p>\n<p class=\"article-paragraph skip\">As opposed to hallucinations, which involve AI models arbitrarily filling in the gaps within a logical framework, the team coined a new term for the phenomenon: \u201cmirage reasoning.\u201d<\/p>\n<p class=\"article-paragraph skip\">The effect \u201cinvolves constructing a false epistemic frame, i.e., describing a multi-modal input never provided by the user and basing the rest of the conversation on that, therefore changing the context of the task at hand,\u201d the researchers wrote in their paper.<\/p>\n<p class=\"article-paragraph skip\">The damning findings suggest AI models cheat by diving into the 
The damning findings suggest the AI models cheat by mining the data they *were* given and filling in the rest based on probability, even if it's almost entirely conjecture.

"What we try to show is that even on the best benchmarks, although a question would seem unsolvable for a human, the LLMs might still be able to leverage question-level and dataset-level patterns behind it and use general statistics and prevalence data to answer them right, while also learning to talk 'as if' they were seeing the image," coauthor and Stanford PhD student Mohammad Asadi told *Futurism*.

In other words, "we are underestimating how much information could be hidden in a sentence or a question if you (the LLM) are trained on all of the internet," he added. "To conclude, we believe that the AI models are able to use their super-human memory and language skills to hide their weaknesses in multimodal understanding (and by talking like [they] are actually doing multi-modal reasoning)."

Asadi and his colleagues are calling for an overhaul of existing benchmarks to avoid negative consequences, particularly "in medical contexts where miscalibrated AI carries the greatest consequence."

In one experiment, the team came up with a new benchmark consisting of visual questions across "medicine, science, technical, and general visual understanding," but with the images removed.

They found that all of the frontier models they tested, including OpenAI's GPT-5, Google's Gemini 3 Pro, and Anthropic's Claude Opus 4.5, confidently provided "descriptions of visual details."

"In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images," the researchers wrote in the paper.

In another experiment, the team challenged the AI models to "guess answers without image access, rather than being implicitly prompted to assume images were present," which resulted in a major hit to performance, suggesting the models fared much better when they weren't made aware they were missing vital data.

"Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided," the researchers wrote.

"The benchmark tested in the super-guesser experiment, ReXVQA, is actually one of the best and most comprehensive benchmarks for chest radiology available, spanning a wide range of tasks and questions," Asadi told *Futurism*.

To address the issue, Asadi argued that "improved benchmarks would need to be evaluated more rigorously." However, that could prove difficult, as "on some level, every benchmark will inevitably become susceptible to this over time, since the test set questions might leak into the large [pretraining] data the moment they appear on the internet."

Asadi and his colleagues came up with a new framework, dubbed "B-Clean," which involves identifying and removing any "compromised questions, including, but not limited to, vision-independent, prior knowledge answerable, and data-contaminated questions." The idea is to ultimately test models on the remaining questions that "none of the candidate models could answer without visual input, enabling a fair, vision-grounded comparison."
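A rough sketch of the filtering idea behind B-Clean as described above; the paper's actual procedure is more involved. The idea: drop any question that at least one candidate model answers correctly *without* the image, keeping only the vision-grounded remainder. The question fields and the text-only query functions here are hypothetical stand-ins for whatever benchmark format and model APIs are in use.

```python
# Hypothetical B-Clean-style filter: keep only questions that no candidate
# model can answer correctly from the question text alone.
from typing import Callable

def filter_vision_grounded(
    questions: list[dict],               # each: {"prompt": str, "answer": str}
    models: list[Callable[[str], str]],  # text-only query functions, one per model
) -> list[dict]:
    kept = []
    for q in questions:
        # A question is "compromised" if any model gets it right blind --
        # via prior knowledge, dataset-level statistics, or contamination.
        answerable_blind = any(
            model(q["prompt"]).strip().lower() == q["answer"].strip().lower()
            for model in models
        )
        if not answerable_blind:
            kept.append(q)
    return kept
```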
While Asadi admitted that it's "hard to discuss every possible real-world implication," it's an alarming finding that comes as hospital execs continue to [push for replacing radiologists with AI](https://www.crainsnewyork.com/health-care/cny-health-care-ceo-forum-20260325/).

If "deployed without sufficient guardrails in place, this might result in alarming false positives at any instance where there is a failure in the multimodal processing, especially in the currently growing 'agentic systems' in which such a mistake from a small model could propagate through the whole system and cause unforeseen outcomes," Asadi told *Futurism*.

It's part of a much broader breakdown in trust when it comes to handing over high-risk tasks to AI.

"Another implication is that, now that we know an AI can say 'I see evidence of malignant melanoma on your skin' without even having access to any images, how much can we trust it when it says the same while actually seeing the image?" Asadi posited. "We definitely need more effort being put in safety and alignment of such models, and might need to think twice before deploying them in user/patient-facing systems."

"On a high level, I would say our message is that although AI is great, its superhuman capabilities in some skills (such as language) should not be mistaken for an ability in other tasks," he concluded. "The number one [takeaway] would be that just because the AI is saying, very convincingly, that it is seeing something, it doesn't mean that it is actually seeing that."

**More on AI and radiology:** [Doctors Horrified After Google's Healthcare AI Makes Up a Body Part That Does Not Exist in Humans](https://futurism.com/neoscope/google-healthcare-ai-makes-up-body-part)