New LLMs can pass medical exams - should human doctors be worried?

Progress in large language models (LLMs) has been rapid lately and, I suspect, is moving faster than our understanding of what these models are really capable of. OpenAI's GPT-4 has exhibited evidence, as Microsoft researchers have documented, of a deeper world-model understanding than even GPT-3.5, which is scary as well as exhilarating.

For the application of helping physicians in practice, an enterprising startup has put out a chat-based app, Nabla, that promises to help physicians with their chart notes. I am not sure the underlying LLM is mature enough for this application. First of all, the software runs on a cloud server, and this is always a concern. The company claims it is "HIPAA-eligible" and "GDPR-compliant," but it will still have to be approved by hospital or clinic security before it can be deployed. From what I can see, it outputs rather simple statements based on patient input, and seems akin to a voice dictation system that can pad snippets into complete sentences. It won't create the kind of chart notes I am accustomed to generating, especially the Assessment and Plan section, which depends on knowledge of the literature and interpretation of clinical findings and lab results, and sets down my line of thinking. So far, I've not encountered software that will save me that effort. Because this software isn't asked to be creative, there is probably little risk of hallucination or the other unwanted side effects of more complex generative chat.

Never before has a physician dictated a chart note containing confidential and sensitive information to a startup corporate entity. Since protected health information will be exchanged, will each user's input be stored for use in a future training set? If so, how is protected information removed or redacted?
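
That last question, how protected information gets removed, is one a clinic can partly take into its own hands. Below is a minimal sketch, in Python, of the kind of client-side redaction a security office might insist on before dictated text ever leaves the local network. The patterns and tags are my own illustrative assumptions, not anything Nabla has published, and real de-identification requires far more than a few regular expressions.

```python
import re

# Illustrative only: strip a few obvious identifiers from dictated text
# before it is sent to any cloud service. Real de-identification must
# also handle names, addresses, rare diagnoses, and free-text context.
PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with a bracketed tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

if __name__ == "__main__":
    note = "Patient DOB 03/14/1962, MRN: 00123456, callback (555) 867-5309."
    print(redact(note))
    # -> Patient DOB [DOB], [MRN], callback [PHONE].
```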

In the area of expert systems, Google has made great strides using LLMs. However, it has been recognized that:

The problem is that the medical domain is a special one. In contrast to other fields, it raises different issues and even greater safety concerns. As we have seen, models like ChatGPT can hallucinate and spread misinformation.

In machine learning, the performance of a model is compared to human-level performance, while human performance itself is measured against a theoretical optimum: the lowest error any classifier could possibly achieve, known as the Bayes optimal error. What the AI developer aims for is a model whose error rate is lower than the human error rate and as close as possible to that Bayes error.
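
To make that concrete, here is a toy calculation in Python with made-up numbers (my own assumptions, not figures from Google). The point is only that the Bayes error is a floor neither humans nor models can beat, and the developer's goal is to push the model's error below the human error and toward that floor.

```python
# Toy numbers for one diagnostic task; all three figures are assumed.
bayes_error = 0.02   # irreducible error, the theoretical floor
human_error = 0.05   # e.g., a panel of clinicians on the same question set
model_error = 0.04   # a hypothetical model on the same question set

# "Avoidable bias": how far the model still sits above the theoretical floor.
avoidable_bias = model_error - bayes_error

print(f"Model beats human error by {human_error - model_error:.2%}")
print(f"Remaining gap to the Bayes optimum: {avoidable_bias:.2%}")
```
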
Google has been working with a whole slew of language models, but the top performers are the ones based on PaLM and FLAN. These models have been tested side by side: while Flan-PaLM had the edge on exam-style questions, Med-PaLM scored higher on the kinds of questions consumers are likely to ask. This might be because PaLM was trained on sources like Wikipedia and social media conversations.
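
As a rough sketch of what "tested side by side" involves, the Python below scores two stand-in model callables on an exam-style set and a consumer-style set. The questions, answers, and lambda "models" are placeholders of my own, not Google's evaluation harness, which uses large benchmarks such as MedQA and panels of human raters.

```python
from typing import Callable, Dict, List, Tuple

# Tiny illustrative question sets; real benchmarks have thousands of items.
exam_questions: List[Tuple[str, str]] = [
    ("Which electrolyte disturbance classically causes peaked T waves?", "hyperkalemia"),
]
consumer_questions: List[Tuple[str, str]] = [
    ("Is it safe to take ibuprofen if I have a stomach ulcer?", "no"),
]

def accuracy(model: Callable[[str], str], items: List[Tuple[str, str]]) -> float:
    """Fraction of questions whose expected answer appears in the model's reply."""
    hits = sum(expected.lower() in model(q).lower() for q, expected in items)
    return hits / len(items)

def compare(models: Dict[str, Callable[[str], str]]) -> None:
    for name, model in models.items():
        print(f"{name}: exam={accuracy(model, exam_questions):.0%}  "
              f"consumer={accuracy(model, consumer_questions):.0%}")

if __name__ == "__main__":
    # Stand-in "models" so the sketch runs without any API access.
    compare({
        "exam_strong_model":     lambda q: "Hyperkalemia causes peaked T waves.",
        "consumer_strong_model": lambda q: "No, ibuprofen can worsen a stomach ulcer.",
    })
```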

Still, it is remarkable that these LLMs can answer board-style exam questions at all.

But although these models can pass medical licensing board exams, I don't feel they are ready to be deployed in the clinic.
I've not seen much written about whether these models share the problems reported with other LLMs such as ChatGPT and GPT-3/4: hallucination, bias, and toxicity. I also have questions about how to "edit" the information that trains the 540 billion parameters of PaLM. For example, if you train it on a medical document that is later found to be erroneous or false, how do you remove that knowledge from the model so that it doesn't make decisions based on that information again? How does one update the model with new information? Training is a time-intensive process, and a model this large requires hardware not readily available in a doctor's office. Smaller models might provide "good enough" accuracy with reasonable training time, which matters because in the medical world training needs to occur regularly. A model like Flan-PaLM might beat a human oncologist today, but a few years later an expert oncologist might defeat a model that has not been updated, retrained, and revalidated.
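
As far as I know, there is no reliable way today to surgically delete one fact from a trained model ("machine unlearning" is still a research problem). In practice the options are corpus curation followed by scheduled retraining, or fine-tuning a smaller model on a cleaned dataset. The Python sketch below is my own illustration of that workflow, with a placeholder for the actual, expensive training step.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Document:
    doc_id: str
    text: str

def curate(corpus: List[Document], retracted_ids: Set[str]) -> List[Document]:
    """Drop documents that have been retracted or found to be erroneous."""
    return [doc for doc in corpus if doc.doc_id not in retracted_ids]

def retrain(corpus: List[Document]) -> None:
    # Placeholder: in reality this is the costly step (fine-tuning a smaller
    # domain model, or full retraining on datacenter-class hardware).
    print(f"Retraining on {len(corpus)} curated documents...")

if __name__ == "__main__":
    corpus = [
        Document("pmid-0001", "Large trial supporting drug A for condition X."),
        Document("pmid-0002", "Retracted study overstating the benefit of drug B."),
    ]
    retracted = {"pmid-0002"}           # e.g., from a retraction-watch feed
    retrain(curate(corpus, retracted))  # run on a fixed schedule, say quarterly
```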

Right now, it appears that companies like Google see their models deployed in the consumer space, helping with diagnostics and providing answers to basic health questions. While I applaud these efforts, I would like to see some effort made to help the beleaguered clinician who has to parse mountains of new data each month. Another worthy aim for AI would be to complement human clinicians and serve as a true "peripheral brain," as we used to call the software apps on our PDAs of years past.