This hobbit came across this article in The Straits Times recently, “How did DeepSeek build its AI with less money?” by Cade Metz. The original article was first published in The New York Times on 12 Feb 2025.
Some points mentioned in this article hold lessons for us working in healthcare and are certainly worth mulling over by the big shots who design and implement our healthcare systems and policies.
The overarching theme of DeepSeek’s success was that it achieved just as much by using less. The big American AI companies typically used 16,000 specialised chips (i.e. Graphics Processing Units, or GPUs, produced mostly by Nvidia) to train their LLM (Large Language Model) chatbots. DeepSeek used only about 2,000. In doing so, it saved a lot of resources, not just chips but energy as well, because these chips consume a fiendish amount of energy, and sending data between them consumes even more. Such activities release a huge amount of heat in the process. These chips are housed in huge data centre buildings that produce so much heat that they need another building just to cool the data centre building.
The article claimed that DeepSeek’s “engineers needed only about US$6M in raw computing power, roughly one-tenth of what Meta spent in building its latest AI technology”. It is no exaggeration to say that DeepSeek has demonstrated a quantum leap in efficiency that has completely changed the game. Here are a few lessons we can learn from the development of DeepSeek that we can perhaps consider for healthcare:
Lesson 1 – Spread out the work, pair the expert with the generalist
The first strategy and technology DeepSeek employed was to use a method called “mixture of experts”.
“Traditional” (if there is such a word) AI companies employed a single neural network to learn literally everything under the sun. This monolithic approach takes up a lot of chips, time and energy. The designers of DeepSeek split the system into many neural networks, each learning one area of expertise. Each smaller neural network concentrated on one particular field. In itself, this is nothing special. What made DeepSeek special was that the designers then paired these specialist neural networks with a “generalist” system, which helped to coordinate interactions between the many expert networks.
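To make the idea concrete, here is a toy sketch in Python of how a “generalist” router can hand each query to a single specialist. The expert names and the keyword-based gating rule are purely illustrative assumptions; a real mixture-of-experts model learns its routing, and DeepSeek’s actual architecture is far more sophisticated than this.

```python
# A toy sketch of the "mixture of experts" idea: a generalist gate
# routes each query to one specialist, so only that specialist does
# any work. Expert names and keyword rules are illustrative only.

EXPERTS = {
    "maths": lambda q: f"[maths expert] answering: {q}",
    "code": lambda q: f"[code expert] answering: {q}",
    "general": lambda q: f"[generalist] answering: {q}",
}

def gate(question: str) -> str:
    """A stand-in 'generalist' router that picks which expert handles
    the query. Real systems learn this routing; here we use simple
    keyword matching for illustration."""
    q = question.lower()
    if any(w in q for w in ("sum", "integral", "number")):
        return "maths"
    if any(w in q for w in ("python", "bug", "function")):
        return "code"
    return "general"

def answer(question: str) -> str:
    # Only the selected expert is invoked -- the others stay idle,
    # which is where the compute savings come from.
    expert = gate(question)
    return EXPERTS[expert](question)

print(answer("What is the sum of 1 to 100?"))
print(answer("Fix this Python function"))
print(answer("Tell me about data centres"))
```

The point of the sketch is the division of labour: the gate is cheap to run, and each query lights up only one expert rather than the whole system.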
Now, it is not uncommon for a single patient, whether inpatient or outpatient, to generate several referrals to other specialists in the hospital or specialist outpatient clinics. There is no generalist involved. Once the polyclinic or family physician makes a referral to the specialist or hospital care system, the patient is often stuck in that environment for a long time, if not forever. There is no generalist there coordinating care or the interactions between specialists. The family physician or generalist only coordinates care when the patient leaves the hospital system. Perhaps we can consider having generalists in the hospitals and specialist outpatient clinics to coordinate care and cut down on unnecessary processes that consume lots of time and resources.
Lesson 2: Do not aim for perfection
We are told that the training of AI neural networks basically relies on multiplication of numbers: “months of multiplication across thousands of computer chips”. These chips pack their numbers into 16 bits of memory. But DeepSeek developers managed to squeeze these numbers into only 8 bits of memory, thereby lopping off “several decimals from each number” and saving a lot of memory space in the process. The answer so produced was less accurate but it did not matter. The article stated that “the calculations were accurate enough to produce a really powerful network”.
But that’s not the end of this story. One now has to multiply all these 8-bit numbers together. DeepSeek then stretched the multiplication answer across 32 bits of memory, and in doing so, made the answer more precise. This is why DeepSeek performed just as well, if not better, in certain areas than other AI platforms that consumed far more resources.
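The store-narrow, accumulate-wide trick can be sketched numerically. The scale factor and sample values below are illustrative assumptions, and real systems quantise whole tensors on GPUs rather than Python lists; but the sketch shows why coarse individual numbers can still yield an accurate final answer.

```python
# Each number is squeezed into 8 bits (coarse, slightly wrong), but
# the products are accumulated at full precision, so the final dot
# product stays close to the exact answer. Values are illustrative.

def quantize_8bit(x: float, scale: float = 127.0) -> int:
    """Squeeze a float in roughly [-1, 1] into a signed 8-bit
    integer in [-127, 127]."""
    return max(-127, min(127, round(x * scale)))

def dequantize(q: int, scale: float = 127.0) -> float:
    return q / scale

weights = [0.123456, -0.654321, 0.333333, 0.987654]
inputs = [0.5, 0.25, -0.75, 0.1]

# Each individual 8-bit value loses several decimals...
q_weights = [quantize_8bit(w) for w in weights]

# ...but the sum of products is accumulated in full (wide) precision,
# analogous to stretching the answer across 32 bits.
approx = sum(dequantize(q) * x for q, x in zip(q_weights, inputs))
exact = sum(w * x for w, x in zip(weights, inputs))

print(f"exact:  {exact:.6f}")
print(f"approx: {approx:.6f}")
```

Run it and the two answers agree to within a few thousandths: “accurate enough”, at half the storage cost per number.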
In healthcare, doctors and other healthcare professionals are reminded that we owe the patient a duty of care. Arising from this duty is the concept of standard of care. What is the standard expected of us in every instance of care we deliver? This used to be determined by our peers, but somewhere along the way, the concept of “best practice” crept in.
Best practice is laudable and of course something we should aspire to give. But does the required standard of care necessarily equate to best practice? This hobbit thinks not but many others think so. When a doctor is found wanting in a disciplinary inquiry, the standard of care quoted is often best practice. And when what was done does not quite qualify as best practice, the doctor can be found to be guilty of negligence or professional misconduct etc.
For example, should a doctor be punished when he did not see the patient personally but relied on his registrar’s assessment (even though he did see the patient eventually, albeit 12 hours later), or when he did not order a CT scan one day earlier than when he actually did, and relied on blood tests and an erect chest X-ray in the meantime to detect an intestinal perforation? Somewhere along the way, our medico-legal environment has conflated the required standard of care with best practice, the equivalent of the 16-bit number, when what is needed (or what we can afford) is really the 8-bit product.
We need to learn that “good enough” care is what we should be delivering most of the time, especially in situations where resources are limited and public funds are used. Of course, patients who pay for their own care out of their own pockets and can afford it can ask for best practice care all the time. Elon Musk and Jeff Bezos can ask for and pay for best practice care all the time. But in reality, most of the time and for most people, “good enough” care is all that the person or the system can afford.
This can also be seen in how we choose our healthcare IT systems. Do we have to choose the most comprehensive (read: expensive) system with all the bells and whistles, one that costs not just an arm and a leg but all four limbs to implement and maintain, when, most of the time, these additional features are either not required or never used? Why should we choose the most “perfect” IT system for our hospitals? Could we not have settled for less, i.e. settled for an 8-bit product and not 16, and maybe tried to stretch the output to 32 bits after we were familiar with the system? Could we not have chosen a good-enough system instead of the best system?
Lesson 3: Prioritise your work
Not mentioned in the aforesaid NYT or ST article, but mentioned elsewhere, is that DeepSeek uses a new way of prioritising data which uses far less memory than older methods. This is known as Multi-head Latent Attention (MLA), as opposed to the traditional Multi-head Attention (MHA) method. MLA has been demonstrated to use only 5 to 13% of the memory that MHA uses, and in doing so allows for far more efficient training and deployment. The multiplications we mentioned earlier result in a large amount of data. These data are stored in the form of fundamental data structures known as key-value pairs (KVs), which are then kept in the memory cache.
MLA allows low-priority KVs to be compressed into what are known as latent vectors, and in doing so, reduces the KV cache size dramatically. When these low-priority KVs are needed, they are decompressed again for use.
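The cache-small, reconstruct-on-demand idea can be sketched as follows. The dimensions and the truncation-based “compression” below are toy assumptions chosen only to show the memory arithmetic; real MLA uses learned projection matrices, not truncation.

```python
# An illustrative sketch of the KV-cache compression idea: instead of
# caching full key/value vectors, cache a much smaller "latent" vector
# and reconstruct an approximate KV vector only when it is needed.
# Dimensions and the truncation "projection" are toy assumptions.

FULL_DIM = 64     # size of a full key/value vector
LATENT_DIM = 8    # size of the cached latent vector (8x smaller)

def compress(kv: list[float]) -> list[float]:
    """Down-project a full KV vector into a small latent vector.
    (Here: crude truncation; MLA learns this projection.)"""
    return kv[:LATENT_DIM]

def decompress(latent: list[float]) -> list[float]:
    """Up-project the latent back to full size on demand."""
    return latent + [0.0] * (FULL_DIM - len(latent))

cache = []
for step in range(1000):              # one cache entry per token
    kv = [float(step)] * FULL_DIM     # pretend full KV vector
    cache.append(compress(kv))        # only the latent is stored

full_floats = 1000 * FULL_DIM
cached_floats = sum(len(v) for v in cache)
print(f"cached {cached_floats} floats vs {full_floats} uncompressed")
```

With these toy numbers the cached footprint is 8/64 = 12.5% of the uncompressed cache, in the same ballpark as the 5 to 13% figure quoted above; the price is a small decompression step whenever an old KV is needed again.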
Sometimes in healthcare, we attempt too many things at once. Our in-trays (physical or virtual) are loaded to the brim with different things that demand our attention at the same time, ranging from service requirements to teaching responsibilities to research projects. The myriad demands we place on the system and on our healthcare professionals end up creating so much complexity and consuming so much attention that the system slows down or even becomes paralysed.
Another good example is how we structure our subsidy system, with layers and layers of schemes that make things so complex that our hospitals’ IT and billing systems cannot cope. The result is slower and unsatisfactory performance from both the staff and the IT systems.
We could perhaps look at all the balls we are trying to juggle in the air and prioritise the work. Schemes that have marginal impact could be merged or even dispensed with altogether. Focus on only the few things that matter. Often, a person or an institution cannot be good at all things all at once. Less important things need to be compressed and cached, maybe even disposed of.
If your institution’s waiting time is now a year, perhaps it is time to focus on service delivery and minimise other non-essential stuff. Getting your doctors to run ad-hoc clinics can help in the short run, but it may not help in the long term, as job satisfaction decreases and more people quit, leaving the organisation in a vicious cycle of attrition and more work. It is far better to prioritise your work (and your people) and cut back on the non-service-delivery work. Compress and cache this non-essential work for now.
The above are just three simple examples of how we can learn from DeepSeek. There are many others. The underlying principle of why DeepSeek is revolutionary is that its developers experimented with solutions to real problems and obstacles. The solutions they tried were not just incremental in nature or more of the same thing. By many accounts, most of the folks who worked on DeepSeek were young people fresh out of college, and they looked at things with a fresh perspective. They undoubtedly experimented many times and failed, but by thinking out of the box, they came up with something that was faster, better and far cheaper than what had come before.
Likewise, healthcare system planners should be bold and not think of doing more of the same, because seeking out and getting incremental change is just not good enough anymore.