Author Information¹
Kim, Rayna²; Pei, Celine³; Riotto, Allen⁴
(Editor: Wang, Zifu⁵)
¹ All authors are listed in alphabetical order.
² Thomas Jefferson High School for Science and Technology, VA; ³ Polytechnic School, CA; ⁴ St. Paul VI Catholic High School, VA; ⁵ George Mason University, VA
Background
Large Language Models (LLMs), such as ChatGPT, LLaMA, and Gemini, use deep learning to understand and generate human-like text and have the potential to be applied across a wide range of industries. More recently, LLMs have been incorporated into the medical field to make medical diagnoses, analyze clinical data, and perform various other healthcare-related tasks [4][7]. LLMs can be prompted with different strategies serving various roles. The zero-shot technique asks the model to predict health outcomes directly from its pre-existing knowledge of diseases, medications, and procedures, without being given any examples [4]. Other prompt strategies include active retrieval augmented generation, which fetches documents according to the user's input and then generates a comprehensive response based on the retrieved documents, and chain-of-thought prompting, which has the model generate short sentences imitating the problem-solving process of a person tackling the task [7][9]. The few-shot technique supplies a handful of worked examples so the model can apply its knowledge to new tasks.
Research by Cui et al. (2024) demonstrates that zero-shot and few-shot LLMs tend to err on the side of making positive diagnoses rather than risk false negatives. However, their research also proposes and demonstrates that EHR-CoAgent LLMs perform significantly better than traditional machine learning models and other prompt strategies. EHR-CoAgents, which fuse the abilities of different LLMs, combine predictor agents, critic agents, and instruction-enhanced prompting. Predictor agents use the few-shot technique to generate health predictions from electronic health records (EHRs). The critic agent, drawing on its knowledge of the predictor agent's incorrect outputs, analyzes the predictions and feeds its feedback to the predictor agent; this cycle repeats multiple times. Lastly, instruction-enhanced prompting incorporates the critic agent's feedback by altering the prompts given to the predictor agent. By integrating these prompt strategies, the EHR-CoAgent achieves significantly better outcomes than any single strategy.
Another study, by Han et al. (2024), analyzed GPT-4's capabilities in cardiovascular disease (CVD) risk scoring and found its performance comparable to conventional mathematical risk prediction models such as the Framingham risk score and the American College of Cardiology/American Heart Association (ACC/AHA) risk score [5]. Even with incomplete clinical data missing key variables, GPT-4 was capable of assessing risk scores. GPT-4's flexibility with respect to inputs is key for decision making in a field where full datasets may not be available. When GPT-4 was applied to other ethnic groups (specifically the UK Biobank and KoGES datasets), the model displayed similar tendencies across varying demographics and clinical characteristics. Moreover, the accuracy of the GPT models soared from 33.3% to 86.1% in nearly 30 months, a trajectory that outperformed 10-year CVD risk detection models. Not only can LLMs aid in the creation of risk-prediction models, but they have also been shown to assist with tasks such as "drafting medical documents, creating training simulations, and streamlining research processes" [10]. Although these results look promising, a few potential issues may arise.
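The EHR-CoAgent predictor/critic loop described above can be illustrated with a short sketch. This is a hypothetical outline under our own naming, not Cui et al.'s implementation: query_llm is a placeholder for any chat-completion call, and the prompt wording is invented for illustration.

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a hosted LLM API."""
    raise NotImplementedError

def ehr_coagent_predict(ehr_record: str, examples: str, rounds: int = 3) -> str:
    """Sketch of the predictor/critic loop from Cui et al. (2024)."""
    instructions = "Predict the patient's health outcome (0 or 1)."
    prediction = ""
    for _ in range(rounds):
        # Predictor agent: few-shot prediction from the EHR record.
        prediction = query_llm(
            f"{instructions}\n\nExamples:\n{examples}\n\nPatient EHR:\n{ehr_record}"
        )
        # Critic agent: reviews the prediction and returns feedback.
        feedback = query_llm(
            f"Critique this prediction for errors:\n{prediction}\n\nPatient EHR:\n{ehr_record}"
        )
        # Instruction-enhanced prompting: fold the critique into the next round.
        instructions += f"\nIncorporate this feedback: {feedback}"
    return prediction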
Since the training corpora of GPT-4 and other LLMs are not transparent, there is a possibility of bias against minority groups and uncertainty about whether these AI systems adhere to the ACC/AHA guidelines or established medical literature. Because current models are priced per token (for both input and output), running tens of thousands of queries may not be cost-effective for clients. A potential solution may be to implement token prioritization and change the tokenization scheme to diminish the cost burden of the GPT system [5][3]. Because LLMs make decisions with probabilistic algorithms, identical prompts occasionally yield varied responses, leading to inconsistent answers. In a study of self-diagnosis using LLMs by Balasubramanian and Dakshit (2024), multiple runs of various LLMs with the same information often led to inconsistent diagnoses [2].
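One mitigation for this run-to-run inconsistency is to sample the same prompt several times and take a majority vote, in the spirit of self-consistency decoding [10]. A minimal sketch, reusing the hypothetical query_llm helper from above:

from collections import Counter

def majority_vote_answer(prompt: str, n_runs: int = 15) -> str:
    """Run the same prompt n_runs times and return the most common answer."""
    answers = [query_llm(prompt).strip() for _ in range(n_runs)]
    return Counter(answers).most_common(1)[0][0]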
Objective
This study aimed to identify the most successful LLM prompt strategy for health outcome prediction and to compare the performance of various LLM versions on clinical predictions.
Methods
Clinical data from kidney transplant patients, obtained from the United Network for Organ Sharing (UNOS), were provided to the LLM. Four prompting strategies were considered: zero-shot, few-shot, chain-of-thought, and tree-of-thought. To implement each strategy, specific phrases were included in the instructions given to the LLM. In zero-shot prompting, the LLM was only provided with the data and asked to make a prediction; no examples or further explanations were given. In few-shot prompting, a few examples of inputs and outputs were provided. Chain-of-thought prompting included the key phrase "Let's think step by step", prompting the LLM to reason through its predictions as a stepwise problem-solving process. Tree-of-thought prompting introduced branching paths of reasoning, enabling the LLM to explore different branches of thought before generating a prediction from the initial prompt; the Python library scikit-learn (sklearn) was incorporated into the code to support the tree-of-thought branching. The expected output was either 0, indicating no rehospitalization within one year of transplant, or 1, indicating rehospitalization within one year. After determining the optimal prompt strategy, we evaluated four different LLM versions with the same code and calculated each version's accuracy rate. In experiments with both the prompt strategies and the LLM versions, a temperature of 4.5 was used, as different temperature settings shifted accuracy rates by only about ±3%.
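The sketch below shows how the four prompt framings could be assembled and sent to one of the .gguf models listed in Table 1. It is a minimal sketch assuming the llama-cpp-python bindings; the prompt wording and the example record are illustrative rather than our exact code.

from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q3_K_S.gguf")

def build_prompt(strategy: str, patient_row: str, examples: str = "") -> str:
    task = ("Given this kidney transplant record, answer 1 if the patient is "
            "rehospitalized within one year of transplant, otherwise 0.\n")
    if strategy == "zero_shot":         # data only, no examples or explanation
        return task + patient_row
    if strategy == "few_shot":          # a few labeled input/output pairs first
        return task + examples + "\n" + patient_row
    if strategy == "chain_of_thought":  # key phrase that triggers stepwise reasoning
        return task + patient_row + "\nLet's think step by step."
    if strategy == "tree_of_thought":   # branch over the most influential variables
        return (task + patient_row +
                "\nExplore several branches of reasoning, starting from the most "
                "influential variables, then answer with the best branch.")
    raise ValueError(f"unknown strategy: {strategy}")

out = llm(build_prompt("tree_of_thought", "PRA=85, KDPI=60, days_on_dialysis=900, ..."),
          max_tokens=8, temperature=4.5)
print(out["choices"][0]["text"])  # expected output: "0" or "1"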
Results
Tree-of-thought proved the optimal prompt strategy compared to zero-shot, few-shot, and chain-of-thought prompting. With zero-shot prompting, the LLM was not given enough context to generate a reasonable response and guessed randomly without explanation. Because our study prompted a pre-trained LLM rather than training a machine learning model, no training data were set aside, which prevented us from properly utilizing few-shot prompting: attempts to implement it resulted in the code generating strings of text instead of "0" or "1", wasting a considerable amount of time. Chain-of-thought prompting left the LLM unable to establish connections across the clinical data provided; it fixated on the single most impactful variable instead of considering interactions among all 32 variables. With the tree-of-thought prompt strategy, the Meta-Llama-3-8B-Instruct.Q3_K_S.gguf LLM (Version C) demonstrated the highest accuracy rate, 57.30%, while ggml-model-Q3_K_S.gguf (Version B) achieved the lowest, 45.15%. The latter was also the most time-consuming to evaluate because it did not follow the instructions and generated strings of text instead of "0" or "1"; we had to manually map its outputs to "0" or "1" based on its responses (Table 1).
Table 1. Comparison of LLM Versions
Version | LLM Version | Time Taken (s) | Rows Created | Rows/Second | Accuracy Rate |
A | Meta-Llama-3-8B-Instruct.Q3_K_M.gguf | 2418.8 | 1815 | 0.75 | 53.88% |
B | ggml-model-Q3_K_S.gguf | 854.2 | 1001 | 1.17 | 45.15% |
C | Meta-Llama-3-8B-Instruct.Q3_K_S.gguf | 819 | 1103 | 1.35 | 57.30% |
D | Meta-Llama-3-8B-Instruct.Q3_K_L.gguf | 809.9 | 1001 | 1.23 | 54.55% |
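The manual cleanup of Version B's free-text answers amounts to the post-processing sketched below: coerce each raw response to a binary label, then score accuracy against the UNOS outcome column. The helper names are ours, not from the original code.

import re

def to_binary_label(raw_output: str) -> int | None:
    """Map a model response to 0/1, falling back to the first bare 0 or 1."""
    text = raw_output.strip()
    if text in ("0", "1"):
        return int(text)
    match = re.search(r"\b[01]\b", text)  # handles prose answers like Version B's
    return int(match.group()) if match else None

def accuracy_rate(raw_outputs: list[str], labels: list[int]) -> float:
    """Fraction of parseable predictions that match the true outcome."""
    pairs = [(to_binary_label(o), y) for o, y in zip(raw_outputs, labels)]
    scored = [(p, y) for p, y in pairs if p is not None]
    return sum(p == y for p, y in scored) / len(scored)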
Conclusion
When examining how the sklearn-based tree-of-thought prompting worked, we found that the LLM generated its prediction from the variables that the algorithm interpreted as having the greatest effect on patients' rehospitalization rates. The algorithm determined that the PRA score at the time of transplant (level of antibodies) had the greatest effect on one-year rehospitalization rates, followed by the Kidney Donor Profile Index (KDPI), the number of days spent on dialysis, and the HLA mismatch level. Based on these identified risk factors, the algorithm also determined that a risk factor possessed by the patient had a greater impact on the rehospitalization rate than a risk factor possessed by the donor. For example, even though older age is a risk factor, an older patient paired with a younger donor was predicted to have a higher rehospitalization rate than a younger patient paired with a younger donor.
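This variable ranking is the kind of output scikit-learn exposes directly. A minimal sketch, assuming the UNOS extract is a numerically encoded table with a 0/1 rehospitalization column (the file and column names here are hypothetical):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("unos_kidney.csv")          # hypothetical UNOS extract
X = df.drop(columns=["rehospitalized_1yr"])  # the 32 clinical variables
y = df["rehospitalized_1yr"]                 # 0 = no rehospitalization, 1 = rehospitalized

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
ranking = (pd.Series(tree.feature_importances_, index=X.columns)
             .sort_values(ascending=False))
print(ranking.head(4))  # e.g. PRA score, KDPI, days on dialysis, HLA mismatch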
Age and race are highlighted as two of the most important risk factors for rehospitalization [6]. Interestingly, although age and race did affect the LLM's prediction of one-year rehospitalization, their impact was relatively minor compared with other variables, ranking 6th and 17th, respectively, out of the 32 variables in the algorithm's importance ordering. We speculate that because the calculation of KDPI (ranked 2nd most important by the algorithm) already incorporates a variety of donor-related risk factors, including age and race, the algorithm ranks the individual age and race variables lower [1]. This may also explain why the algorithm generally ranked the variables' importance for rehospitalization differently from previous studies. Within the race variable, the algorithm found Black patients the most likely to be rehospitalized one year after their kidney transplant, followed by White, Hispanic, and Asian patients.
Although our study demonstrates the capability of LLMs to predict patient rehospitalization outcomes from clinical data, a few issues remain open. The LLM pipeline and code used in this paper were developed within a three-week timeframe and with limited funding, whereas the LLMs in other studies benefited from significantly more development time and funding. This could explain why our setup only achieved accuracy rates between 45.15% and 57.30%, in contrast with other LLMs, such as GPT-4, that reach around an 86.1% accuracy rate. Another factor that could influence our accuracy is the scale of the models used. While the Llama-3-8B LLMs we used have 8 billion parameters, current commercially available systems such as the GPT-3.5 and GPT-4 models (reported at roughly 175 billion and 1 trillion parameters, respectively) can process data and draw connections between variables far faster and more capably than the Llama-3-8B LLMs [8]. Future studies addressing these issues would greatly assist in creating an LLM pipeline capable of producing results that positively impact efficacy and efficiency in kidney transplantation.
References
1. Zhang, K., Yan, X., & Meng, X. (2024). The application of large language models in medicine: A scoping review. iScience. https://doi.org/10.1016/j.isci.2024.109713
2. Bachmann, Q., Haberfellner, F., Büttner-Herold, M., Torrez, C., Haller, B., Assfalg, V., Renders, L., Amann, K., Heemann, U., Schmaderer, C., & Kemmner, S. (2022). The Kidney Donor Profile Index (KDPI) Correlates With Histopathologic Findings in Post-reperfusion Baseline Biopsies and Predicts Kidney Transplant Outcome. Frontiers in medicine, 9, 875206. https://doi.org/10.3389/fmed.2022.875206
3. Balasubramanian, N. S. P., & Dakshit, S. (2024). Can Public LLMs be used for Self-Diagnosis of Medical Conditions?. arXiv preprint arXiv:2405.11407.
4. Bhattacharya, M., Pal, S., Chatterjee, S., Lee, S.-S., & Chakraborty, C. (2024). Large Language Model (LLM) to Multimodal Large Language Model (MLLM): a journey to shape the biological macromolecules to biological sciences and medicine. Molecular Therapy Nucleic Acids. https://doi.org/10.1016/j.omtn.2024.102255
5. Cui, H., Shen, Z., Zhang, J., Shao, H., Qin, L., Ho, J. C., & Yang, C. (2024). LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction. arXiv preprint arXiv:2403.15464.
6. Han, C., Kim, D. W., Kim, S., You, S. C., Park, J. Y., Bae, S., & Yoon, D. (2024). Evaluation of GPT-4 for 10-year cardiovascular risk prediction: Insights from the UK Biobank and KoGES data. iScience, 27(2).
7. Iqbal, K., Hasanain, M., Rathore, S. S., Iqbal, A., Kazmi, S. K., Yasmin, F., Koritala, T., Thongprayoon, C., & Surani, S. (2022). Incidence, predictors, and outcomes of early hospital readmissions after kidney transplantation: Systemic review and meta-analysis. Frontiers in medicine, 9, 1038315. https://doi.org/10.3389/fmed.2022.1038315
8. Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., Yang, Y., Callan, J., & Neubig, G. (2023). Active Retrieval Augmented Generation. arXiv preprint arXiv:2305.06983.
9. Mearian, L. (2024, April 11). What are LLMs, and how are they used in generative AI? Computerworld. https://www.computerworld.com/article/1627101/what-are-large-language-models-and-how-are-they-used-in-generative-ai.html
10. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E. H., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint arXiv:2203.11171.