Author Information¹:
Li, Janice²; Shi, Patrick³; Yang, Christopher⁴
(Editor: Wang, Zifu⁵)
¹ All authors are listed in alphabetical order.
² Whitney High School, CA; ³ Thomas Jefferson High School for Science and Technology, VA; ⁴ The Lawrenceville School, NJ; ⁵ George Mason University, VA
Background:
Prompt engineering is crucial in guiding Large Language Models (LLMs) to produce relevant and accurate responses, and studies have explored several frameworks for it [1]. One study developed the EHR-CoAgent framework, which uses GPT-4 to generate disease predictions from Electronic Health Records (EHRs). EHR-CoAgent consists of a predictor agent, which makes initial disease predictions from the input EHR data and provides reasoning to support its decisions, and a critical agent, which reviews the predictor's outputs, identifies errors, and provides feedback to guide subsequent predictions. Related strategies include zero-shot prompting, which asks LLMs to perform tasks without specific examples, and few-shot prompting, which supplies examples to aid performance. Compared with Zero-Shot+ (zero-shot with additional prompting) and few-shot prompting strategies, EHR-CoAgent was the most successful [2]. Another study explored the use of ChatGPT and prompt design to determine HEART (History, ECG, Age, Risk factors, Troponin) scores. The final prompt design included only features shown to increase accuracy; for example, using separate prompts for the history, ECG, and risk-factor subscores outperformed a single combined prompt, and asking ChatGPT to explain its steps rather than output only the final answer also improved results [3]. These studies show that prompt-engineering strategies are crucial for obtaining accurate and relevant responses when using LLMs to predict health outcomes.
Recent advancements have also led to the direct use of LLMs to predict health outcomes. For example, the Health-LLM framework combines large-scale feature extraction with knowledge trade-off scoring to provide personalized health predictions from patient reports. It outperforms traditional techniques in disease prediction accuracy and F1 score, achieving an accuracy of 0.833 and an F1 score of 0.762 and surpassing models such as logistic regression [4]. In addition, researchers used the HeLM framework to incorporate individual-specific data for disease risk estimation; combining tabular and spirogram data yielded a higher AUROC than tabular data alone, supporting HeLM's ability to integrate multiple data modalities [5]. Furthermore, LLMs such as GPT offer the advantage of supplying context to an already pre-trained model, so they can perform few-shot learning with less data by skipping much of the training needed to learn from previously collected data [6]. When tested against traditional models, GPT showed comparable predictive performance, achieving an AUROC of 0.725, similar to the 0.733 of the ACC/AHA model and the 0.728 of the Framingham model [6].
Objective:
This study investigates the use of LLMs to predict one-year hospitalization in liver transplant recipients.
Methods
We tested the efficacy of two LLMs, Llama 3 Instruct and Hermes 2 Pro Mistral, at predicting whether liver transplant recipients would be hospitalized within one year of transplant. De-identified patient and donor data sourced from the UNOS database were fed to the LLMs; this dataset contained records such as the donor's and recipient's age, ethnic background, and medical history, as well as the condition of the transplanted organ. We first prompted the LLMs with a few-shot approach that enforced an all-or-nothing framework: the model had to output either 1 (yes) or 0 (no) to the question of whether a patient would be hospitalized within one year. Each LLM was given 11 variables from the database, and the two models' accuracy rates were compared. A one-proportion z-test was performed to determine whether each accuracy rate was statistically better than random guessing (50%); a sketch of this setup appears at the end of this section. We then modified the prompts so that the LLMs would instead output hospitalization probabilities, employing a Chain-of-Thought prompting technique in which the prompt included responses generated by Mistral containing the rationales behind predicting a set of sample patients' hospitalization probabilities. ROC curves were produced to assess model accuracy. Finally, to compare our results against a conventional baseline, we created a supervised machine learning model. Using the RandomForestClassifier from the scikit-learn library, we trained a decision-tree ensemble on a 90/10 train/test split of the liver patient data. The feature matrix X contained all of the patient and donor variables mentioned earlier except hospitalization, and the target y was set to hosp1Y, the one-year hospitalization indicator. The classifier was evaluated on the held-out 10% of the data, around 1,450 patient IDs.
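To make the binary few-shot prompting and the significance test concrete, the sketch below shows one way they could be set up. The prompt wording, the example records, and the build_prompt helper are illustrative assumptions rather than the exact prompts used in this study; only the all-or-nothing 1/0 output format and the one-proportion z-test against 50% follow the Methods.

```python
import math

# Illustrative few-shot examples; the actual 11 UNOS variables and their
# wording are assumptions for this sketch.
FEW_SHOT_EXAMPLES = (
    "Recipient age: 61, Donor age: 45, MELD: 28, Days on dialysis: 14 -> 1\n"
    "Recipient age: 38, Donor age: 29, MELD: 15, Days on dialysis: 0 -> 0\n"
)

def build_prompt(patient_record: str) -> str:
    """Few-shot, all-or-nothing prompt: the model must answer 1 (yes) or 0 (no)."""
    return (
        "You are predicting whether a liver transplant recipient will be "
        "hospitalized within one year of transplant.\n"
        "Answer with a single digit: 1 for yes, 0 for no.\n\n"
        f"Examples:\n{FEW_SHOT_EXAMPLES}\n"
        f"Patient:\n{patient_record} -> "
    )

def one_proportion_z_test(correct: int, n: int, p0: float = 0.5):
    """One-sided one-proportion z-test: is the accuracy better than chance (p0)?"""
    p_hat = correct / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for a standard normal
    return z, p_value

# Illustrative call: 54 correct binary predictions out of 90 patients.
print(build_prompt("Recipient age: 55, Donor age: 40, MELD: 22, Days on dialysis: 3"))
print(one_proportion_z_test(correct=54, n=90))
```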
Results
1. Few-shot Prompting
Table 1. Results for Llama 3 Instruct (n = 68) and Hermes 2 Pro Mistral (n = 90)
| Model | Accuracy Rate | Sensitivity | Specificity | P-value |
| Llama 3 Instruct | 0.588 | 0.829 | 0.222 | 0.070 |
| Hermes 2 Pro Mistral | 0.596 | 0.700 | 0.462 | 0.032 |
2. Chain-of-Thought (CoT) Prompting
Figure 1. ROC curve for Llama-3 (n = 200).
Area Under the Curve (AUC): 0.476
Figure 2. ROC curve for Hermes 2 Pro Mistral (n = 200).
Area Under the Curve (AUC): 0.503
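For reference, ROC curves and AUC values like those above can be computed from the models' probability outputs with scikit-learn. The sketch below assumes the parsed hospitalization probabilities and the true one-year labels are available as plain lists; the values shown are placeholders, not study data.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# y_true: observed one-year hospitalization (1 = hospitalized, 0 = not).
# y_prob: hospitalization probabilities parsed from the LLM's CoT responses.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.72, 0.40, 0.55, 0.80, 0.35, 0.60, 0.45, 0.30]

fpr, tpr, _ = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

plt.plot(fpr, tpr, label=f"LLM (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```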
3. Decision Tree Classification Model
Accuracy: 0.7497
Precision: 0.7498
Recall: 0.9991
F1: 0.8567
Figure 3. Feature Importance for 30 Patient and Donor Variables
Figure 4. Pearson Correlation Matrix of the Top 10 Features and the Target Variable
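The metrics and feature importances above come from the random-forest baseline described in the Methods. A minimal sketch of that pipeline is shown below; the CSV file name and preprocessing are assumptions, while the hosp1Y target, the RandomForestClassifier, and the 90/10 split follow the Methods.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical file name; the study used de-identified UNOS liver transplant data.
data = pd.read_csv("unos_liver_transplants.csv")

# Features: all patient/donor variables except the target; target: one-year hospitalization.
X = data.drop(columns=["hosp1Y"])
y = data["hosp1Y"]

# 90/10 train/test split, as in the Methods (the 10% test set is ~1,450 patients).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))

# Feature importances, as plotted in Figure 3.
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```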
Conclusion
For few-shot prompting with all-or-nothing responses, Llama 3 Instruct's performance was not significantly better than random guessing (p = 0.070 > 0.05), whereas Hermes 2 Pro Mistral's performance was (p = 0.032 < 0.05). Chain-of-thought prompting with probability responses yielded poor accuracy: AUC scores of 0.476 and 0.503 show that the LLMs offered no advantage over random guessing. Because the chain-of-thought methodology relies on worked examples of complex reasoning, there is potential for selection bias in the types of examples used; for instance, including diabetes as a key risk factor for hospitalization may cause the LLM to overestimate the role of diabetes in patient health. These results suggest that our LLM models were better suited to binary classification than to probability estimation. Previous studies indicate that the simplicity of binary classification allows LLMs to focus on the key features separating two classes, whereas producing continuous probabilities, as in logistic regression, requires assessing many combinations of features and thus greater computational capacity [7].
The ML model's high recall (99.9%), the proportion of true positives among all actual positives, implies that it identifies nearly all of the truly positive instances. Its accuracy (74.9%) also compared favorably with the predictions of the LLMs we tested. This suggests that decision trees, and random forest classifiers in particular, are efficient and accurate at predicting binary outcomes from a set of variables. Several steps could further improve accuracy and precision in future work. One is hyperparameter tuning with the Optuna library, which searches for the combination of hyperparameters that maximizes the model's accuracy; by systematically exploring hyperparameters such as the maximum depth of the decision trees, the minimum samples required to split a node, and the minimum samples per leaf (where each decision is made), the search aims to find a configuration that outperforms the default settings (a sketch appears below). In addition, we could create and test other ML models, such as gradient boosting or neural networks; these methods could also provide insight into feature importance and improve prediction accuracy overall. The top eight most important features in determining patient hospitalization were BMI at the time of transplant, donor BMI, cold ischemic time, recipient age, number of days spent on dialysis, donor age, donor creatinine level (a measure of kidney function), and MELD score at the time of transplant. BMI is a proxy for many other health conditions, such as obesity, diabetes, and high blood pressure, and increased BMI raises the risk of non-alcoholic fatty liver disease in the recipient [8]. In addition, previous studies have shown that poor donor kidney function and extensive dialysis duration before transplant significantly affect post-operative success [9].
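As a concrete illustration of the Optuna tuning suggested above, the sketch below searches over the forest's maximum depth, minimum samples per split, and minimum samples per leaf. The search ranges, trial count, and the synthetic stand-in data are arbitrary choices for illustration; in practice the UNOS training split would be used in place of the generated data.

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the UNOS training split (11 features, binary target).
X_train, y_train = make_classification(n_samples=1000, n_features=11, random_state=42)

def objective(trial: optuna.Trial) -> float:
    # Hyperparameter search space (ranges chosen for illustration only).
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 30),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    clf = RandomForestClassifier(random_state=42, **params)
    # Mean cross-validated accuracy on the training split is the value to maximize.
    return cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```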
Through our research, we conclude that, at present, LLMs are not well suited to predicting one-year hospitalization in liver transplant patients. Beyond the low accuracy rates, we encountered several practical limitations in running the LLMs. For example, running the LLMs in a timely manner requires substantial GPU resources, so cost and time constraints limit their usability. In addition, when attempting binary prompting, some models output sentences recounting their reasoning instead of only 0/1, which made it difficult to calculate accuracy rates; those models were omitted. Further research into additional LLMs and prompt-tuning strategies may help mitigate these issues. In conclusion, we recommend further exploration of prompt-engineering techniques with patient data to improve the adaptability of current LLMs to various medical contexts.
References
1. Wang J, Shi E, Yu S, Wu Z, Ma C, Dai H, Yang Q, Kang Y, Wu J, Hu H, Yue C. Prompt engineering for healthcare: methodologies and applications. arXiv preprint arXiv:2304.14670. 2023.
2. Cui H, et al. LLMs-based few-shot disease predictions using EHR: a novel approach combining predictive agent reasoning and critical agent instruction. arXiv preprint arXiv:2403.15464. 2024.
3. Safranek CW, Huang T, Wright DS, Wright CX, Socrates V, Sangal RB, Iscoe M, Chartash D, Taylor RA. Automated HEART score determination via ChatGPT: honing a framework for iterative prompt development. J Am Coll Emerg Physicians Open. 2024;5(2):e13133.
4. Jin M, Yu Q, Zhang C, Shu D, Zhu S, Du M, Zhang Y, Meng Y. Health-LLM: personalized retrieval-augmented disease prediction model. arXiv preprint. 2024. https://doi.org/10.48550/arXiv.2402.00746
5. Belyaeva A, Cosentino J, Hormozdiari F, Eswaran K, Shetty S, Corrado G, Carroll A, McLean CY, Furlotte NA. Multimodal LLMs for health grounded in individual-specific data. In: Lecture Notes in Computer Science. 2023. p. 86-102. https://doi.org/10.1007/978-3-031-47679-2_7
6. Han C, Kim DW, Kim S, You SC, Park JY, Bae S, Yoon D. Evaluation of GPT-4 for 10-year cardiovascular risk prediction: insights from the UK Biobank and KoGES data. iScience. 2024;27(2).
7. Saarela M, Jauhiainen S. Comparison of feature importance measures as explanations for classification models. SN Applied Sciences. 2021;3(2). https://doi.org/10.1007/s42452-021-04148-9
8. Alqahtani SA, Brown RS. Management and risks before, during, and after liver transplant in individuals with obesity. Gastroenterol Hepatol (N Y). 2023;19(1):20-29. https://pubmed.ncbi.nlm.nih.gov/36865816
9. O'Riordan A, Wong V, McCormick PA, Hegarty JE, Watson AJ. Chronic kidney disease post-liver transplantation. Nephrol Dial Transplant. 2006;21(9):2630-2636. https://doi.org/10.1093/ndt/gfl247