Author Information [1]
Kligys, Kate [2]; Kuppachi, Dhanvi [3]; Shen, Selena [4]; Sigurupati, Shriya [5]; Veliveli, Sashmika [6]
(Editor: Wang, Zifu [7])
[1] All authors are listed in alphabetical order
[2] Mira Costa High School, CA; [3] PVI Catholic High School, VA; [4] Pine View School, FL; [5] Centreville High School, VA; [6] Freedom High School, VA; [7] George Mason University, VA
Background
Many academic publishing systems, faced with a growing number of research paper submissions, have turned to large language models (LLMs) for support in the lengthy peer review process [1]. The phenomenon of affiliation bias, in which the perceived prestige of an author’s institution may influence peer review outcomes, has increasingly become a subject of concern. Recent investigations suggest that LLMs could either exacerbate or reduce this bias, thereby impacting the acceptance rates of medical publications and the progress of medical science [2]. Latona et al. found that LLM-assisted reviews inflated paper scores and acceptance rates within the International Conference on Learning Representations (ICLR), raising further concerns about the validity of LLM assistance [3].
Regarding the capabilities of LLMs, evidence shows that they can handle simple tasks with high efficiency and correctness, but they inconsistently produce expert-level results on harder tasks such as coding [4]. In addition, Kanti et al.’s study, which assessed the ability of three popular LLMs – GPT-3.5, LLaMA2, and PaLM2 – to generate meta-reviews for the peer-review process, concluded that while LLMs can produce correct responses to predetermined prompts, they are limited in their ability to compose meta-reviews for academic papers. Comparing the three LLMs against each other, GPT-3.5 and PaLM2 were rated higher by humans than LLaMA2 on the five main aspects tested in the study (core contributions, strengths, weaknesses, suggestions, and missing references) [4]. LLMs therefore hold promising potential, but with diverse limitations that need to be addressed. Prompting strategies have also been shown to improve LLM results; for example, chain-of-thought (CoT) prompting elicits complex reasoning through intermediate reasoning steps. Other techniques include zero-shot, few-shot, tree-of-thoughts (ToT), and self-consistency prompting [5].
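To illustrate the difference between these strategies, the following minimal Python sketch contrasts a zero-shot prompt with a few-shot chain-of-thought prompt for an abstract-review task; the wording and the worked example are hypothetical and are not the prompts used in this study.

# Hypothetical illustration of zero-shot vs. few-shot chain-of-thought (CoT) prompting.
# The placeholder abstract and example review are not material from this study.
abstract = "<paper abstract goes here>"

# Zero-shot: the model is asked for a decision with no examples or reasoning scaffold.
zero_shot_prompt = (
    "Read the following abstract and decide whether to Accept or Reject it:\n"
    f"{abstract}"
)

# Few-shot CoT: a worked example demonstrates the intermediate reasoning steps
# the model should follow before giving its own decision.
few_shot_cot_prompt = (
    "Example abstract: <example abstract>\n"
    "Reasoning: The methods are sound, the sample size is adequate, and the "
    "conclusions follow from the results.\n"
    "Decision: Accept\n\n"
    "Now review this abstract, reasoning step by step before deciding:\n"
    f"{abstract}\n"
    "Reasoning:"
)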
Currently, an optimal LLM for the task of academic peer review has not yet been determined. This study evaluates the peer-reviewing capabilities of three common LLMs – Llama3, GPT-3.5, and Mistral – based on the following parameters: reasoning, runtime, computing resource cost, accuracy, and fairness (based on the presence of affiliation bias).
Objective
The main objectives of this study are (1) to conduct a comprehensive comparison of the capabilities and limitations of three common LLMs – Llama3, GPT-3.5, and Mistral – in supporting the peer review process, and (2) to rank the fairness of the LLMs based on the presence of affiliation bias in their generated acceptance rates.
Methods
A total of 40 kidney transplant papers were gathered: 20 from Q1 journals and 20 from Q2 journals. Due to the token limits for input into the LLMs, only the abstracts were used to perform the peer review. The three LLMs – Llama3, GPT-3.5, and Mistral – were evaluated on reasoning, runtime, computing resource cost, accuracy, and fairness (specifically, the presence of affiliation bias). After the Q1 and Q2 kidney transplant abstracts were collected, the LLMs were prompted to give their review and their decision of acceptance or rejection across four output criteria: “Overall Evaluation”, “Strengths”, “Weaknesses”, and “Reasoning”. Different prompt-tuning strategies were first tested on a smaller sample of 5 papers per trial to determine the most consistently effective prompt. The final prompt combined the following strategies: Few-shot Chain of Thought (CoT), Template-Based, and Tree of Thought (ToT); a sketch of how these strategies could be combined is given after the rating prompts below. With no standard dataset available as a benchmark, the study relied on the assessments of human reviewers to assign reasoning scores. Each LLM-generated review was assessed on each of the four criteria by two human reviewers using the following scoring system: Strongly Agree = 4, Agree = 3, Neutral = 2, Disagree = 1, Strongly Disagree = 0. Precision and recall scores were also gathered from the human reviewers using the following prompts:
(Precision): “While generating review, [LLM] was precise in capturing the [criteria] as highlighted by at least two reviewers.”
(Recall): “While generating meta-review, [LLM] indeed covered all the [criteria] highlighted by at least two reviewers.”
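The exact wording of the final prompt is not reproduced in this paper. The following Python sketch shows one way the three strategies could be combined into a single review prompt covering the four required output sections; the wording, helper function, and affiliation handling are illustrative assumptions rather than the study’s actual prompt.

# Hypothetical sketch of a review prompt combining Few-shot CoT, Template-Based,
# and Tree of Thought (ToT) strategies. The wording is illustrative only.
REVIEW_TEMPLATE = """You are a peer reviewer for a medical journal.

Example review (few-shot chain of thought):
Abstract: <example abstract>
Overall Evaluation: <one-paragraph assessment>
Strengths: <bulleted strengths>
Weaknesses: <bulleted weaknesses>
Reasoning: <step-by-step reasoning leading to the decision>
Decision: Accept

Now review the abstract below. First consider several candidate evaluations
(tree of thought), compare them, and keep the most defensible one. Then fill
in the same template.

Abstract: {abstract}
Overall Evaluation:
Strengths:
Weaknesses:
Reasoning:
Decision (Accept or Reject):"""


def build_review_prompt(abstract: str, affiliation: str | None = None) -> str:
    # Optionally prepend the author affiliation, enabling the with- vs.
    # without-affiliation comparison described for the fairness score.
    header = f"Author affiliation: {affiliation}\n\n" if affiliation else ""
    return header + REVIEW_TEMPLATE.format(abstract=abstract)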
Lastly, for the fairness score, LLM evaluations of papers with affiliations provided were compared against evaluations of the same papers without affiliations. Each transplant paper was labeled as prestigious or non-prestigious based on the “World Top 100 Institutions Rankings 2024”, sorted by the number of citations over the last six years. The acceptance rates for prestigious and non-prestigious papers, with and without affiliations provided, were then gathered, and fairness scores were determined from them.
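The study does not publish an explicit fairness formula; the sketch below assumes one reasonable definition in which fairness is highest when the acceptance rates of prestigious and non-prestigious papers are equal, and it uses made-up decisions purely for illustration.

# Hypothetical fairness computation. Defining fairness as 1 minus the absolute
# acceptance-rate gap between the two prestige groups is an assumption; the
# study does not state its exact scoring rule.
def acceptance_rate(decisions: list[str]) -> float:
    # Fraction of "Accept" decisions among Accept/Reject strings.
    return sum(d == "Accept" for d in decisions) / len(decisions)


def fairness_score(prestigious: list[str], non_prestigious: list[str]) -> float:
    # 1.0 means identical acceptance rates; lower values indicate a larger gap
    # and therefore more apparent affiliation bias.
    return 1.0 - abs(acceptance_rate(prestigious) - acceptance_rate(non_prestigious))


# Illustrative call with made-up decisions for one LLM with affiliations shown:
print(fairness_score(["Accept"] * 9 + ["Reject"],         # 90% accepted
                     ["Accept"] * 7 + ["Reject"] * 3))    # 70% accepted -> 0.8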
Results
A statistically significant difference between the numbers of accepted prestigious and non-prestigious articles was found only for GPT-3.5, and only when affiliations were provided. With affiliations provided, GPT-3.5’s acceptance rate for prestigious articles increased from 68.42% to 94.74%, while its acceptance rate for non-prestigious articles fell from 80.95% to 76.19%. GPT-3.5’s total acceptance rate rose from 75% to 85% after affiliations were provided (Figure 1). Overall, without affiliations provided, Mistral, Llama3, and GPT-3.5 had total acceptance rates of 100%, 97.50%, and 75%, respectively (Figure 2).
Figure 1. GPT-3.5 With vs. Without Affiliations Provided
Figure 2. Total Acceptance Rates
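The Results do not name the statistical test behind the significance claim above; the sketch below shows one common way two acceptance proportions of this size could be compared (Fisher’s exact test), using made-up accept/reject counts rather than the study’s data.

# Hypothetical significance check on two groups' accept/reject counts.
# Fisher's exact test is one common choice for small 2x2 tables; the test the
# authors actually used is not reported, and these counts are made up.
from scipy.stats import fisher_exact

prestigious = [18, 2]       # [accepted, rejected], illustrative only
non_prestigious = [11, 9]   # [accepted, rejected], illustrative only

odds_ratio, p_value = fisher_exact([prestigious, non_prestigious])
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")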
The RAM usage of Mistral was 4.14 GB, whereas that of Llama3 was 3.66 GB. This data was unavailable for GPT-3.5 because that model runs remotely on OpenAI’s servers rather than on the local machine (Figure 3). Computing resources also affect runtime: Mistral’s runtime was 135.14 seconds, while Llama3’s was 9,646.10 seconds. Because GPT-3.5’s computing resource cost could not be measured in the same way, its runtime comparison is inconclusive (Figure 4).
Figure 3. Computing Resource Cost of LLM Models (GB)
Figure 4. Runtime of LLM Models (s)
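The paper does not describe how runtime and RAM usage were measured. The sketch below shows one possible way to collect such figures for a locally hosted model, assuming for illustration that Llama3 and Mistral are served through Ollama’s local HTTP API; the endpoint, parameters, and use of psutil are assumptions, not details reported in the study.

# Hypothetical measurement of wall-clock runtime and a rough system-RAM delta
# for one review call against a locally served model (Ollama's default API is
# assumed here purely for illustration).
import time

import psutil
import requests


def timed_review(model: str, prompt: str) -> tuple[str, float, float]:
    ram_before = psutil.virtual_memory().used        # system-wide, rough proxy
    start = time.perf_counter()
    response = requests.post(
        "http://localhost:11434/api/generate",       # default Ollama endpoint
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    elapsed = time.perf_counter() - start
    ram_delta_gb = (psutil.virtual_memory().used - ram_before) / 1e9
    return response.json()["response"], elapsed, ram_delta_gb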
In predicting acceptance into a Q1 journal, GPT-3.5 had a prediction accuracy of 75%, scoring higher than Mistral and Llama3, which had prediction accuracies of 50% and 47.50%, respectively (Figure 5).
Figure 5. Prediction Accuracy
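Prediction accuracy here appears to treat a paper’s actual journal tier as ground truth, counting an “Accept” as correct for Q1 papers and a “Reject” as correct for Q2 papers. The following is a minimal sketch under that assumption, with made-up decisions.

# Hypothetical accuracy calculation: a decision is counted as correct when
# "Accept" coincides with the paper actually appearing in a Q1 journal.
def prediction_accuracy(llm_decisions: list[str], is_q1: list[bool]) -> float:
    correct = sum((d == "Accept") == q1
                  for d, q1 in zip(llm_decisions, is_q1, strict=True))
    return correct / len(llm_decisions)


# Illustrative call with made-up decisions for four papers (two Q1, two Q2):
print(prediction_accuracy(["Accept", "Reject", "Accept", "Reject"],
                          [True, True, False, False]))   # -> 0.5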
Figure 6. LLM Reasoning Comparisons
Overall, Mistral achieved the highest ratings in reasoning, earning the top score in the total reasoning evaluation (Figure 6), the highest precision and recall (Figure 7), and the highest ratings in each of the four output criteria: Overall Evaluation, Strengths, Weaknesses, and Decision Reasoning (Figure 8). GPT-3.5 consistently scored second, and Llama3 consistently received the poorest ratings.
Figure 7. Precision and Recall Based on LLM Testing
Figure 8. LLM Output Ratings
Evaluation Comparison
It is important to acknowledge that the analyses and tests were run on different computers with different settings and mostly different GPUs/CPUs. As a result, outputs may differ between testers because of each computer’s configuration. Another limitation is the sample size: only 40 articles (20 Q1 and 20 Q2) were analyzed, whereas a corpus of roughly 100 or more articles would have been preferable. Future work should draw on many more articles from multiple sources so that the data are more comprehensive. The smaller sample traces back to limited access to high-performance computers with strong GPUs/CPUs, which, together with the limited time available, was the primary reason more than 100 articles could not be tested.
Conclusion
Based on the comprehensive comparison performed, Mistral was determined to be the top-performing model: it excelled in reasoning, runtime, and fairness, although it fell short in computing resource cost and accuracy. Although Llama3 produced the worst accuracy, reasoning, and runtime, and was second to worst in computing resource cost, it exhibited the least affiliation bias along with Mistral; this outweighs the other parameters given the critical need for equity when evaluating papers for publication. With no affiliations provided, GPT-3.5 outperformed the other two models in accuracy and placed a consistent close second in reasoning scores. However, when provided with affiliations, it generated the least fair results, showing a higher tendency to accept prestigious articles and reject non-prestigious ones. It can therefore be concluded that Mistral is the LLM best suited to evaluating medical papers and generating reviews without bias, while GPT-3.5 is the most susceptible to affiliation bias. Future research directions include testing with a larger sample size, comparisons between other LLMs, exploring more in-depth prompt-tuning strategies to decrease the acceptance rates of the LLMs, research into ways to increase fairness in LLMs, and additional investigation of other types of bias, such as gender, race, cultural, or geographic bias, in LLM peer reviewing.
References:
1. Flanagin, Annette, et al. “Guidance for Authors, Peer Reviewers, and Editors on Use of AI, Language Models, and Chatbots.” JAMA, vol. 330, no. 8, Aug. 2023, pp. 702–3, https://doi.org/10.1001/jama.2023.12500.
2. Hosseini, Mohammad, and S. P. J. M. Horbach. “Fighting Reviewer Fatigue or Amplifying Bias? Considerations and Recommendations for Use of ChatGPT and Other Large Language Models in Scholarly Peer Review.” No. 1, May 2023, https://doi.org/10.1186/s41073-023-00133-5.
3. Latona, Giuseppe Russo, et al. “The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates.” arXiv, 3 May 2024, https://doi.org/10.48550/arXiv.2405.02150.
4. Kanti, Shubhra, et al. “Prompting LLMs to Compose Meta-Review Drafts from Peer-Review Narratives of Scholarly Manuscripts.” arXiv, 2024, https://arxiv.org/pdf/2402.15589.
5. “Prompting Techniques.” Prompt Engineering Guide, 2024, https://www.promptingguide.ai/techniques. Accessed 8 July 2024.