The aim of this study was to evaluate the performance of four LLMs (ChatGPT 4.0, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Llama 3.1–405b) in generating dentistry-related content across four different scenarios. The focus was on assessing inter-rater reliability, understandability, actionability, readability, and response characteristics. The findings indicate notable variations in model performance across these criteria. Llama 3.1–405b demonstrated superior inter-rater reliability, indicating consistent ratings across raters, but it performed less well in understandability and actionability compared to ChatGPT 4.0.
Both the American Medical Association (AMA) and the National Institutes of Health (NIH) recommend a 6th to 8th grade reading level for health materials [1, 10, 11]. This range is advised because many patients have reading skills at or below this level, and health materials above this threshold risk being too complex, potentially limiting comprehension and effective self-care. The results of this analysis showed mixed performance across the LLMs. Llama 3.1–405b and Claude 3.5 Sonnet came closest to meeting this recommendation, with one scenario each falling within the 7th to 8th grade range. However, ChatGPT 4.0 and Gemini 1.5 Flash tended to produce content at a higher grade level for all scenarios, which may make the material more challenging for patients to understand. Readability formulas like Flesch-Kincaid provide quantitative estimates but may not fully capture complexity due to medical jargon or sentence structure. This highlights the importance of human oversight to ensure language is appropriately simple and clear for diverse patient populations. While these models performed well in many aspects, none of them consistently reached the ideal 6th grade level, underscoring the need for human intervention to simplify the content to align with the recommended readability levels.
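To illustrate how such a formula works, the Flesch-Kincaid grade level is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) - 15.59. The following is a minimal Python sketch under stated assumptions: the sentence splitter and the vowel-group syllable counter are illustrative heuristics, not the tool used in this study, so scores may differ from those of established readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, discount a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    # Naive sentence split on terminal punctuation; abbreviations will be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the study's model outputs.
    sample = "Brush your teeth two times a day. Use a soft brush."
    print(round(flesch_kincaid_grade(sample), 2))
```

Because the formula depends only on sentence length and syllable counts, a short sentence containing an unfamiliar dental term can still receive a low grade, which is consistent with the limitation noted above.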
Our findings highlight notable differences in readability, word count, and sentence structure across the LLMs evaluated. Interestingly, these factors can be influenced by how the prompts are framed. For example, explicitly instructing the models to “use simple and easy words so that a sixth grader can understand” or “limit responses to 100 words” may improve readability and conciseness. Such strategies are valuable for tailoring LLM outputs to different audiences or scenarios, especially in health communication or patient education contexts. Future work could systematically explore how prompt modifications affect readability and length across various models and scenarios.
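As a hedged illustration of the prompt framing described above, the sketch below appends optional readability and length instructions to a base prompt and checks whether a response respects the word limit. The function names, the base prompt, and the exact instruction wording are hypothetical and were not part of this study's protocol.

```python
from typing import Optional


def build_prompt(base_prompt: str,
                 audience: Optional[str] = None,
                 max_words: Optional[int] = None) -> str:
    """Append optional readability and length constraints to a base prompt."""
    parts = [base_prompt.strip()]
    if audience:
        parts.append(f"Use simple and easy words so that a {audience} can understand.")
    if max_words:
        parts.append(f"Limit the response to {max_words} words.")
    return " ".join(parts)


def within_word_limit(response: str, max_words: int) -> bool:
    """Check a model response against the requested word limit (whitespace-delimited)."""
    return len(response.split()) <= max_words


if __name__ == "__main__":
    # Hypothetical base prompt for illustration only.
    base = "Write patient instructions for care after a tooth extraction."
    print(build_prompt(base, audience="sixth grader", max_words=100))
```

Generating the same base prompt with and without each constraint would allow the effect of every modification on readability and length to be compared directly.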
Patient education materials should be clear, concise, and easily understandable to ensure effective communication [12]. Key features include simple, non-technical language that is accessible to a wide range of literacy levels, along with a logical structure that guides the reader through the content [13]. Visual aids, such as diagrams, infographics, or images, are crucial in enhancing understanding and clarifying complex medical concepts [14]. Actionable steps or instructions should be prominently highlighted to help patients follow through with care recommendations. Furthermore, the material should be culturally sensitive and tailored to the patient’s specific needs, ensuring that it resonates with their background and health conditions [15, 16]. It should also include clear contact information for further questions or assistance, fostering patient engagement and empowerment. Lastly, materials should be visually appealing, with a clean layout and ample white space so that patients can easily navigate them and focus on important information.
None of the responses from the four models in this study included images, infographics, or other visual representations, primarily because these models are designed to generate and process text-based content. While they excel at providing written responses, they are not inherently equipped to produce or interpret visual elements like images or diagrams [17]. ChatGPT 4.0 can generate images in some contexts, depending on the platform and settings used. Even so, the models remain focused on generating human-readable text for a variety of applications, including healthcare communication, and generally lack integrated image creation or editing functionality [17, 18, 19]. As a result, their output is limited to textual information, and human intervention is needed to add visual aids, such as images or infographics, during the final stages of content development, especially for PEMs, where visual aids play a crucial role in improving comprehension.
Beyond the lack of images, LLMs cannot offer personalized content tailored to an individual’s specific health condition, demographic, or preferences, as they rely on general inputs. To overcome the general-purpose nature of these models and improve their domain specificity, recent efforts have focused on adapting LLMs using approaches such as Retrieval Augmented Generation (RAG). RAG combines LLMs with external knowledge retrieval, allowing models to access up-to-date and specialized information relevant to a user’s query. This method can enhance the accuracy and contextual relevance of generated content in healthcare settings. Batool et al. [20] demonstrated the use of an embedded GPT model tailored for post-operative dental care, showing improved performance compared to standard ChatGPT. Similarly, Umer et al. [21] applied RAG-enhanced LLM techniques to transform educational journal clubs, addressing specific learning challenges. Incorporating such domain-adapted models may bridge the gap between generalist LLM outputs and the need for precise, personalized patient education materials. Without such augmentation, general-purpose LLMs also lack the ability to generate real-time updates or access live data, meaning that the content may not reflect the most current clinical guidelines or patient outcomes. These models also do not provide clinical decision support or patient-specific instructions, nor do they ensure compliance with local healthcare regulations, making human oversight necessary. Furthermore, LLMs cannot replicate the human element of empathy, which is essential for reassuring patients, nor do they always account for cultural sensitivities or provide reliable citations [22, 23]. As a result, while LLMs can generate informative content, they are not fully equipped to produce dynamic, personalized, and compliant patient information materials without human intervention.
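The retrieval step in RAG can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by Batool et al. [20] or Umer et al. [21]: passages are ranked by simple word overlap rather than by learned embeddings, the passage texts are placeholders, and no model is actually called; the assembled prompt would be passed to an LLM in a real system.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def overlap_score(query: str, passage: str) -> int:
    """Count the tokens shared by query and passage (multiset intersection)."""
    return sum((Counter(tokenize(query)) & Counter(tokenize(passage))).values())


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Return the k passages with the highest overlap score."""
    return sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)[:k]


def build_rag_prompt(query: str, passages: List[str], k: int = 2) -> str:
    """Assemble a prompt that grounds the answer in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"


if __name__ == "__main__":
    # Placeholder knowledge base (hypothetical); a real system would index vetted guidelines.
    knowledge_base = [
        "Placeholder guidance on care after a tooth extraction.",
        "Placeholder guidance on daily brushing and flossing.",
        "Placeholder guidance on what to do when a tooth is knocked out.",
    ]
    question = "How should I care for my mouth after a tooth extraction?"
    print(build_rag_prompt(question, knowledge_base))
```

Because the model is asked to answer from retrieved passages rather than from its training data alone, updating the knowledge base changes the output without retraining the model.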
One limitation of the current study relates to the simplicity of the prompts provided to the LLMs. Although identical base prompts were used for all models to maintain consistency and minimize variability due to prompt design, these prompts were intentionally kept basic. It is well established in the literature that the quality of LLM outputs depends heavily on the quality and specificity of the prompts given [24, 25, 26]. More complex or detailed prompts could potentially elicit more accurate or nuanced responses from the models [27]. However, we deliberately chose simple prompts to simulate typical real-world scenarios where users may not craft elaborate instructions. This approach reflects practical conditions under which PEMs might be generated by users with limited expertise in prompt engineering. Future research could explore how varying prompt complexity impacts the quality of generated health communication materials.
This study evaluated LLM performance using only four dental scenarios. While these scenarios were chosen for their clinical relevance and diversity—covering preventive care, emergency management, routine post-treatment instructions, and early detection—they represent only a subset of the broad range of patient education needs in dentistry. Consequently, the findings may have limited generalizability to other dental topics or more complex clinical situations. Future research should include a wider variety of scenarios to better assess the comprehensive capabilities of LLMs in dental patient education.
In conclusion, while LLMs demonstrate promising capabilities in generating patient education materials, their current limitations underscore the critical need for human oversight and intervention. Although these models excel at producing coherent text-based content, they generally lack the ability to create visual aids, tailor information to individual patient characteristics, or integrate real-time clinical data. Additionally, LLMs cannot fully replicate essential human qualities such as empathy and cultural sensitivity, which are crucial for effective healthcare communication. Recent advancements, including adaptation approaches such as RAG, offer pathways to enhance model specificity and relevance in healthcare domains. However, even with these improvements, LLM-generated content should be considered a supportive tool for healthcare professionals rather than a standalone solution. Ensuring optimal patient understanding and engagement requires continued refinement of these models combined with active human involvement to address their current shortcomings.
