IJIET 2025 Vol.15(3): 629-639
doi: 10.18178/ijiet.2025.15.3.2271
ChatGPT-4o’s Reasoning Performance on Two-Tier Test of Static Fluid
I Komang Werdhiana1, Sarintan Nurcahyati Kaharu2, Rahmad Tule3, and Jusman Mansyur1,*
1. Physics Education Department, Tadulako University, Palu, Indonesia
2. Elementary School Teacher Education Department, Tadulako University, Palu, Indonesia
3. State Junior High School 2 Ampana Kota, Ampana, Indonesia
Email: komangwerdhiana@untad.ac.id (I.K.W.); sarintankaharu@untad.ac.id (S.N.K.); tulerahmad@gmail.com (R.T.); jusman_mansyur@untad.ac.id (J.M.)
*Corresponding author
Manuscript received October 24, 2024; revised November 14, 2024; accepted December 23, 2024; published March 20, 2025
Abstract—This study examined the reasoning performance of ChatGPT-4o using a two-tier test on static fluid and compared it with that of students at various educational levels. The study involved 61 new chats with ChatGPT-4o, 105 junior high school students (from two grade levels), 132 high school students (from two grade levels), and 201 university students majoring in physics education (across four academic years). Data were collected using a 25-item two-tier test administered to ChatGPT-4o through a prompting process with the Artificial Intelligence (AI) system, and an online two-tier test for the student respondents. Data analysis combined a quantitative approach, used to evaluate reasoning performance scores across all respondents, with a qualitative approach incorporating phenomenographic analysis, used to study ChatGPT-4o’s reasoning behaviour. The analysis revealed that ChatGPT-4o’s performance in answering questions (Tier-1) was lower than that of the students, whereas it outperformed the students in providing justification or reasoning (Tier-2). On paired items, ChatGPT-4o also outperformed the students. Overall, the reasoning performance of both ChatGPT-4o and the students was categorized as low. The outcome space derived from the phenomenographic analysis comprised the following categories of ChatGPT-4o’s reasoning behaviour: reasoning based on formulas; consistency in reasoning pathways; ability to reconcile with alternative ideas; context-dependent reasoning abilities and difficulties; and tendencies to provide biased or contradictory reasoning or explanations. It is therefore concluded that ChatGPT-4o still requires further refinement and database enhancement, particularly for static fluid cases available on the internet.
Keywords—Artificial Intelligence (AI), ChatGPT-4o, reasoning, static fluid, two-tier test
Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
Cite: I Komang Werdhiana, Sarintan Nurcahyati Kaharu, Rahmad Tule, and Jusman Mansyur, "ChatGPT-4o’s Reasoning Performance on Two-Tier Test of Static Fluid," International Journal of Information and Education Technology, vol. 15, no. 3, pp. 629-639, 2025.