TY - GEN
T1 - People cannot distinguish GPT-4 from a human in a Turing test
AU - Jones, Cameron Robert
AU - Rathi, Ishika
AU - Taylor, Sydney
AU - Bergen, Benjamin K.
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/6/23
Y1 - 2025/6/23
AB - AI systems that can fool people into thinking that they are human could have widespread social and economic consequences. In order to measure this ability, we evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5 minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, significantly more often than ELIZA (22%) but less often than actual humans (67%). In order to test the generalizability of our results, we replicated the study on a second population (undergraduate students) and found that the same prompt with GPT-4o achieved a pass rate of 77%, slightly higher than the human pass rate of 71%. On some interpretations, the results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants' strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence. We release the full transcripts of the replication data to enable further investigation of human-AI interaction dynamics and deception.
KW - Turing test
KW - deception
KW - human-AI interaction
KW - interactive evaluation
KW - large language models
KW - sociotechnical safety
UR - https://www.scopus.com/pages/publications/105010833893
U2 - 10.1145/3715275.3732108
DO - 10.1145/3715275.3732108
M3 - Conference contribution
AN - SCOPUS:105010833893
T3 - ACM FAccT 2025 - Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency
SP - 1615
EP - 1639
BT - ACM FAccT 2025 - Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency
PB - Association for Computing Machinery, Inc
T2 - 8th Annual ACM Conference on Fairness, Accountability, and Transparency, FAccT 2025
Y2 - 23 June 2025 through 26 June 2025
ER -