A few days ago I ran a little experiment with ChatGPT: I turned off the “Web search” feature in the personalization settings, cutting GPT-5 off from the Internet and forcing it to rely on its own weights for everything. I then asked a simple question: “Who is the president of the United States?”. What GPT-5 returned surprised me:

What is going on? GPT-5’s knowledge cutoff is June 2024, before the results of the 2024 presidential election came in. It’s easy to brush this off as an LLM quirk caused by outdated knowledge - but if you think about it, it really should know better: it knows that a presidential election was scheduled for late 2024, and it also knows today’s date (from OpenAI’s system prompt for ChatGPT), so it should simply admit that it doesn’t know. Instead, it replies with a factual lie.
To be clear, the Thinking model fares much better: after 12 seconds, it tells me it can’t check the latest info since web browsing is disabled.

But still, I really don’t think it’s OK that the latest and greatest GPT-5 base model - which allegedly cost hundreds of millions of USD just to train, and is the result of so many computer scientists’ hard work and literally billions of USD of accumulated R&D - makes such a simple mistake. This got me into sleuthing mode: how well does the latest GPT-5 model handle simple questions like this when it doesn’t have access to the Internet?
I didn’t want to start that many chats from the web UI, so I wrote a very simple Python script to test it: it calls the ChatGPT API 100 times in parallel with the same question (e.g. “Who is the president?”) and records the answers to a CSV file. I also created a minimal system prompt so the model knows the current date: Answer the question as accurately and concisely as possible. Today is November 07, 2025 (the date is programmatically set to the current date). Unless stated otherwise, I use the gpt-5-chat-latest model, which is the model used on the ChatGPT website. (I did notice fewer hallucinations with the gpt-5 model, but it costs far more - more on that later.)
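For reference, here’s a simplified sketch of the script. The question and output filename below are illustrative, and real code would want retries and error handling, but the model name and system prompt are the ones described above:

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from datetime import date

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "Who is the current president of the United States of America?"
SYSTEM_PROMPT = (
    "Answer the question as accurately and concisely as possible. "
    f"Today is {date.today():%B %d, %Y}."  # programmatically set to today
)
N_SAMPLES = 100

def ask_once(_: int) -> str:
    """One independent chat completion with the same question."""
    response = client.chat.completions.create(
        model="gpt-5-chat-latest",  # the model behind the ChatGPT website
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": QUESTION},
        ],
    )
    return response.choices[0].message.content

# Fire off all 100 requests in parallel and collect the answers in order.
with ThreadPoolExecutor(max_workers=N_SAMPLES) as pool:
    answers = list(pool.map(ask_once, range(N_SAMPLES)))

# Record the raw answers to a CSV file for later tallying.
with open("answers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["answer"])
    writer.writerows([a] for a in answers)
```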
After collecting the “survey” samples from the LLM, I didn’t want to tally them by hand, so I use another API call to sort the answers into predefined categories (including an “ambiguous” category, added by default to catch answers that can’t be categorized - “I don’t know” would also land here). The sketch below shows how.
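Continuing from the script above, the categorization pass looks roughly like this. The specific categories are just an example for the president question, and the classifier prompt is a simplification of what I actually ran:

```python
from collections import Counter

# Example categories for the president question; "ambiguous" catches
# everything else, including "I don't know"-style answers.
CATEGORIES = ["Joe Biden", "Donald Trump", "ambiguous"]

def categorize(answer: str) -> str:
    """Ask the model to sort one recorded answer into a predefined category."""
    response = client.chat.completions.create(
        model="gpt-5-chat-latest",
        messages=[
            {
                "role": "system",
                "content": (
                    "Sort the user's answer into exactly one of these categories: "
                    f"{', '.join(CATEGORIES)}. If it fits none of them, reply "
                    "'ambiguous'. Reply with the category name only."
                ),
            },
            {"role": "user", "content": answer},
        ],
    )
    label = response.choices[0].message.content.strip()
    # Guard against the classifier itself going off-script.
    return label if label in CATEGORIES else "ambiguous"

# Tally the 100 recorded answers.
tally = Counter(categorize(a) for a in answers)
print(tally.most_common())
```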
Also, this is a good time to say that this is of course a quick hobby investigation, not academic in any sense - or statistically rigorous. Please take all the analyses and claims here with a big grain of salt. (I still think it points to a serious problem!)
So, here’s how 100 parallel GPT-5 calls answered “Who is the current president of the United States of America”:
(add results)
(add “we did it joe” meme)