You may have already found that ChatGPT can aid your Bible study. But ChatGPT is not the only game in town. There are many options: Anthropic, Google, Meta, and Mistral have all released easily accessible AI chatbots.
That raises the question for Bible readers: which AI is best for assisting Bible study?
In this post, I detail the results of a Bible knowledge benchmark test of several popular AI language models.
I tested four popular online AI chat models. For folks who are paranoid about privacy, I also included two models you can run locally on your PC.
Without further ado, below are the results.
AI Model | Accuracy |
---|---|
OpenAI ChatGPT 3.5 | 94.0% |
OpenAI ChatGPT 4 | 95.7% |
Google Gemini Pro | 89.5% |
Anthropic Claude 3 Sonnet | 93.5% |
Meta AI Llama 3 8B (local model) | 73.8% |
Mistral 7B (local model) | 79.4% |
Test methodology
I will use the Bible Trivia dataset to test each model. It contains 1,290 Bible trivia questions of varying difficulty.
Here are some sample questions.
What is the first book in the Bible? (Answer: Genesis.)
What was most likely the first of Paul’s letters written? (Answer: 1 Thessalonians)
Who bought Joseph? (Answer: Potiphar, captain of the Pharaoh’s guards)
The same prompts are used to test each model. Due to the repetitive nature of the test, I use the APIs to query the language model programmatically.
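The querying loop can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script used for this post: `ask_model` is a hypothetical stand-in for whatever API client wraps each model, and the prompt wording in `PROMPT_TEMPLATE` is assumed, since the exact prompt is not given here.

```python
from typing import Callable, Dict, List

# Hypothetical prompt template; the exact wording used in the test is not given.
PROMPT_TEMPLATE = (
    "Answer the following Bible trivia question briefly.\n"
    "Question: {question}"
)


def run_benchmark(
    questions: List[Dict[str, str]],
    ask_model: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Send the same prompt for every question to one model and collect replies.

    `ask_model` is a placeholder for a function that sends a prompt to a
    model's API and returns the text of its reply.
    """
    results = []
    for item in questions:
        prompt = PROMPT_TEMPLATE.format(question=item["question"])
        reply = ask_model(prompt)
        results.append(
            {
                "question": item["question"],
                "expected": item["answer"],
                "model_answer": reply.strip(),
            }
        )
    return results


if __name__ == "__main__":
    sample = [{"question": "What is the first book in the Bible?", "answer": "Genesis"}]
    # A fake model so the sketch runs without network access.
    print(run_benchmark(sample, lambda prompt: " Genesis. "))
```

Passing the model in as a function keeps the loop identical for every model, so only the API wrapper changes between runs.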
A model’s answer doesn’t need to match the dataset’s answer word for word to count as correct. It’s correct if it means the same thing or contains the dataset’s answer. A separate API call to ChatGPT 3.5 judges whether the two answers are semantically equivalent.
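One way this lenient grading could look is sketched below. The details are assumptions for illustration: the containment check is done as a simple normalized string match, `ask_judge` is a hypothetical function wrapping the grading model’s API, and the wording of the judge prompt is invented, since the actual prompt is not given here.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def build_judge_prompt(question: str, expected: str, answer: str) -> str:
    """Hypothetical grading prompt; the wording used in the test is not given."""
    return (
        "Do these two answers to the question mean the same thing? "
        "Reply with only Yes or No.\n"
        f"Question: {question}\n"
        f"Reference answer: {expected}\n"
        f"Candidate answer: {answer}"
    )


def is_correct(
    question: str,
    expected: str,
    answer: str,
    ask_judge: Callable[[str], str],
) -> bool:
    """Return True if the answer contains or semantically matches the reference."""
    norm_expected = normalize(expected)
    norm_answer = normalize(answer)
    # Cheap check first: the reply contains the reference answer as whole words.
    if norm_expected and f" {norm_expected} " in f" {norm_answer} ":
        return True
    # Otherwise defer to the judge model (a placeholder callable here).
    verdict = ask_judge(build_judge_prompt(question, expected, answer))
    return normalize(verdict).startswith("yes")
```

With this sketch, "The first book is Genesis." is accepted by the containment check alone, while an answer like the more specific "gates of Gaza" reply discussed later would depend on the judge’s verdict.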
If you look at the dataset closely, you will find a few wrong answers and ambiguous questions. The real accuracies are likely to be higher. Pay attention to the relative scores between models instead of the absolute values.
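To see why faulty questions push the absolute scores down, here is a small illustrative sketch. The graded results and the flagged question index are made up for the example; they are not taken from the benchmark.

```python
from typing import Iterable, List


def accuracy(graded: List[bool], exclude: Iterable[int] = ()) -> float:
    """Fraction of correct answers, skipping question indices in `exclude`."""
    skip = set(exclude)
    kept = [ok for i, ok in enumerate(graded) if i not in skip]
    if not kept:
        raise ValueError("no questions left to score")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Made-up grades for five questions; index 4 is a hypothetical faulty question.
    graded = [True, True, False, True, False]
    print(accuracy(graded))               # score over all five questions
    print(accuracy(graded, exclude=[4]))  # score with the faulty question removed
```

Every model is graded against the same flawed questions, which is why the comparison between models is more trustworthy than any single percentage.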
Model evaluations
OpenAI: ChatGPT 3.5 and 4
Tested versions:
- ChatGPT 3.5: turbo 0125
- ChatGPT 4: turbo 2024-04-09
OpenAI was among the first companies to focus on scaling up large language models. It’s no surprise their models take the top spots.
As of April 2024, OpenAI’s ChatGPT models are the best performers in Bible knowledge, with ChatGPT 3.5 scoring 94.0% and ChatGPT 4 scoring 95.7%.
Looking at the answers ChatGPT 4 got marked wrong, it is not clear they are actually incorrect. Either the question is ambiguous or the answer is more detailed than the reference answer in the dataset.
For example:
What did Samson carry to the top of the hill overlooking Hebron?
The dataset’s answer is “City gates (Jdg 16:3).” ChatGPT 4’s answer is “Samson carried the gates of Gaza to the top of a hill facing Hebron,” which means the same thing but is more specific.
What questions did ChatGPT 3.5 answer incorrectly but ChatGPT 4 answered correctly? Below is an example.
What is the shortest book in the Bible?
True answer: 3 John
ChatGPT 3.5: Book of Obadiah, which has only one chapter.
ChatGPT 4: The shortest book in the Bible by word count is 3 John.
It appears that ChatGPT 3.5 confuses the Bible with the Old Testament.
So, if you have a ChatGPT subscription, it is worth switching to ChatGPT 4 for more accurate answers.
Google Gemini
Model tested: Google Gemini Pro 1.0
As of writing, Google is playing catch-up in building large language models. Based on my experience using Gemini, it’s no surprise it is trailing behind, scoring only 89.5%.
Here’s an obvious mistake:
Who did Abraham have his second son with?
True answer: Sarah.
Gemini: Hagar.
As a side note, Gemini’s API is the hardest to use among all tested. It threw safety errors when nothing was unsafe. (We are doing Bible trivia, after all!)
Anthropic Claude
Tested version: Anthropic Claude 3 Sonnet (20240229)
Anthropic is a respectable rival to OpenAI. It was started by former OpenAI employees over disagreements about AI safety.
The tested Claude 3 Sonnet model is the one you would access with their free online chat. It scored a respectable 93.5%, slightly lower than its OpenAI counterpart, ChatGPT 3.5.
Meta AI Llama 3
Llama 3 is an open-source AI model you can run locally on your machine. It’s ideal for people who have privacy concerns about how online companies are going to store and use their chat history.
Llama 3 models come in two sizes: 8B and 70B parameters. In practice, most Bible-studying folks can only run the 8B model locally due to hardware constraints, so I only tested the smaller 8B model.
Note that the 8B model is much smaller than ChatGPT 3.5. OpenAI has not published its size, but it is often assumed to be around 175B parameters, the size of GPT-3. So, you can expect a performance drop.
It scores 73.8% on the test. That’s not bad for its size, but you probably don’t want to trust what it says about the Bible.
Mistral
Model tested: Mistral 7B (03/23/2024 release)
Mistral AI is a strong player in the field of large language models. They have adopted model architectures such as mixture of experts to improve performance.
Their open model Mistral 7B is a bit smaller than Meta AI’s Llama 3 (8B). Surprisingly, it performs quite a bit better, answering 79.4% of the questions correctly.
While I wouldn’t call it usable for Bible study yet, with some further improvements, perhaps one day we can rely on local models.
Conclusions and recommendations
OpenAI models are the clear winners in this Bible trivia challenge, with Anthropic a close second. For the best accuracy, I recommend using ChatGPT 4 as your Bible study companion.
If you have to stick with free AI models, both ChatGPT 3.5 and Claude 3 are fine choices.
For now, don’t use local models like Llama 3 8B and Mistral 7B. They don’t perform well enough.
2 thoughts on “Which AI knows the Bible? Challenging ChatGPT, Gemini, Claude, Llama, Mistral with Bible trivia”
God bless Andy Wong for doing this research and sharing it with the world. Absolutely relevant to the Bible student.
Thank you!