How DeepSeek R1 Was Trained: A Simple Guide

1/22/2025 · 1 min read

DeepSeek AI has released a new AI model called DeepSeek R1 that's really good at working through complex, multi-step problems such as math and coding tasks. Let's break down how they built it in simple terms.

What's New About Their Approach?

DeepSeek trained R1 with a reinforcement learning method called Group Relative Policy Optimization (GRPO), which the team first described in its earlier DeepSeekMath work.

Think of GRPO like a teacher who:

  • Gives the same question to multiple students

  • Looks at all their answers

  • Compares them to find out which approaches worked best

  • Uses this information to help everyone improve

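This is different from older methods such as PPO, which train a second "critic" model to judge how good each answer is. GRPO skips that extra model and uses the group's average score as the baseline instead, which makes it simpler and lighter on computer memory while still being effective.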
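To make the "compare the students" idea concrete, here is a minimal sketch of how each answer could be scored relative to its group. The function name, the example rewards, and the use of the population standard deviation are assumptions for illustration, not DeepSeek's actual code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each answer relative to the other answers to the same question.

    Answers above the group average get a positive value, answers below it
    get a negative one. No separate critic model is needed.
    """
    baseline = mean(rewards)    # the group's average score is the baseline
    spread = pstdev(rewards)    # assumption: population standard deviation
    return [(r - baseline) / (spread + eps) for r in rewards]

# Hypothetical example: four sampled answers to one question,
# where 1.0 means correct and 0.0 means wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

The correct answers end up with a positive advantage and the wrong ones with a negative advantage, so training nudges the model toward the approaches that worked. If every answer in the group scores the same, all advantages are zero and that question contributes no learning signal.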

The Training Process

DeepSeek trained their AI in four main steps:

Step 1: Learning from Examples

They started by showing their AI lots of well-written solutions to problems. This is like giving a student good examples to learn from.

Step 2: Practice Makes Perfect

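They then had the AI solve lots of math and coding problems. Because these problems have checkable answers, the AI could get automatic feedback on each attempt and learn from its mistakes. They also rewarded the AI for sticking to one human language (for example, all English) within an answer, so it would not mix languages mid-solution.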
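The feedback in this step can be pictured as a simple scoring function. The sketch below is a toy version under loud assumptions: exact string matching stands in for real answer checking, an ASCII test is a crude stand-in for detecting English, and the `language_weight` value is made up for the example.

```python
def accuracy_reward(answer: str, reference: str) -> float:
    """1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def language_consistency(text: str) -> float:
    """Fraction of words that are pure ASCII (crude proxy for 'all English')."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w.isascii() for w in words) / len(words)

def total_reward(reasoning: str, answer: str, reference: str,
                 language_weight: float = 0.1) -> float:
    """Combine correctness with a small bonus for staying in one language.

    language_weight is a hypothetical value chosen for this example.
    """
    return (accuracy_reward(answer, reference)
            + language_weight * language_consistency(reasoning))

# A correct answer whose reasoning stays in one language
clean = total_reward("first add 2 and 2", "4", "4")
# The same correct answer with mixed-language reasoning scores slightly lower
mixed = total_reward("先 add 2 和 2", "4", "4")
```

A real checker would need to be more careful (parsing the final answer, running code against tests, using a proper language detector), but the shape is the same: a number that goes up when the answer is right and the reasoning stays in one language.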

Step 3: Creating More Practice Material

The team created a huge collection of practice problems. They used their AI to generate new examples and had another AI check if these examples were good enough to use for training.
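The generate-then-check loop described above might look roughly like this. It is a sketch only: `build_training_set`, `generate`, `judge`, and the threshold are hypothetical names and values, and the toy stand-ins at the bottom replace the two real models.

```python
import itertools

def build_training_set(prompts, generate, judge,
                       samples_per_prompt=4, min_score=0.5):
    """Keep only generated examples that the checker scores highly enough.

    generate(prompt) -> candidate solution (stands in for the main AI)
    judge(prompt, candidate) -> score from 0 to 1 (stands in for the checking AI)
    """
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            candidate = generate(prompt)
            if judge(prompt, candidate) >= min_score:
                kept.append((prompt, candidate))
    return kept

# Toy stand-ins so the sketch runs without any model:
# the "generator" alternates between a right and a wrong answer.
_answers = itertools.cycle(["4", "5"])

def toy_generate(prompt):
    return next(_answers)

def toy_judge(prompt, candidate):
    return 1.0 if candidate == "4" else 0.0

dataset = build_training_set(["What is 2 + 2?"], toy_generate, toy_judge)
```

Here four candidates are generated, the two wrong ones are discarded, and only the examples that pass the check make it into the practice collection.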
