Exploring Reinforcement Learning Through Human Feedback: A Deep Dive with Mathematical Derivations and PyTorch Implementation

Reinforcement Learning from Human Feedback

Reinforcement Learning (RL) is a type of machine learning in which an agent learns to make decisions by interacting with an environment. In traditional reinforcement learning, the agent receives rewards or penalties from the environment based on its actions. In some cases, however, human feedback can provide a more informative learning signal. In this article, we will explore how reinforcement learning from human feedback can be achieved, with mathematical derivations and PyTorch code.

Math Derivations

In reinforcement learning from human feedback, the goal is to train an agent to optimize a policy based on the feedback provided by a human. Let’s consider a simple scenario where the agent is trying to learn to play a game. In this case, the human provides feedback in the form of a binary signal (0 for a bad move, 1 for a good move).

Let’s denote the feedback provided by the human as y, and the action taken by the agent as a. Our goal is to learn a policy π(a|s) that maps states to actions. We can define a loss function that measures the mismatch between the action taken by the agent and the feedback provided by the human:

loss = -y * log(π(a|s))

By minimizing this loss function, the agent will learn to take actions that are more likely to receive positive feedback from the human.
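As written, this term only produces a learning signal when y = 1 (a good move). To also penalize moves that the human labels as bad (y = 0), it is commonly extended to the full binary cross-entropy, which is what the nn.BCELoss criterion used in the code below computes:

```latex
% Binary cross-entropy over the human feedback label y \in \{0, 1\}
\mathcal{L}(y, \pi) = -\Bigl[\, y \log \pi(a \mid s) \;+\; (1 - y) \log\bigl(1 - \pi(a \mid s)\bigr) \Bigr]
```

Minimizing this loss pushes π(a|s) toward 1 for actions that received positive feedback and toward 0 for actions that received negative feedback.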

PyTorch Code

Now, let’s implement reinforcement learning from human feedback using PyTorch. First, we need to define the neural network architecture for our agent:

```python
import torch
import torch.nn as nn
import torch.optim as optim

class Agent(nn.Module):
    def __init__(self, input_size, output_size):
        super(Agent, self).__init__()
        # A single linear layer maps the state to action scores
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        # Sigmoid squashes the scores into (0, 1), interpreted as the
        # probability that the action will receive positive feedback
        return torch.sigmoid(self.fc(x))
```
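As a quick sanity check, the network can be run on a dummy state. The dimensions below (a 4-dimensional state and a single output probability) are illustrative assumptions, not values fixed by the article:

```python
test_agent = Agent(input_size=4, output_size=1)   # illustrative dimensions
dummy_state = torch.randn(1, 4)                   # a batch containing one random state
print(test_agent(dummy_state))                    # prints a single value in (0, 1)
```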

Next, we can define the training loop:

```python
agent = Agent(input_size, output_size)
criterion = nn.BCELoss()                              # binary cross-entropy over the human feedback
optimizer = optim.Adam(agent.parameters(), lr=0.001)

for epoch in range(num_epoch):
    optimizer.zero_grad()                             # clear gradients from the previous step
    output = agent(state)                             # predicted probability of positive feedback
    loss = criterion(output, feedback)                # compare predictions against the human labels
    loss.backward()                                   # backpropagate the loss
    optimizer.step()                                  # update the agent's parameters
```

In this code snippet, state is a tensor holding the observed states of the environment, feedback is a tensor of the human's 0/1 labels, num_epoch is the number of training iterations, and input_size and output_size are the dimensions of the network's input and output, respectively. All of these must be defined before the loop runs.
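For completeness, here is one way those variables could be defined for a toy setup. The dimensions, batch size, and randomly generated labels are purely illustrative assumptions; in practice the states and feedback would come from actual gameplay and a human annotator:

```python
# Illustrative toy setup: 32 recorded states of dimension 4, each labelled 0 (bad) or 1 (good)
input_size, output_size = 4, 1
num_epoch = 100

state = torch.randn(32, input_size)              # batch of observed game states
feedback = torch.randint(0, 2, (32, 1)).float()  # human labels, shaped to match the network output
```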

By training the agent with this code, we learn a policy that favors actions likely to receive positive feedback from the human, improving the agent's performance in the game.
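Once training is done, the same network can be used to pick moves. Below is a minimal sketch, assuming each candidate move is encoded as a state vector and the network output is read as the probability of the move being labelled good; the randomly generated candidates are a stand-in, not part of the original article:

```python
candidate_states = torch.randn(5, input_size)    # 5 hypothetical candidate moves, encoded as state vectors
with torch.no_grad():                            # no gradients needed at inference time
    scores = agent(candidate_states)             # predicted probability of "good move" for each candidate
best_move = torch.argmax(scores).item()          # index of the highest-scoring candidate
print(f"Selected candidate move: {best_move}")
```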

Comments
@alexandrepeccaud9870
4 months ago

This is great

@SethuIyer95
4 months ago

So, to summarize:
1) We copy the LLM, fine-tune it a bit with an added linear layer, and use -log(sigmoid(good - bad)) to train the value function (in a broader context and with LLMs). We can do the same for the reward model.
2) We then have another copy of the LLM (the unfrozen model), the frozen LLM itself, and the reward model, and we try to match the logits similarly to the value function, while also keeping in mind the KL divergence from the frozen model.
3) We also add a bit of an exploration factor, so that the model can retain its creativity.
4) We then sample a list of trajectories, consider running rewards (not changing the past rewards), and compute the rewards, comparing them with the reward obtained when the most average action is taken, to get a sense of the gradient of increasing rewards with respect to trajectories.

In the end, we will have a model which is not so different from the original model but prioritizes trajectories with higher values.

@YKeon-ff4fw
4 months ago

Could you please explain why in the formula mentioned at the 39-minute mark in the bottom right corner of the video, the product operation ranges from t=0 to T-1, but after taking the logarithm and differentiating, the range of the summation becomes from t=0 to T? 🙂

@tk-og4yk
4 months ago

Amazing as always. I hope your channel keeps growing and more people learn from you. I am curious how we can use this optimized model to give it prompts and see what it comes up with. Any advice on how to do so?

@arijaa.9315
4 months ago

I cannot thank you enough! It is clear how much effort you put into such a high-quality explanation. Great explanation as usual!!

@nishantyadav6341
4 months ago

The fact that you dig deep into the algorithm and code sets you apart from the overflow of mediocre AI content online. I would pay to watch your videos, Umar. Thank you for putting out such amazing content.

@zhouwang2123
4 months ago

Thanks for your work and sharing, Umar! I learned new stuff from you again!
Btw, does the KL divergence play a similar role to the clipped ratio in preventing the new policy from drifting far away from the old one? Additionally, unlike actor-critic in RL, here it looks like the policy and value functions are updated simultaneously. Is this because of the partially shared architecture, and for computational efficiency?

@user-pe3mt1td6y
4 months ago

Amazing, you've released a new video!

@hamed8869
4 months ago

Gold Content

@bonsaintking
4 months ago

Hey, you are better than a prof! 🙂

@vardhan254
4 months ago

LETS GOOOOO

@Gasa7655
4 months ago

DPO Please

@user-di1sb3ji7w
4 months ago

Thank you for priceless lecture!!!

@sauravrao234
4 months ago

I literally wait with bated breath for your next video… a huge fan from India. Thank you for imparting your knowledge.

@amortalbeing
4 months ago

Thanks a lot man, keep up the great job.

@xingfang8507
4 months ago

You're the best!

@rohitjindal124
4 months ago

Thank you sir for making such amazing videos and helping students like me.

@user-hd7xp1qg3j
4 months ago

Legend is back

@shamaldesilva9533
4 months ago

Providing the math behind these algorithms in a clear way makes understanding them so much easier!! Thank you so much, Umar 🤩🤩