Agent: The agent is the software program that learns to make intelligent decisions, such as a program that learns to play chess.
Environment: The environment is the world of the agent. If we continue with the chess example, a chessboard is the environment where the agent plays chess.
State: A state is a position or a moment in the environment that the agent can be in. For example, each configuration of the pieces on the chessboard is a state.
Action: The agent interacts with the environment by performing an action, moving from one state to another; for example, moving a chess piece is an action.
Reward: A reward is a numerical value that the agent receives based on its action. Consider a reward as a point. For instance, an agent receives +1 point (reward) for a good action and -1 point (reward) for a bad action.
Action space: The set of all possible actions in the environment is called the action space. The action space is called a discrete action space when it consists of discrete actions, and a continuous action space when it consists of continuous actions.
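To make the distinction concrete, here is a minimal sketch using the OpenAI Gym library (an assumption; the same idea holds in any RL toolkit). It builds one discrete and one continuous action space and samples an action from each:

```python
import numpy as np
from gym import spaces

# Discrete action space: 4 distinct actions,
# e.g. the moves up/down/left/right in a grid world.
discrete_space = spaces.Discrete(4)
print(discrete_space.sample())  # a random integer in {0, 1, 2, 3}

# Continuous action space: actions are real-valued 2D vectors in [-1, 1]^2,
# e.g. the steering angle and acceleration of a car.
continuous_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
print(continuous_space.sample())  # a random vector with entries in [-1, 1]
```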
Policy: The agent makes decisions based on the policy. A policy tells the agent what action to perform in each state and can be considered the brain of the agent. A policy is called a deterministic policy if it maps each state to exactly one action. A stochastic policy, in contrast, maps each state to a probability distribution over the action space. The optimal policy is the one that yields the maximum expected return.
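The following minimal Python sketch (the state and action names are hypothetical) contrasts the two kinds of policy: a deterministic policy returns the same action for a given state every time, while a stochastic policy samples an action from a distribution:

```python
import numpy as np

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability distribution
# over the action space.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def select_action(policy, state):
    """Return an action: directly for a deterministic policy,
    by sampling for a stochastic one."""
    entry = policy[state]
    if isinstance(entry, dict):
        actions, probs = zip(*entry.items())
        return np.random.choice(actions, p=probs)
    return entry

print(select_action(deterministic_policy, "s0"))  # always 'left'
print(select_action(stochastic_policy, "s0"))     # 'left' about 80% of the time
```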
Episode: The agent-environment interaction from the initial state to the terminal state is called an episode. An episode is often called a trajectory or rollout.
Episodic and continuous tasks: An RL task is called an episodic task if it has a terminal state, and a continuous task if it does not have a terminal state.
Horizon: The horizon can be considered the agent’s lifespan, that is, the time step until which the agent interacts with the environment. The horizon is called a finite horizon if the agent-environment interaction stops at a particular time step, and an infinite horizon when the agent-environment interaction continues forever.
Return: Return is the sum of rewards received by the agent in an episode.
Discount factor: The discount factor controls how much importance we give to immediate rewards versus future rewards. Its value ranges from 0 to 1. A discount factor close to 0 implies that we give more importance to immediate rewards, while a discount factor close to 1 implies that we give more importance to future rewards than immediate rewards.
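Combining the last two definitions, the discounted return for an episode that ends at time step T is commonly written as follows (writing r_t for the reward received at step t and γ for the discount factor):

```latex
R = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots + \gamma^T r_T
  = \sum_{t=0}^{T} \gamma^t r_t, \qquad \gamma \in [0, 1]
```

Setting γ = 0 keeps only the immediate reward, while values of γ close to 1 weight future rewards almost as heavily as immediate ones.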
Value function: The value function, or the value of a state, is the expected return an agent would obtain starting from state s and following policy π.
Q function: The Q function, or the value of a state-action pair, is the expected return an agent would obtain starting from state s, performing action a, and then following policy π.
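In standard notation, with R denoting the return and the expectation taken over trajectories generated by the policy π, these two definitions read:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\big[\, R \mid s_0 = s \,\big]
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\, R \mid s_0 = s,\ a_0 = a \,\big]
```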
Model-based and model-free learning: When the agent learns the optimal policy using the model dynamics of the environment (its transition and reward probabilities), it is called model-based learning; when the agent learns the optimal policy without the model dynamics, it is called model-free learning.
Deterministic and stochastic environment: When an agent performs action a in state s and it reaches state s′ every time, then the environment is called a deterministic environment. When an agent performs action a in state s and it reaches different states every time based on some probability distribution, then the environment is called a stochastic environment.
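In symbols, writing f for a (hypothetical) deterministic transition function and P for the environment's transition probability distribution (the model dynamics mentioned above), the next state s′ in the two cases is given by:

```latex
\text{deterministic:}\quad s' = f(s, a)
\qquad
\text{stochastic:}\quad s' \sim P(s' \mid s, a)
```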