Agent: The agent is the software program that learns to make intelligent decisions, such as a program that learns to play chess.
Environment: The environment is the world of the agent. If we continue with the chess example, a chessboard is the environment where the agent plays chess.
State: A state is a position or a moment in the environment that the agent can be in. For example, each configuration of the pieces on the chessboard is a state.
Action: The agent interacts with the environment by performing an action, moving from one state to another; for example, moving a chess piece is an action.
Reward: A reward is a numerical value that the agent receives based on its action. Consider a reward as a point. For instance, an agent receives +1 point (reward) for a good action and -1 point (reward) for a bad action.
Action space: The set of all possible actions in the environment is called the action space. The action space is called a discrete action space when it consists of discrete actions, and a continuous action space when it consists of continuous actions.
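To make the distinction concrete, here is a minimal sketch using the OpenAI Gym library (an assumption; the same idea holds in any RL toolkit). It builds one discrete and one continuous action space and samples an action from each:

```python
import numpy as np
from gym import spaces

# Discrete action space: 4 distinct actions,
# e.g. the moves up/down/left/right in a grid world.
discrete_space = spaces.Discrete(4)
print(discrete_space.sample())  # a random integer in {0, 1, 2, 3}

# Continuous action space: actions are real-valued 2D vectors in [-1, 1]^2,
# e.g. the steering angle and acceleration of a car.
continuous_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
print(continuous_space.sample())  # a random vector with entries in [-1, 1]
```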
Policy: The agent makes decisions based on the policy. A policy tells the agent what action to perform in each state and can be considered the brain of the agent. A policy is called a deterministic policy if it maps each state to exactly one action. A stochastic policy, in contrast, maps each state to a probability distribution over the action space. The optimal policy is the one that yields the maximum expected return.
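The following minimal Python sketch (the state and action names are hypothetical) contrasts the two kinds of policy: a deterministic policy returns the same action for a given state every time, while a stochastic policy samples an action from a distribution:

```python
import numpy as np

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability distribution
# over the action space.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def select_action(policy, state):
    """Return an action: directly for a deterministic policy,
    by sampling for a stochastic one."""
    entry = policy[state]
    if isinstance(entry, dict):
        actions, probs = zip(*entry.items())
        return np.random.choice(actions, p=probs)
    return entry

print(select_action(deterministic_policy, "s0"))  # always 'left'
print(select_action(stochastic_policy, "s0"))     # 'left' about 80% of the time
```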
Episode: The agent-environment interaction from the initial state to the terminal state is called an episode. An episode is often called a trajectory or rollout.
Episodic and continuous tasks: An RL task is called an episodic task if it has a terminal state, and a continuous task if it does not have a terminal state.
Horizon: The horizon can be considered the agent’s lifespan, that is, the time step until which the agent interacts with the environment. The horizon is called a finite horizon if the agent-environment interaction stops at a particular time step, and an infinite horizon when the agent-environment interaction continues forever.
Return: Return is the sum of rewards received by the agent in an episode.
Discount factor: The discount factor controls how much importance we give to immediate rewards versus future rewards. Its value ranges from 0 to 1. A discount factor close to 0 implies that we give more importance to immediate rewards, while a discount factor close to 1 implies that we give more importance to future rewards than immediate rewards.
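Combining the last two definitions, the discounted return for an episode that ends at time step T is commonly written as follows (writing r_t for the reward received at step t and γ for the discount factor):

```latex
R = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots + \gamma^T r_T
  = \sum_{t=0}^{T} \gamma^t r_t, \qquad \gamma \in [0, 1]
```

Setting γ = 0 keeps only the immediate reward, while values of γ close to 1 weight future rewards almost as heavily as immediate ones.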
Value function: The value function, or the value of a state, is the expected return an agent would obtain starting from state s and following policy π.
Q function: The Q function, or the value of a state-action pair, is the expected return an agent would obtain starting from state s, performing action a, and then following policy π.
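In standard notation, with R denoting the return and the expectation taken over trajectories generated by the policy π, these two definitions read:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\big[\, R \mid s_0 = s \,\big]
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\, R \mid s_0 = s,\ a_0 = a \,\big]
```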
Model-based and model-free learning: When the agent learns the optimal policy using the model dynamics of the environment (its transition and reward probabilities), it is called model-based learning; when the agent learns the optimal policy without the model dynamics, it is called model-free learning.
Deterministic and stochastic environment: When an agent performs action a in state s and it reaches state s′ every time, then the environment is called a deterministic environment. When an agent performs action a in state s and it reaches different states every time based on some probability distribution, then the environment is called a stochastic environment.
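In symbols, writing f for a (hypothetical) deterministic transition function and P for the environment's transition probability distribution (the model dynamics mentioned above), the next state s′ in the two cases is given by:

```latex
\text{deterministic:}\quad s' = f(s, a)
\qquad
\text{stochastic:}\quad s' \sim P(s' \mid s, a)
```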