[AI] Text Data Analysis Methodologies

1. Tokenization

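Tokenization is the process of dividing text into a set of meaningful pieces (tokens). The nltk.tokenize module offers three commonly used tokenizers: sent_tokenize splits text into sentences, word_tokenize splits text into words, and WordPunctTokenizer splits text into words and puts punctuation into separate tokens.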
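To make the three granularities concrete, here is a minimal sketch that uses only the standard-library re module instead of NLTK. The function names are hypothetical and only mirror the idea of each tokenizer; the regular expressions are simplifying assumptions, so real NLTK output can differ (for example around abbreviations and contractions).

```python
import re

def sentence_tokenize(text):
    # Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace.
    # This breaks on abbreviations such as "Dr." (assumption for brevity).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(text):
    # Runs of word characters, optionally with one internal apostrophe ("isn't").
    # Punctuation is dropped.
    return re.findall(r"\w+(?:'\w+)?", text)

def word_punct_tokenize(text):
    # Alternating runs of word characters and runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)

text = "Are you curious? Tokenization isn't hard. Let's try it!"
print(sentence_tokenize(text))
print(word_tokenize("Tokenization isn't hard."))
print(word_punct_tokenize("isn't hard."))
```

Note how the last function splits "isn't" into three tokens, because the apostrophe counts as punctuation.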

2. Stemming

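A word can appear in various forms (for example play, plays, played, playing). Stemming reduces these different forms to a common base form, usually by heuristically cutting off word endings, so the result is not always a real word. NLTK provides three stemmers: PorterStemmer, LancasterStemmer, and SnowballStemmer. Lancaster is the strictest (most aggressive): it trims the most and can produce stems that are hard to recognize.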
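The idea of cutting off endings can be illustrated with a deliberately naive suffix stripper. This is a toy sketch, not the Porter, Lancaster, or Snowball algorithm; the suffix list and the minimum stem length are assumptions chosen for the example.

```python
# Toy suffix list (assumption); real stemmers use many ordered rewrite rules.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["plays", "played", "playing", "quickly", "running", "is"]:
    print(w, "->", naive_stem(w))
```

"running" becomes "runn", which shows why a stem does not have to be a dictionary word.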

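3. Lemmatization

Lemmatization also reduces words to their base forms, but with a more structured approach: it uses a vocabulary and morphological analysis of the word, so the result (the lemma) is a real dictionary word. For example, a stemmer may cut "wolves" to "wolv", while a lemmatizer returns "wolf".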

4. Chunking

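Chunking divides the input text into pieces for further analysis. Unlike tokenization, there are no constraints on the pieces: the chunks do not need to be meaningful at all. A simple approach is to split the text into chunks of a fixed number of words.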
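A minimal sketch of fixed-size chunking, assuming whitespace-separated words; the function name and the chunk size are hypothetical.

```python
def chunk_words(text, chunk_size):
    """Split text into chunks of chunk_size words; the last chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

print(chunk_words("one two three four five", 2))
```

The chunk boundaries ignore sentence structure entirely, which is exactly the "no constraints" property described above.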

5. Bag-of-words model

Text documents can consist of millions of words, so they need to be converted into numerical representations that machine learning algorithms can use. The bag-of-words model represents each document as a histogram of word counts: it builds a vocabulary from all the documents, then counts the number of occurrences of each vocabulary word in each document, ignoring word order. scikit-learn can be used to build this document-term matrix.
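The document-term matrix can be sketched with the standard library alone. This is an illustration of the model, not scikit-learn's implementation (its CountVectorizer applies its own tokenization rules); lowercasing and whitespace splitting are assumptions made here.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in documents[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, so documents of different lengths become vectors of equal size.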

6. Text classifier

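A text classifier sorts text documents into different classes. It can be built on tf-idf (Term Frequency-Inverse Document Frequency), a statistic that is used frequently in information retrieval. tf-idf weights a word by how often it occurs in a document, discounted by how many documents in the collection contain it, so words that appear everywhere carry little weight.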
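The statistic itself can be sketched as follows. This assumes one common textbook definition, tf = count / document length and idf = log(N / document frequency); libraries such as scikit-learn use smoothed variants, so their numbers will differ. The function name is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: tf-idf score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

"the" occurs in every document, so its score is zero, while "dog" (one document) scores higher than "cat" (two documents). A classifier would take these weighted vectors as input features.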

7. Identifying the gender of a name

We use the names corpus to extract labeled names, then classify the gender of a name based on its final letters.
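A minimal sketch of the suffix-feature idea. The tiny training list is hypothetical and stands in for a real labeled corpus, and a simple majority vote per suffix stands in for whatever classifier is actually trained on the feature; all function names are assumptions.

```python
from collections import Counter, defaultdict

def last_letters(name, n=2):
    # Feature: the final n letters of the name, lowercased.
    return name[-n:].lower()

def train(labeled_names, n=2):
    """Count how often each suffix occurs with each label."""
    counts = defaultdict(Counter)
    for name, gender in labeled_names:
        counts[last_letters(name, n)][gender] += 1
    return counts

def predict(counts, name, n=2, default="unknown"):
    """Return the most frequent label seen for the name's suffix."""
    suffix_counts = counts.get(last_letters(name, n))
    if not suffix_counts:
        return default
    return suffix_counts.most_common(1)[0][0]

# Hypothetical toy data, for illustration only.
labeled = [
    ("Anna", "female"), ("Hanna", "female"), ("Joanna", "female"),
    ("Maria", "female"), ("Sofia", "female"),
    ("Mark", "male"), ("Clark", "male"), ("Ethan", "male"), ("Nathan", "male"),
]
model = train(labeled)
for name in ["Susanna", "Jonathan", "Olivia", "Zed"]:
    print(name, "->", predict(model, name))
```

Varying n (the number of final letters) is an easy way to see how much of the name the feature needs to capture.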

8. Analyzing the sentiment of a sentence

Sentiment analysis determines whether a given piece of text is positive or negative. Sentiment analysis over the opinions of large groups of people is also called opinion mining.
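As a rough illustration, sentiment can be scored by counting words from two small word lists. This lexicon approach is a stand-in chosen for brevity, not a trained classifier; the word lists are hypothetical, and the extra "neutral" label for ties is an assumption.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "wonderful"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "boring"}

def sentiment(sentence):
    """Label a sentence by (positive word count) - (negative word count)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this wonderful movie!"))
print(sentiment("The plot was boring and the acting terrible."))
print(sentiment("It is a movie."))
```

A word-list scorer ignores negation ("not good") and context, which is one reason practical systems learn from labeled examples instead.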

9. Identifying patterns in text using topic modeling

Topic modeling uncovers the hidden thematic structure in a collection of documents, which helps us organize the documents in a better way. It works by identifying the important words or themes in a document; these words tend to determine what the topic is about. Latent Dirichlet allocation (LDA) is a common algorithm for topic modeling, and the gensim library provides an implementation.
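For reference, the generative model that LDA assumes can be written as below. The symbols are the standard ones from the LDA literature, not from these notes: \alpha and \beta are Dirichlet hyperparameters, d indexes documents, k indexes topics, and n indexes word positions.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word in } d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{align*}
```

Fitting LDA means inferring \theta and \phi from the observed words w; the highest-probability words in each \phi_k are the "important words" that describe topic k.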

10. Part-of-speech (PoS) tagging

PoS tagging is the process of labeling each word with its lexical category (noun, verb, adjective, article, pronoun, adverb, conjunction, and so on). The spaCy library can be used to perform PoS tagging.

11. Word2Vec

Word2Vec is a simple two-layer artificial neural network that learns the semantic and syntactic relationships between words by constructing a vector space. Words that occur in the same linguistic contexts get vectors that are close together, and words whose vectors are close are treated as semantically similar.
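"Close" is usually measured with cosine similarity. The sketch below shows only that comparison step, not Word2Vec training; the 3-dimensional vectors are hypothetical values picked by hand, whereas a trained model would produce vectors with many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, for illustration only.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["apple"]))
```

With these toy values, "king" is far more similar to "queen" than to "apple", which is the kind of relationship a trained model is expected to capture.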


Vu's Blog
