
Posts

Showing posts with the label nlp

[AI] Text Data Analysis Methodologies

1. Tokenization: the process of dividing text into a set of meaningful pieces. nltk.tokenize offers three common tools: sent_tokenize, word_tokenize, and WordPunctTokenizer.
2. Stemming: a word can appear in various forms; stemming reduces these different forms to a common base form. There are three stemmers: PorterStemmer, LancasterStemmer, and SnowballStemmer. Lancaster is the strictest.
3. Lemmatization: also reduces words to their base forms, but with a more structured, dictionary-based approach.
4. Chunking: divides the input text into pieces with no constraints; the chunks do not need to be meaningful at all.
5. Bag-of-words model: when dealing with text documents that contain millions of words, they must be converted into numerical representations that machine learning algorithms can use. The model represents each document as a histogram of all of its words: it counts the number of occurrences of each word in the document (for example with scikit-learn) and ignores word order. A short Python sketch of these steps follows this list. ...
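Below is a minimal sketch of these five steps, assuming NLTK (with the 'punkt' and 'wordnet' resources downloaded) and scikit-learn are installed; the sample text is invented and the chunk() helper is a hypothetical function, not part of NLTK.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# NLTK resources needed once: nltk.download('punkt'); nltk.download('wordnet')

text = "The children are playing outside. Dogs barked loudly."

# 1. Tokenization: split into sentences, then into words.
sentences = sent_tokenize(text)
words = word_tokenize(text)
punct_tokens = WordPunctTokenizer().tokenize(text)

# 2. Stemming: three stemmers; Lancaster is the most aggressive.
porter, lancaster, snowball = PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")
for w in ("playing", "barked", "loudly"):
    print(w, "->", porter.stem(w), lancaster.stem(w), snowball.stem(w))

# 3. Lemmatization: dictionary-based reduction to a base form, guided by part of speech.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("children"))          # noun by default -> 'child'
print(lemmatizer.lemmatize("playing", pos="v"))  # verb -> 'play'

# 4. Chunking in the simple sense above: fixed-size pieces with no other constraint.
# chunk() is a hypothetical helper written here for illustration.
def chunk(tokens, size):
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]
print(chunk(words, 4))

# 5. Bag-of-words: a word-count histogram per document; word order is ignored.
docs = ["Dogs barked loudly", "The children are playing with the dogs"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # vocabulary (scikit-learn >= 1.0)
print(counts.toarray())                    # one count vector per document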

[AI] Two Methodologies to Convert Text Data into Data Structures in NLP

Two methodologies convert text data into data structures (vectors and matrices):
👉  Bag of Words (BoW): evaluates the frequency of each word in a particular document. A sentence can be represented as a vector whose length equals the size of the vocabulary. CountVectorizer, from the Python library scikit-learn, conveniently helps in building a BoW model. Limitations of BoW: it works well for tasks or use cases with a limited vocabulary, but it does not scale efficiently to large vocabularies.
👉  TF-IDF vectors: an approach that weighs terms when vectorizing text and extracting features from it. TF accounts for how frequently a term occurs in a document; IDF gives more weight to terms that occur rarely across documents. TF-IDF is computationally fast, but it does not take into account co-occurrence of terms, semantics, or the context associated with terms.
Both methods use cosine similarity to evaluate how similar or dissimilar two text documents are, as sketched below.
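A minimal sketch of both vectorizers and the cosine comparison, assuming scikit-learn is installed; the three example documents are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "dogs bark and dogs play",
    "cats sleep while dogs play",
    "stock markets rose sharply today",
]

# BoW: raw term counts, one row per document, vector length = vocabulary size.
bow = CountVectorizer().fit_transform(docs)

# TF-IDF: term frequency weighted down for terms that appear in many documents.
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine similarity between document vectors: close to 1 means similar, 0 means unrelated.
print(cosine_similarity(bow))
print(cosine_similarity(tfidf))

With these inputs, the first two documents share terms and so score above zero against each other, while both score zero against the third, which has no terms in common with them.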