Text Data Analysis Methodologies

1. Tokenization: the process of dividing text into meaningful pieces. nltk.tokenize provides three common tools: sent_tokenize, word_tokenize, and WordPunctTokenizer.

2. Stemming: a word can appear in many different forms; stemming reduces those forms to a common base form. There are three stemmers: PorterStemmer, LancasterStemmer, and SnowballStemmer. LancasterStemmer is the strictest.

3. Lemmatization: also reduces words to their base forms, but takes a more structured approach than stemming.

4. Chunking: divide the input text into pieces with no constraints; the chunks do not need to be meaningful at all.

5. Bag-of-words model: text documents containing millions of words must be converted into numerical representations that machine learning algorithms can use. The model represents each document as a histogram of its words, counting the number of occurrences of each word (for example, with scikit-learn) while ignoring word order.

Short Python sketches of each step follow the list.

...
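A minimal sketch of the three tokenizers from item 1, assuming NLTK is installed and the punkt tokenizer data has been downloaded (nltk.download('punkt')); the sample sentence is invented for illustration:

```python
from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer

text = "Don't hesitate. Tokenization splits text into pieces!"

# Split into sentences, then into word tokens.
print(sent_tokenize(text))   # ["Don't hesitate.", 'Tokenization splits text into pieces!']
print(word_tokenize(text))   # ['Do', "n't", 'hesitate', '.', 'Tokenization', ...]

# WordPunctTokenizer also splits on punctuation: "Don't" -> ['Don', "'", 't']
print(WordPunctTokenizer().tokenize(text))
```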
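A side-by-side comparison of the three stemmers from item 2, assuming NLTK is installed; the word list is arbitrary:

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ['table', 'probably', 'wolves', 'playing', 'is']

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

for word in words:
    # Lancaster is the most aggressive: it tends to produce the shortest stems.
    print(word, '->', porter.stem(word), lancaster.stem(word), snowball.stem(word))
```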
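A lemmatization sketch using NLTK's WordNetLemmatizer, assuming the wordnet corpus has been downloaded (nltk.download('wordnet')); unlike a stemmer, it returns valid dictionary words:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# pos='n' (noun) is the default; pos='v' lemmatizes the word as a verb.
print(lemmatizer.lemmatize('wolves'))            # wolf
print(lemmatizer.lemmatize('running', pos='v'))  # run
# A stemmer would chop 'wolves' to 'wolv'; the lemmatizer returns a real word.
```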
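Chunking, as item 4 describes it, simply splits text into pieces with no constraints. A minimal sketch; the chunker helper and the chunk size of 4 words are hypothetical choices for illustration:

```python
def chunker(input_data, num_words):
    """Split input text into chunks of num_words words each (hypothetical helper)."""
    words = input_data.split(' ')
    output, cur_chunk = [], []
    for word in words:
        cur_chunk.append(word)
        if len(cur_chunk) == num_words:
            output.append(' '.join(cur_chunk))
            cur_chunk = []
    if cur_chunk:  # keep the final, possibly shorter, chunk
        output.append(' '.join(cur_chunk))
    return output

text = 'The chunks produced this way do not need to be meaningful at all'
print(chunker(text, 4))
```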
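A bag-of-words sketch with scikit-learn's CountVectorizer, which builds the per-document word histogram while ignoring word order; the two toy documents are invented, and it assumes scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    'the cat sat on the mat',
    'the dog sat on the log',
]

vectorizer = CountVectorizer()
# Each row is one document; each column counts one vocabulary word.
counts = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # vocabulary (get_feature_names in older scikit-learn)
print(counts.toarray())                    # word-count histogram per document
```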
Think big, start small, move fast.