Text Data Analysis Methodologies
1. Tokenization
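The process of dividing text into a set of meaningful pieces (tokens). The nltk.tokenize module offers three common tokenizers: sent_tokenize (splits text into sentences), word_tokenize (splits text into words), and WordPunctTokenizer (splits text into words and separate punctuation tokens).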
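To make the idea concrete, here is a minimal standard-library sketch of sentence-level and word/punctuation-level splitting. It is an approximation for illustration only, not the NLTK implementation: the function names split_sentences and split_words_and_punct are hypothetical, and the sentence rule is deliberately naive (it would break on abbreviations such as "Dr.").

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words_and_punct(text):
    # A token is either a run of word characters or a run of
    # punctuation characters; whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]+", text)

text = "Are you curious? Tokenization isn't hard."
sentences = split_sentences(text)
tokens = split_words_and_punct(text)
print(sentences)
print(tokens)
```

Note how the punctuation-aware split breaks "isn't" into three tokens, which is roughly the kind of behavior that distinguishes a punctuation-based tokenizer from a word tokenizer.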
2. Stemming
A word can appear in various forms; stemming reduces these different forms to a common base form. NLTK provides three stemmers: PorterStemmer, LancasterStemmer, and SnowballStemmer. Lancaster is the strictest (most aggressive) of the three.
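The core idea of stemming can be sketched with a deliberately naive suffix stripper. This is not the Porter, Lancaster, or Snowball algorithm; the suffix list, the min_stem_len parameter, and the function name naive_stem are all assumptions made for illustration.

```python
# Suffixes are checked in order, so longer ones come first.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")

def naive_stem(word, min_stem_len=3):
    # Strip the first matching suffix, as long as the remaining
    # stem keeps at least min_stem_len characters.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(w) for w in ["play", "plays", "played", "playing"]]
print(stems)
```

A real stemmer applies many more rules (and a stricter one strips more aggressively), but the goal is the same: different surface forms collapse to one base form.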
3. Lemmatization
4. Chunking
Divides the input text into pieces, for example a fixed number of words per piece. Unlike tokenization, there are no constraints on the result: chunks do not need to be meaningful at all.
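A minimal sketch of chunking by word count, assuming whitespace-separated words; the function name chunk_words and the words_per_chunk parameter are hypothetical.

```python
def chunk_words(text, words_per_chunk):
    # Split on whitespace, then group consecutive words into
    # fixed-size chunks; the last chunk may be shorter.
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

chunks = chunk_words("the quick brown fox jumps over the lazy dog", 4)
print(chunks)
```

The chunk boundaries fall wherever the count says, regardless of sentence or phrase structure, which is why the chunks need not be meaningful.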
5. Bag-of-words model
When dealing with text documents that consist of millions of words, we need to convert them into numerical representations that are usable by machine learning algorithms. The bag-of-words model represents each document as a histogram of its words: it counts the number of occurrences of each word in the document and ignores word order. The counting can be done with scikit-learn.
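The histogram idea can be sketched with the standard library alone. This is a simplified illustration rather than the scikit-learn implementation; it assumes lowercase, whitespace-separated tokens, and the function name bag_of_words is hypothetical.

```python
from collections import Counter

def bag_of_words(documents):
    # Tokenize naively, build a sorted vocabulary over all documents,
    # then count how often each vocabulary word occurs per document.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Each row is one document and each column one vocabulary word, so "the cat saw the dog" and "the dog saw the cat" would produce identical rows: word order is lost.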
6. Text classifier
Sorts text documents into different classes based on the tf-idf (Term Frequency-Inverse Document Frequency) statistic, a technique used frequently in information retrieval.
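As a rough illustration of the statistic itself, the sketch below uses one common textbook definition: term frequency is the word count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries often use smoothed variants, so the exact numbers may differ; the function name tf_idf is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    # Assumes lowercase, whitespace-separated tokens.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    scores = []
    for doc in tokenized:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked"]
scores = tf_idf(docs)
print(scores[0])
```

A word that appears in every document (here "the") scores zero, while words specific to one document score higher, which is what makes the statistic useful as a classification feature.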
7. Identifying the gender of a name
Uses the names corpus to extract labeled names, then classifies the gender of a name based on its final part (the last few letters).
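The "final part of the name" feature can be sketched as follows. This is only an illustration under stated assumptions: the tiny toy_names list stands in for the real labeled corpus, a simple majority vote per suffix stands in for whatever classifier was actually trained, and the names suffix_feature, train, and predict are hypothetical.

```python
from collections import Counter, defaultdict

def suffix_feature(name, n_letters=2):
    # The feature is simply the last n_letters of the name, lowercased.
    return name[-n_letters:].lower()

def train(labeled_names, n_letters=2):
    # Count how often each suffix occurs with each label.
    counts = defaultdict(Counter)
    for name, gender in labeled_names:
        counts[suffix_feature(name, n_letters)][gender] += 1
    return counts

def predict(counts, name, n_letters=2, default="unknown"):
    # Majority vote among training names sharing the same suffix.
    seen = counts.get(suffix_feature(name, n_letters))
    return seen.most_common(1)[0][0] if seen else default

# Toy data for illustration only, not taken from the names corpus.
toy_names = [
    ("Anna", "female"), ("Hanna", "female"), ("Maria", "female"),
    ("John", "male"), ("Brandon", "male"), ("Gordon", "male"),
]
model = train(toy_names)
print(predict(model, "Joanna"), predict(model, "Landon"), predict(model, "Alex"))
```

Varying n_letters is a natural experiment here: it shows how much of the name's ending is needed before the predictions stop improving.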
8. Analyzing the Sentiment of a sentence
Determines whether a given piece of text is positive or negative. Sentiment analysis of text from large groups of people is also called opinion mining.
9. Identifying patterns in text using topic modeling
Uncovers the hidden thematic structure in a collection of documents, which helps to organize the documents in a better way. Topic modeling identifies the important words or themes in a document; these words tend to determine what the topic is about. Use the gensim library with latent Dirichlet allocation (LDA) to identify these patterns.
10. Parts of speech tagging
The process of labeling words with their corresponding lexical categories (nouns, verbs, adjectives, articles, pronouns, adverbs, conjunctions, and so on). Use the spaCy library to perform PoS tagging.
11. Word2Vec
A simple two-layer artificial neural network that captures the semantic and syntactic properties of words by constructing a vector space. Words that occur in the same linguistic contexts get vectors that lie close together, and words whose vectors are close are recognized as semantically similar.
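"Closeness" between word vectors is commonly measured with cosine similarity. The sketch below shows that measurement only; it does not train a Word2Vec model, and the three-dimensional toy_vectors are made-up values chosen for illustration, not real embeddings.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means the same
    # direction, 0.0 means orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors for illustration; real embeddings usually
# have hundreds of dimensions and are learned from a corpus.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(royal > fruit)
```

With trained embeddings, the same comparison is what lets a model report that two words used in similar contexts are semantically related.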