Two methods for converting text data into data structures (vectors and matrices):

👉 Bag of Words (BoW): Counts how frequently each word occurs in a given document. A sentence is represented as a vector whose length equals the size of the vocabulary. CountVectorizer, a class in the Python library scikit-learn, is a convenient way to build a BoW model. Limitations of BoW: it works well for tasks or use cases with a limited vocabulary, but it does not scale efficiently to large vocabularies, because every vector grows with the vocabulary and becomes mostly zeros.

👉 TF-IDF vectors: An approach that weights terms while vectorizing text and extracting features from it. TF (term frequency) accounts for how frequently a term occurs in a document. IDF (inverse document frequency) does justice to terms that occur in only a few documents by giving them a higher weight. TF-IDF is computationally fast; however, it does not take into account the co-occurrence of terms, semantics, or the context associated with terms.

With both methods, cosine similarity is used to evaluate how similar or dissimilar text documents are.
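To make the ideas above concrete, here is a minimal sketch in plain Python that builds BoW and TF-IDF vectors for a tiny corpus and compares documents with cosine similarity. The corpus, the whitespace tokenizer, and all function names are made up for illustration. The IDF used here is the unsmoothed textbook form log(N / df); scikit-learn's CountVectorizer and TfidfVectorizer do the same job in practice, but TfidfVectorizer applies a smoothed IDF and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (illustrative only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace (an assumption).
    return text.lower().split()

tokenized = [tokenize(d) for d in docs]

# Vocabulary: every distinct word in the corpus, in a fixed order.
vocab = sorted({word for tokens in tokenized for word in tokens})

def bow_vector(tokens):
    # One count per vocabulary word, so the vector length equals len(vocab).
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

bow = [bow_vector(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each word appear?
n_docs = len(docs)
df = {word: sum(1 for tokens in tokenized if word in tokens) for word in vocab}

# Unsmoothed IDF: rarer words get a larger weight.
idf = {word: math.log(n_docs / df[word]) for word in vocab}

def tfidf_vector(tokens):
    # TF is the relative frequency of the word within the document.
    counts = Counter(tokens)
    total = len(tokens)
    return [(counts[word] / total) * idf[word] for word in vocab]

tfidf = [tfidf_vector(tokens) for tokens in tokenized]

def cosine(u, v):
    # Cosine of the angle between two vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print("BoW cosine, doc 0 vs doc 1:   ", cosine(bow[0], bow[1]))
print("TF-IDF cosine, doc 0 vs doc 1:", cosine(tfidf[0], tfidf[1]))
print("BoW cosine, doc 0 vs doc 2:   ", cosine(bow[0], bow[2]))
```

In this toy corpus the first two documents share the common words "the", "sat", and "on", so their BoW similarity is high. TF-IDF down-weights those shared words relative to the rarer ones ("cat", "mat", "dog", "log"), so the TF-IDF similarity comes out lower. The third document shares no tokens with the first, which also shows the limitation noted above: "cats" and "cat" are treated as unrelated words because neither method captures semantics.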
Think big, start small, move fast.