Text mining basics

While numerical and categorical data are relatively easy to work with, text data has always been a challenge. Most methods require the unstructured text to be converted into a structured format first. There are two common approaches for this:

First, the feature extraction approach: You use your domain knowledge to extract features directly from the text. For example, if your goal is to figure out which sections of the Harry Potter books focus on the character Albus Dumbledore, you may want to count the appearances of terms like 'albus dumbledore', 'professor dumbledore', 'dumbledore', 'albus' and 'headmaster' within each section. This works well if you are interested in very specific, clearly defined features that can be captured with a limited number of different terms.
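The counting idea above can be sketched in a few lines of Python. This is only a minimal illustration: the sample sections are made-up sentences (not quotes from the books), and the names TERMS and count_mentions are my own. One design choice worth noting is that longer phrases are listed first, so that 'albus dumbledore' is counted once and not a second and third time as 'albus' and 'dumbledore'.

```python
import re

# Terms from the example above. Longer phrases come first so that
# 'albus dumbledore' is not also counted as 'albus' and 'dumbledore'.
TERMS = ["albus dumbledore", "professor dumbledore", "dumbledore", "albus", "headmaster"]

# One alternation pattern; the regex engine tries alternatives left to right,
# so the longer phrases win. \b restricts matches to word boundaries.
PATTERN = re.compile(r"\b(" + "|".join(re.escape(t) for t in TERMS) + r")\b")

def count_mentions(section: str) -> int:
    """Count non-overlapping occurrences of any term in one section."""
    return len(PATTERN.findall(section.lower()))

# Made-up sample sections, for illustration only.
sections = [
    "Professor Dumbledore smiled. The headmaster said nothing more.",
    "Harry and Ron went to the Quidditch pitch.",
]
counts = [count_mentions(s) for s in sections]
print(counts)
```

The resulting count per section is exactly the kind of hand-crafted feature described above and can be used like any other numerical column.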

Other features that one could directly extract from the text are the mean and standard deviation of word and sentence length, or the ratio of punctuation characters, digits or dates per sentence. A feature that has traditionally been used in email spam detection is the ratio of upper case to lower case letters. A high proportion of upper case characters is typically a good indicator of spam.
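A few of these surface features are easy to compute with the standard library. The following is a rough sketch, not a reference implementation: the function name surface_features, the simple word pattern and the two sample messages are assumptions made for illustration.

```python
import re
from statistics import mean, pstdev

def surface_features(text: str) -> dict:
    """Compute a few simple surface features of a text."""
    # Assumption: a 'word' is a run of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    lengths = [len(w) for w in words]
    upper = sum(c.isupper() for c in text)
    lower = sum(c.islower() for c in text)
    return {
        "mean_word_length": mean(lengths) if lengths else 0.0,
        "std_word_length": pstdev(lengths) if lengths else 0.0,
        # max(lower, 1) guards against division by zero for all-caps text.
        "upper_to_lower_ratio": upper / max(lower, 1),
    }

# Made-up sample messages, for illustration only.
features_spam = surface_features("WIN a FREE PRIZE NOW")
features_ham = surface_features("See you at the meeting tomorrow")
print(features_spam["upper_to_lower_ratio"], features_ham["upper_to_lower_ratio"])
```

In this toy comparison the all-caps message gets a far higher ratio than the ordinary one, which is the signal a spam filter would pick up on.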

The second approach uses the so-called document-term matrix: You create a dictionary of all words that occur in your dataset and build a matrix where each row represents a document, e.g. a news article, a product description or a section/chapter of a book. Each column corresponds to one of the terms from the dictionary. If document i contains word j, the value of the matrix at position (i, j) is 1, else 0. Often, a filter condition is applied to the dictionary to remove all terms with low frequency. This takes out some of the 'noise', such as typos and other spurious terms. It also has the convenient side effect of greatly reducing the dimension of the document-term matrix, making it much easier to handle.
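To make the construction concrete, here is a minimal sketch of a binary document-term matrix in plain Python. The three toy documents, the tokenizer and the threshold MIN_DOC_FREQ are assumptions chosen for illustration. I also assume that 'low frequency' means the number of documents a term occurs in; other definitions are possible.

```python
import re
from collections import Counter

# Toy documents; 'dgo' is a deliberate typo to show the effect of the filter.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dgo",
]

def tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

# The set of distinct terms in each document.
token_sets = [set(tokenize(d)) for d in docs]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in token_sets for term in tokens)

MIN_DOC_FREQ = 2  # assumed threshold; terms in fewer documents are dropped
vocabulary = sorted(t for t, n in doc_freq.items() if n >= MIN_DOC_FREQ)

# matrix[i][j] is 1 if document i contains vocabulary[j], else 0.
matrix = [[1 if term in tokens else 0 for term in vocabulary] for tokens in token_sets]

print(vocabulary)
for row in matrix:
    print(row)
```

With this threshold, the typo 'dgo' and every other term that occurs in only one document are dropped, so the matrix has 4 columns instead of 9.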

The document-term matrix, even with the low-frequency terms removed, will be very sparse, i.e. it will contain mostly zeros and very few ones. Most software packages for text mining exploit this fact and store it as a sparse matrix to save memory and speed up subsequent computations.
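The idea behind sparse storage can be shown without any library: instead of storing every cell, you only remember where the ones are. The sketch below uses a small made-up matrix and a hypothetical helper lookup. Real packages use more compact formats, so take this as an illustration of the principle only.

```python
# A small made-up binary document-term matrix (3 documents, 6 terms).
dense = [
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]

# Sparse form: per row, keep only the column indices of the non-zero entries.
sparse_rows = [{j for j, value in enumerate(row) if value} for row in dense]

def lookup(i, j):
    """Return the matrix value at (i, j) from the sparse representation."""
    return 1 if j in sparse_rows[i] else 0

stored = sum(len(row) for row in sparse_rows)  # number of entries kept
total = len(dense) * len(dense[0])             # number of cells in the dense form
print(f"{stored} of {total} cells stored")
```

Here only 3 of 18 cells need to be stored. For a real corpus with thousands of terms the savings are far larger, because each document uses only a tiny fraction of the dictionary.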

In the document-term matrix, the occurrence of each word is a binary feature. Since this results in thousands of features, dimensionality reduction techniques are usually applied. A large number of words is reduced to a significantly smaller number of 'topics', and each word receives a score that indicates how strongly it is related to each topic.
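To give a feeling for what such word scores look like, the following toy sketch extracts a single 'topic' from a tiny made-up matrix. It uses power iteration to approximate the first right singular vector of the matrix, which is one possible technique among several and my choice for this illustration, not a recommendation. The function name first_topic and the data are assumptions; real projects would rely on a library and extract many topics at once.

```python
import math

# Made-up vocabulary and binary document-term matrix (4 documents, 5 terms).
vocabulary = ["ball", "goal", "team", "vote", "party"]
matrix = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
]

def first_topic(matrix, iterations=100):
    """Approximate the dominant right singular vector by power iteration."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    v = [1.0] * n_terms  # start vector: all terms weighted equally
    for _ in range(iterations):
        # u = A v gives one score per document ...
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # ... and w = A^T u maps the document scores back to terms.
        w = [sum(matrix[i][j] * u[i] for i in range(n_docs)) for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # rescale to unit length
    return v

scores = dict(zip(vocabulary, first_topic(matrix)))
for term, score in scores.items():
    print(f"{term}: {score:.3f}")
```

In this example the dominant topic is carried by 'ball', 'goal' and 'team', with 'goal' scoring highest because it appears in the most documents, while 'vote' and 'party' receive scores close to zero.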

I don't want to go any deeper into text mining methods at this point because it is beyond the scope of this book. But I need to emphasize that data preprocessing is just as important for text data as it is for every other type of data. You can always create a document-term matrix for a set of documents and a dictionary, but if your data preprocessing sucked, so will your results.

Takeaways:

  • In text mining, features are generated using either an explicit and conscious feature extraction approach or a more general document-term matrix approach

  • Ensuring data quality standards is just as important in text mining as it is in any other form of data analysis
