Dummy words

Let's assume you are trying to build a binary classifier that separates documents about history from all other documents. From experience, you know that documents on historical topics usually contain a lot of dates, so you reason that a large number of dates might be a good indicator that a text is about history.

You could simply create a feature NUMBER_OF_DATES by matching typical date formats like DD.MM.YYYY or MM/DD/YYYY with regular expressions and counting the matches. This would give you a single feature. The problem with this approach is that you are not taking advantage of the entirety of the text and the full range of terms used, so you decide to use the typical text mining approach of creating a document-term matrix - giving you thousands of features, one for each term - and applying a dimension-reduction technique.
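The single-feature approach can be sketched as follows. This is a minimal illustration, not a robust date parser: the pattern only covers the two formats named above, and the names DATE_PATTERN and number_of_dates are made up for this example.

```python
import re

# Assumption: dates appear only as DD.MM.YYYY or MM/DD/YYYY.
# Real text will need more formats (e.g. "July 30, 1947").
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b|\b\d{2}/\d{2}/\d{4}\b")


def number_of_dates(text: str) -> int:
    """Return the NUMBER_OF_DATES feature: how many date-like strings occur."""
    return len(DATE_PATTERN.findall(text))


sample = "The treaty was signed on 30.07.1947 and ratified on 07/06/1948."
print(number_of_dates(sample))
```

Note that the pattern only checks the shape of the string, so something like 99.99.9999 would also be counted.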

Looking at your document-term matrix, you see many date terms like 30.07.1947, 06.07.1946 and 26.07.1967. Regarding these dates as distinct 'terms' will usually not be helpful. You don't really care about the information contained within each date, but only about the fact that one document contains many dates while another does not.

In situations like these, you can create 'dummy words'. You write a regular expression that matches all dates and replace each match with a dummy word like '_date_' or '_dd_mm_yyyy_'. Similarly, you could replace all numbers with '_number_', '_12345_' or any other placeholder. I suggest using a common format to indicate dummy words, for example leading and trailing underscores as I did above.
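A minimal sketch of this replacement step might look like the following. The two patterns and the function name insert_dummy_words are assumptions for illustration; the date pattern only covers DD.MM.YYYY and MM/DD/YYYY.

```python
import re

# Assumption: only these two date formats occur in the text.
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b|\b\d{2}/\d{2}/\d{4}\b")
# Assumption: a number is a run of digits, optionally with '.' or ',' groups.
NUMBER_PATTERN = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def insert_dummy_words(text: str) -> str:
    """Replace dates with '_date_' and remaining numbers with '_number_'."""
    # Dates must be replaced first; otherwise the number pattern
    # would swallow them and they would end up as '_number_'.
    text = DATE_PATTERN.sub("_date_", text)
    return NUMBER_PATTERN.sub("_number_", text)


print(insert_dummy_words("Born on 30.07.1947, he wrote 12 books."))
```

The order of the two substitutions matters here, because every date in these formats also looks like a number.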

You could go even further and use more than one dummy word. Maybe you want your binary classifier to identify only books about history before the 20th century. In this case, you could replace each date before the 20th century with both '_date_' and '_date_before_20th_century_', while replacing a date like '18.07.1967' only with '_date_'.
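One way to sketch this is to capture the year in the regular expression and choose the dummy words in a replacement function. The names below are hypothetical, the pattern assumes DD.MM.YYYY only, and the cutoff year of 1901 is my assumption for where the 20th century starts; adjust it to your own definition.

```python
import re

# Assumption: dates appear as DD.MM.YYYY; the year is captured in group 1.
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.(\d{4})\b")
# Assumption: the 20th century is taken to start in 1901.
FIRST_YEAR_OF_20TH_CENTURY = 1901


def date_to_dummy_words(match: re.Match) -> str:
    """Pick one or two dummy words depending on the year of the matched date."""
    year = int(match.group(1))
    if year < FIRST_YEAR_OF_20TH_CENTURY:
        return "_date_ _date_before_20th_century_"
    return "_date_"


def insert_date_dummy_words(text: str) -> str:
    """Replace every date with its dummy word(s)."""
    return DATE_PATTERN.sub(date_to_dummy_words, text)


print(insert_date_dummy_words("From 14.07.1789 to 18.07.1967."))
```

Because the two dummy words are separated by a space, a whitespace tokenizer will count them as two separate terms in the document-term matrix.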

Using a dummy word for dates makes all dates appear as the same term in the document-term matrix. After all, the main information in the document-term matrix is how many terms any pair of documents have in common and how specific each term is to any given document (see the TF-IDF measure[1]). If each date were counted as a separate term, it would be quite possible that there is no overlap, and any subsequent algorithm that clusters or classifies the documents would have difficulty recognizing the similarity between them. Also, many dimensionality reduction techniques identify topics by evaluating co-occurrences of terms. If every date is interpreted as a different term, it becomes much harder to recognize co-occurrences of other terms with dates.
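The effect on term overlap can be illustrated with a toy example. This sketch uses plain term counts from the standard library instead of a real document-term matrix or TF-IDF weighting, and the two documents and all function names are made up for illustration.

```python
import re
from collections import Counter

# Assumption: dates appear as DD.MM.YYYY only.
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")


def term_counts(text: str, use_dummy_words: bool) -> Counter:
    """Return one row of a document-term matrix as a term -> count mapping."""
    if use_dummy_words:
        text = DATE_PATTERN.sub("_date_", text)
    # Assumption: terms are separated by whitespace.
    return Counter(text.lower().split())


def shared_terms(doc_a: str, doc_b: str, use_dummy_words: bool) -> set:
    """Return the terms that occur in both documents."""
    counts_a = term_counts(doc_a, use_dummy_words)
    counts_b = term_counts(doc_b, use_dummy_words)
    return set(counts_a) & set(counts_b)


doc_a = "armistice signed 30.07.1947 then revised 06.07.1946"
doc_b = "coronation held 26.07.1967"

print(shared_terms(doc_a, doc_b, use_dummy_words=False))
print(shared_terms(doc_a, doc_b, use_dummy_words=True))
```

Without dummy words the two documents share no terms at all; with them, the shared term '_date_' appears, with a count of 2 in the first document and 1 in the second.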

Takeaways:

  • Different numbers and dates will be interpreted as separate terms in a document-term matrix. Use dummy words to avoid this.

  • You can also use dummy words to distinguish between different kinds of dates and numbers.

[1] https://en.wikipedia.org/wiki/tfidf
