Dummy words


Let's assume you are trying to classify text with a binary classifier into documents about history and all other documents. From experience, you know that documents on historical topics usually contain a lot of dates, so you reason that a large number of dates might be a good indicator that a text is about history.

You could simply create a single feature NUMBER_OF_DATES by matching typical date formats like DD.MM.YYYY or MM/DD/YYYY with regular expressions and counting the matches. The problem with this approach is that it does not take advantage of the full text and the full range of terms used, so you decide to use the typical text mining approach instead: creating a document-term matrix - giving you thousands of features, one for each term - and applying a dimensionality-reduction technique.
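
As a minimal sketch of what the single NUMBER_OF_DATES feature could look like (assuming Python; the patterns below cover only the two formats mentioned, and a real corpus would need more):

```python
import re

# Only the two formats mentioned above; real-world data will need more patterns.
DATE_PATTERNS = [
    r"\b\d{2}\.\d{2}\.\d{4}\b",  # DD.MM.YYYY
    r"\b\d{2}/\d{2}/\d{4}\b",    # MM/DD/YYYY
]

def number_of_dates(text: str) -> int:
    """Count date-like substrings as a single numeric feature."""
    return sum(len(re.findall(pattern, text)) for pattern in DATE_PATTERNS)

print(number_of_dates("Born 30.07.1947; the war had ended on 05/08/1945."))  # 2
```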

Looking at your document-term matrix, you see many date terms like 30.07.1947, 06.07.1946 and 26.07.1967. Treating these dates as distinct 'terms' will usually not be helpful. You don't really care about the information contained within each date, but only about the fact that one document contains many dates while another does not.

In situations like these, you can create 'dummy words'. You write a regular expression that matches all dates and replace every match with a dummy word like '_date_' or '_dd_mm_yyyy_'. Similarly, you could replace all numbers with '_number_', '_12345_' or any other placeholder. I suggest using a common format to indicate dummy words, for example leading and trailing underscores as I did above.
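
A minimal sketch of this replacement, again assuming Python and the date formats from above:

```python
import re

DATE_RE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")
NUMBER_RE = re.compile(r"\b\d+\b")

def insert_dummy_words(text: str) -> str:
    # Replace dates before plain numbers, so the digits inside a
    # date are not picked up by the number pattern.
    text = DATE_RE.sub("_date_", text)
    text = NUMBER_RE.sub("_number_", text)
    return text

print(insert_dummy_words("On 30.07.1947, about 1200 ships arrived."))
# On _date_, about _number_ ships arrived.
```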

You could go even further and use more than one dummy word. Maybe you want your binary classifier to identify only books about history before the 20th century. In this case, you could replace dates before the 20th century with both '_date_' and '_date_before_20th_century_', while replacing a date like '18.07.1967' only with '_date_'.
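
One way to implement this two-level tagging (a sketch; the year-1900 cutoff and the helper name tag_date are my own choices for illustration):

```python
import re

DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")

def tag_date(match: re.Match) -> str:
    # Dates before 1900 get both dummy words; later dates only the generic one.
    if int(match.group(3)) < 1900:
        return "_date_ _date_before_20th_century_"
    return "_date_"

text = "The storming of the Bastille on 14.07.1789 is better known than 18.07.1967."
print(DATE_RE.sub(tag_date, text))
# The storming of the Bastille on _date_ _date_before_20th_century_ is better known than _date_.
```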

Using a dummy word for dates makes all dates appear as the same term in the document-term matrix. After all, the main information in the document-term matrix is how many terms any pair of documents have in common and how specific each term is to any given document (see the TF-IDF measure[1]). If each date were counted as a separate term, it would be quite possible that two documents full of dates share no terms at all, and any subsequent algorithm clustering or classifying the documents would have difficulty recognizing the similarity between them. Also, many dimensionality-reduction techniques identify topics by evaluating co-occurrences of terms, and if every date is interpreted as a different term, it becomes much harder to recognize co-occurrences of other terms with dates.
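
A toy illustration of the overlap argument, using a hand-rolled tokenizer instead of a full document-term matrix (the two sample sentences are made up):

```python
import re

DATE_RE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")

def terms(text: str, use_dummy_words: bool) -> set:
    if use_dummy_words:
        text = DATE_RE.sub("_date_", text)
    return set(re.findall(r"[\w.]+", text.lower()))

doc_a = "The treaty was signed on 30.07.1947"
doc_b = "On 26.07.1967 a treaty was signed"

for use_dummy_words in (False, True):
    shared = terms(doc_a, use_dummy_words) & terms(doc_b, use_dummy_words)
    print(use_dummy_words, sorted(shared))
# False ['on', 'signed', 'treaty', 'was']
# True ['_date_', 'on', 'signed', 'treaty', 'was']
```

With dummy words, the two documents share one additional term, '_date_', which is exactly the similarity the classifier should pick up on.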

Takeaways:

  • Different numbers and dates will be interpreted as separate terms in a document-term matrix; use dummy words to avoid this.

  • You can also use dummy words to distinguish between different kinds of dates and numbers.

[1] https://en.wikipedia.org/wiki/tfidf