A Beginner's Guide to Clean Data

Lookalike characters

Last updated 4 years ago

In data, not everything is what it looks like. This is a lesson I learned the hard way when I ventured into text mining. When you have large amounts of text data, expect all kinds of exotic characters to appear, especially if the data is scraped from the web. A particularly tricky kind is the lookalike: a character that looks like another character but is actually a different one, with its own code point. Here are a few common examples.

The hyphen - is a punctuation mark used in several different situations, most often to break words into parts or to join separate words into a single one, as in twentieth-century writers. Not much can go wrong with such a simple character, right? Wrong!

Unicode has at least a dozen characters that look like the hyphen. There is the hyphen-minus on your keyboard (U+002D), the regular hyphen (U+2010), the non-breaking hyphen (U+2011), the minus sign (U+2212), the small hyphen-minus (U+FE63), the fullwidth hyphen-minus (U+FF0D), the figure dash (U+2012), which has the same width as a digit, the en dash (U+2013), which indicates a range of values, and many other similar-looking characters.
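To check that these really are distinct characters, you can ask Python's standard `unicodedata` module for the official name of each code point. This is a minimal sketch; the list `HYPHEN_LIKE` and the helper `describe` are names made up for this example and contain only the code points mentioned above.

```python
import unicodedata

# The hyphen lookalikes named above, written as escapes instead of pasted glyphs.
# HYPHEN_LIKE and describe() are illustrative names, not part of any library.
HYPHEN_LIKE = ["\u002D", "\u2010", "\u2011", "\u2212",
               "\uFE63", "\uFF0D", "\u2012", "\u2013"]

def describe(chars):
    """Return (code point, official Unicode name) pairs for the given characters."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]

for code_point, name in describe(HYPHEN_LIKE):
    print(code_point, name)
```

Every line of the output shows a different name, even though most of the characters are hard to tell apart on screen.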

Quotation marks are another character class with a huge variety of lookalikes. On Wikipedia, you can find a summary page with at least a dozen different quotation mark characters[1], and that does not even include the variety of apostrophes that look like single quotes.

There are even characters that look like regular letters. The Latin capital letter A (U+0041) looks almost exactly like the Cyrillic capital letter A (U+0410).
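A short sketch of what this means in code: the two letters compare as unequal, although they are rendered almost identically. The variable names are chosen for this example only.

```python
import unicodedata

latin_a = "\u0041"     # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"  # CYRILLIC CAPITAL LETTER A

# A string comparison looks at code points, not at glyph shapes.
same = latin_a == cyrillic_a
names = (unicodedata.name(latin_a), unicodedata.name(cyrillic_a))

print(same)
print(names)
```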

Why does this matter? Lookalike characters confuse your text mining tools and functions and cause errors that are difficult to debug. They also distort statistics on the frequency and distribution of words in the text: a word written once with a hyphen-minus and once with the regular hyphen is counted as two different words.
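The effect on word statistics is easy to reproduce. In the sketch below, the sample text is invented for illustration: it contains the same word three times, each time with a different hyphen-like character.

```python
from collections import Counter

# Three spellings of the same word; only the hyphen character differs
# (hyphen-minus, regular hyphen, non-breaking hyphen).
text = "twentieth\u002Dcentury twentieth\u2010century twentieth\u2011century"

# Naive whitespace tokenization followed by a frequency count.
counts = Counter(text.split())

print(len(counts))           # number of distinct "words" found
print(max(counts.values()))  # highest frequency of any single "word"
```

Instead of one word with a frequency of three, the counter reports three different words that each occur once.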

The best way to find out whether your text contains lookalike characters that may cause trouble is, as described above, to remove all expected characters and see what is left. I recommend replacing all lookalike characters with their prototype character, or at least with one consistent representation. For example, I would replace all the hyphen characters with the hyphen-minus (U+002D). When you write the code that does the replacement, avoid copying the special characters themselves into your code; refer to them by their Unicode code points instead, so that the code stays readable and unambiguous.
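Both steps can be sketched in a few lines of Python. The sketch rests on assumptions you will have to adapt: "expected" here means printable ASCII plus tab and line breaks, and `HYPHEN_MAP` covers only the hyphen lookalikes listed earlier. The names `EXPECTED`, `HYPHEN_MAP`, `leftover_chars` and `normalize_hyphens` are made up for this example.

```python
import re
from collections import Counter

# Assumption: "expected" characters are printable ASCII, tab, and line breaks.
EXPECTED = re.compile(r"[\x20-\x7E\t\n\r]")

# Hyphen lookalikes mapped to the hyphen-minus, keyed by code point
# so that no special character has to be pasted into the source.
HYPHEN_MAP = {cp: "\u002D" for cp in
              (0x2010, 0x2011, 0x2012, 0x2013, 0x2212, 0xFE63, 0xFF0D)}

def leftover_chars(text):
    """Remove all expected characters and count what remains."""
    return Counter(EXPECTED.sub("", text))

def normalize_hyphens(text):
    """Replace every mapped lookalike with the hyphen-minus (U+002D)."""
    return text.translate(HYPHEN_MAP)

sample = "pages 10\u201320, twentieth\u2010century, 5 \u2212 3"

# Report the leftovers by code point rather than by glyph.
print({f"U+{ord(c):04X}": n for c, n in leftover_chars(sample).items()})
print(normalize_hyphens(sample))
```

Note that mapping the en dash and the minus sign to the hyphen-minus discards a distinction that may matter in some texts, so it is worth deciding per project which characters belong in the map.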

On a side note, lookalike characters are occasionally used in a spoofing attack called the 'internationalized domain name homograph attack'[2]. The attacker tries to make you click on a link that appears to point to a URL you trust. However, one or more characters in the URL have been replaced by lookalikes, so the link can take you to a completely different site controlled by the attacker.
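The following sketch illustrates the idea with a made-up hostname in which the first letter of example.com is replaced by the Cyrillic small letter ie (U+0435). The helper `looks_suspicious` is hypothetical and deliberately crude: it flags every hostname that contains a non-ASCII character, which would also flag legitimate internationalized domains.

```python
def looks_suspicious(hostname):
    """Crude check: flag hostnames that contain any non-ASCII character."""
    return not hostname.isascii()

genuine = "example.com"
spoofed = "\u0435xample.com"  # first letter is CYRILLIC SMALL LETTER IE

# The two hostnames look alike but are different strings.
print(genuine == spoofed)
print(looks_suspicious(genuine), looks_suspicious(spoofed))

# Python's built-in "idna" codec shows the ASCII form that is used in DNS,
# which makes the difference visible.
print(spoofed.encode("idna"))
```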

Takeaways:

  • Check for lookalike characters and replace them with their 'prototype'

  • Do not copy lookalike characters into your code. Use their Unicode representations instead

[1] https://en.wikipedia.org/wiki/Quotation_mark#Unicode_code_point_table
[2] https://en.wikipedia.org/wiki/IDN_homograph_attack