In data, not everything is what it looks like. This is a lesson I had to learn the hard way when I ventured into text mining. When you have large amounts of text data, expect all types of exotic characters to appear, especially if the data is scraped from the web. A particularly tricky type of character is the lookalike: a character that looks like a familiar one but is actually encoded as a different one. Here are a few common examples.
The hyphen (-) is a punctuation mark used in several different situations, most often to break words into parts or to join separate words into a single word, as in twentieth-century writers. There is nothing much you can do wrong with such a simple character, right? Wrong!
Unicode has at least a dozen characters that look like the hyphen. There's the hyphen-minus character on your keyboard, and then there are its lookalikes:
- U+002D, the hyphen-minus on your keyboard
- U+2010, the regular hyphen
- U+2011, the non-breaking hyphen
- U+2212, the minus sign
- U+FE63, the small hyphen-minus
- U+FF0D, the full width hyphen-minus
- U+2012, the figure dash, which has the same width as digits
- U+2013, the en dash, used to indicate a range of values

There are many other similar-looking characters as well.
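To see that these really are distinct characters, you can ask Python for their official Unicode names. The following is a minimal sketch using the standard `unicodedata` module; the constant `HYPHEN_LOOKALIKES` and the helper `describe` are names made up for this illustration.

```python
import unicodedata

# The hyphen-like code points mentioned above, written as escapes
# rather than pasted in as literal characters.
HYPHEN_LOOKALIKES = "\u002D\u2010\u2011\u2212\uFE63\uFF0D\u2012\u2013"


def describe(chars):
    """Return (code point, Unicode name) pairs for each character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>"))
        for c in chars
    ]


for code_point, name in describe(HYPHEN_LOOKALIKES):
    print(code_point, name)
```

On screen most of these look nearly identical, but each one prints a different name, such as HYPHEN-MINUS, MINUS SIGN, or EN DASH.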
Another character class with a huge variety of lookalikes is quotation marks. On Wikipedia, you can find a summary page with at least a dozen different quotation mark characters - and that does not even include the variety of apostrophes that look like single quotes.
There are even characters that look like regular letters. The Latin letter A (U+0041) looks almost exactly like the Cyrillic letter A, which is encoded separately as U+0410.
Why does this matter? Lookalike characters will confuse your text mining tools and functions and result in errors that are difficult to debug. When you calculate statistics on the frequency and distribution of words in the text, lookalikes lead to strange results, because the same word written with two different hyphens is treated as two different words.
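Here is a small sketch of how this plays out in a word count. The token list is invented for illustration; the only difference between the two spellings is which hyphen character they contain.

```python
from collections import Counter

# Two spellings that render almost identically: one uses the
# hyphen-minus (U+002D), the other the non-breaking hyphen (U+2011).
tokens = [
    "twentieth\u002Dcentury",
    "twentieth\u2011century",
    "twentieth\u002Dcentury",
]

counts = Counter(tokens)

# The "same" word ends up under two separate keys.
print(len(counts))
print(counts["twentieth\u002Dcentury"], counts["twentieth\u2011century"])

# The same holds for letters: Latin A and Cyrillic A compare as unequal.
print("\u0041" == "\u0410")
```

A reader would count three occurrences of one word; the counter reports two different words with two and one occurrences.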
The best way to find out if your text contains lookalike characters that may cause trouble is, as described above, to remove all expected characters and see what is left. I recommend replacing all lookalike characters with their prototype character or at least one distinct representation. For example, I would replace all the hyphen characters with the hyphen-minus (U+002D). When you are writing the code to do the replacement, avoid copying all the special characters into your code and use the characters' Unicode representations instead.
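Both steps can be sketched in a few lines of Python. Everything named here (`EXPECTED`, `unexpected_characters`, `normalize_lookalikes`, and the choice of which quotation marks to map to which ASCII character) is an assumption made for this example, not a fixed recipe; adapt the expected set and the mapping to your own data.

```python
import string
from collections import Counter

# Characters we expect to see; anything else is worth a closer look.
# This set is an assumption and will differ from corpus to corpus.
EXPECTED = set(string.ascii_letters + string.digits + string.punctuation + " \n\t")

# Lookalikes written as Unicode escapes, never pasted in literally.
HYPHEN_LOOKALIKES = "\u2010\u2011\u2212\uFE63\uFF0D\u2012\u2013"
SINGLE_QUOTES = "\u2018\u2019"  # left and right single quotation marks
DOUBLE_QUOTES = "\u201C\u201D"  # left and right double quotation marks

# Map every lookalike to its "prototype" character.
TRANSLATION = str.maketrans({
    **{c: "\u002D" for c in HYPHEN_LOOKALIKES},  # hyphen-minus
    **{c: "\u0027" for c in SINGLE_QUOTES},      # apostrophe
    **{c: "\u0022" for c in DOUBLE_QUOTES},      # quotation mark
})


def unexpected_characters(text, expected=EXPECTED):
    """Count every character in text that is not in the expected set."""
    return Counter(c for c in text if c not in expected)


def normalize_lookalikes(text):
    """Replace known lookalikes with their prototype characters."""
    return text.translate(TRANSLATION)


sample = "twentieth\u2011century \u201Cwriters\u201D"
print(unexpected_characters(sample))
print(normalize_lookalikes(sample))
```

Running the check again on the normalized text should come back empty. Whatever is still left over at that point is a candidate for extending the mapping.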
On a side note, lookalike characters are occasionally used in a spoofing attack called 'internationalized domain name homograph attack'. In this attack, the attacker attempts to make you click on a link to a supposedly safe URL that you trust. However, one or more of the characters in the URL have been replaced by lookalike characters, so the link could take you to a completely different site programmed by the attacker.
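As a rough illustration of why such links are hard to spot, the sketch below compares a plain label with one in which the first letter has been swapped for its Cyrillic lookalike. The helper `scripts_in` is a hypothetical heuristic that takes the first word of each letter's Unicode name as its script; it is not how browsers or registrars actually detect these attacks.

```python
import unicodedata


def scripts_in(label):
    """Return the script names (first word of the Unicode name) of all letters."""
    return {
        unicodedata.name(c, "UNKNOWN").split()[0]
        for c in label
        if c.isalpha()
    }


genuine = "apple"
spoofed = "\u0430pple"  # starts with CYRILLIC SMALL LETTER A (U+0430)

print(genuine == spoofed)   # the strings differ although they look alike
print(scripts_in(genuine))
print(scripts_in(spoofed))
```

A label that mixes two scripts, as the spoofed one does here, is a reasonable signal to inspect more closely.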
- Check for lookalike characters and replace them with their 'prototype'
- Do not copy lookalike characters into your code. Use their Unicode representations instead