Text data, pretty much like every other type of data, often contains things that you don't expect. If your dataset is large enough, everything that can go wrong, will go wrong. Encoding problems, spelling errors, leading or trailing whitespace or line break characters, inconsistent formats, invisible control characters… expect all of that.
I can't tell how many hours of my life I have wasted searching for invisible control characters. These are characters that are not visible in the text and that do things like controlling certain formatting options, e.g. the non-breaking space or giving instructions to the printer. It is one of mankind's biggest mysteries how people manage to get these random characters into the data, but every time they do, a data scientist somewhere is going insane. Often times, people don't enter text the normal way through a keyboard, but by copy and pasting it from other applications, including emails and websites. These sources may contain all kinds of formatting instructions that get copied along with the text.
These types of special characters are devious. Often times, they will make data import tools fail with error messages that don't necessarily specify what exactly went wrong. Many databases and data analytics tools have trouble with certain special characters, particularly those pesky invisible control characters. Most functions do not expect them, and many algorithms will fail with obscure error messages. If you don't want to spend hours figuring out what went wrong, it's worth investing some time beforehand to deal with unexpected special characters in your text data.
I once ran into errors when trying to read a large CSV file into R. After hours of trying to figure out what was wrong, I started to read the file as plain text, removed all the characters I knew using regular expression replacement and checked what was left. After removing all alphabetical, numerical and whitespace characters using regular expression replacement functions, I was left with punctuation marks. Removing those was an iterative process since I did not have a complete list of all possible types of punctuation marks. Finally, when there was almost nothing left in my file, I discovered a few of these characters. It's important that you do this using a tool that has a visual representation of non-printable characters like Notepad++. Otherwise, you won't see that they are there.
Sometimes, simply removing this set of characters is enough, but this isn't always the best solution as it may mutilate words in the text. For example the German Umlaute
üshould preferably be replaced with
u. Diacritics like the accents (
ê) the cedilla (
ç), the tilde and others should be removed so that for example
êall simply become
In other cases, it may be better to replace unexpected characters with a space character to avoid merging multiple words separated by special characters into one word. You wouldn't want 'he/she/it' to be counted as 'hesheit' when you calculate word statistics, but rather as 'he', 'she' and 'it', so instead of simply removing '/', you should replace it with a space character before splitting the text into words.
Similarly, be careful when replacing or removing punctuation characters as it may have unintended consequences. If you replace all '.' characters with a blank space, a date like '19.09.1993' turns into '19 09 1993' which will then be split into the terms '19', '09' and '1993'. If you removed the punctuation characters, you may get the term '19091993'. Both results may not be what you expect.
- Special characters will often lead to confusing and unspecific error messages
- To search for special characters, strip the text field of all the expected characters like letters, numbers, whitespace and punctuation, until only the unexpected special characters are left
- Sometimes special characters should be removed, sometimes they should be replaced, e.g. with a whitespace character