Encoding in your data and IDE
One of the most annoying problems when working with text data is character encoding. The character encoding of a dataset determines how the characters in text files are stored as bytes and how the stored bytes respectively are interpreted as characters. Even though there are almost universal standards like the UTF-8 encoding, there are still many programs that do not stick to this standard for various reasons.
Usually, as long as you are working within a defined environment like a database, this won't be such a big problem, but as soon as you are writing data pipelines or ETLs that move data between different systems, incompatible character encodings are likely to occur.
At a previous job, I was dealing with tracking data from a website where the users could use a free text search field. When they entered a search term, it was sent to a web service that processed the request and - as a side effect - wrote the search term along with some metadata into an Oracle web database. This data was moved to a Hadoop file system with a program called Flume, parsed into a table format with Hive, written into tables that were queried with Impala and eventually presented in a reporting tool called Logi Analytics. As you can see, there are half a dozen different pieces of software involved, each of which had its own way of encoding characters. It is not surprising that when I looked at the search terms with Impala, it turned out that all the special characters people had entered were completely messed up.
Unfortunately, I don't have a solution for the general problem. Be aware of the danger of incompatible encodings, try to keep track of which piece of software uses which encoding and keep an eye open for hints of incorrectly encoded data. Specifically, look for entries that contain special characters and check if they are displayed correctly. Start at the very end of the data pipeline to see if the results look fine. If they don't, go back to the very beginning and follow the data, checking at every single step if the special characters have been messed up.
As a data scientist, you are often just a consumer of the data and not involved in the collection and integration of the data. You only get to see it after it already ran through a long pipeline of tools and transformations. Your first instinct might be to pull yourself up by the bootstraps and simply translate the incorrectly encoded characters to the correct ones. This is error-prone and just a fight against the symptoms of the underlying problem in the ETL pipeline. Always try to talk to the data engineers, quantify the extent of the problem, give them a few specific examples and ask them to fix the problem.
One more word of caution: You can also run into problems when you use non-standard encodings within your development environment. I specifically ran into this project when working on a project where we used different tools for querying a Teradata database. While I used Teradata Studio with the default settings, a colleague of mine used SQL Assistant. Every time we exchanged the SQL script that we were both working on, we ran into encoding problems.
Of course, it is common sense to not use special characters like the German Umlauts (Γ€, ΓΆ, ΓΌ) in variable or column names, but sometimes, you will have to type WHERE statements that include text comparisons where the constant string includes a special character. To avoid problems, make sure that you know what encoding your development environment uses and tell your colleagues to check if they are using the same encoding.
Takeaways:
  • Learn which type of character encoding your software uses
  • Be cautious when transferring data between different software systems
  • Check rows with values that contain special characters. Have they been correctly encoded?
  • Avoid special characters in your code
  • Don't fight the symptoms by trying to fix encoding problems at the very end of the data pipeline
Copy link