At a previous job, I was dealing with tracking data from a website where the users could use a free text search field. When they entered a search term, it was sent to a web service that processed the request and - as a side effect - wrote the search term along with some metadata into an Oracle web database. This data was moved to a Hadoop file system with a program called Flume, parsed into a table format with Hive, written into tables that were queried with Impala and eventually presented in a reporting tool called Logi Analytics. As you can see, there are half a dozen different pieces of software involved, each of which had its own way of encoding characters. It is not surprising that when I looked at the search terms with Impala, it turned out that all the special characters people had entered were completely messed up.