Binary data
On a computer, all data is stored in the form of bits, i.e. zeros and ones. To translate these bits into numbers or text characters, your computer needs to interpret them. A group of 8 bits is called a byte. A byte can represent a number between 0 and 255. It could also represent a signed number if one bit is used to indicate the sign: a plain sign bit gives a range of -127 to +127, while the two's complement encoding used in practice gives -128 to +127, as in the table below. Groups of bytes can be used to represent larger numbers, e.g.
Number of bytes | Name | Smallest value | Largest value |
1 (8 bits) | byte | -128 | 127 |
2 (16 bits) | short | -32,768 | 32,767 |
4 (32 bits) | int | -2,147,483,648 | 2,147,483,647 |
8 (64 bits) | int64 | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 |
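To make these ranges concrete, here is a minimal Python sketch using only the standard library. It assumes two's complement encoding, and the helper name `signed_range` is made up for this illustration. It also shows that the same eight bits give a different number depending on how they are interpreted.

```python
def signed_range(num_bytes):
    """Return (smallest, largest) value of a two's complement integer of this size."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

# The same eight bits, read under two different interpretations.
raw = bytes([0b11111111])
as_unsigned = int.from_bytes(raw, byteorder="big", signed=False)  # 255
as_signed = int.from_bytes(raw, byteorder="big", signed=True)     # -1

# Print the range for each row of the table above.
for size in (1, 2, 4, 8):
    print(size, signed_range(size))
```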
The names given to these data types are very inconsistent. For example, depending on the programming language or platform, the 8-byte signed integer is referred to as doubleword, longword, long long, quad, quadword, qword, int64 or i64 (u64 usually denotes the unsigned variant).[1] This can be confusing, and it gets even more confusing when it comes to floating-point numbers[2], a topic that goes far beyond the scope of this book. I'm just using this as an example to show you that your computer needs very detailed instructions to be able to read a stream of bits and turn it into a data table.
Files on your computer usually have a file format. On Windows, the extension (e.g. .csv or .exe) tells your computer what set of instructions to use to make sense of the bits. For some types of files like CSV or JSON, the contents of the files can be displayed in plain text. When you open these files, your computer displays them as human-readable text. They may not be particularly convenient to look at, but you are, at least theoretically, able to understand them.
This text representation is an additional layer of the data. A CSV file is first translated from bits to text and then from text to a data table in your analytics environment. While this is convenient for the data scientist who can peek at the text representation of the data to get a first impression, it comes at the cost of slower reading speed and larger file size since no compression is being used.
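The two translation steps can be sketched in Python with the standard library. The sample bytes and the column names are made up for illustration, and UTF-8 is an assumed encoding.

```python
import csv
import io

# Layer 0: the raw bytes as they would sit on disk (made-up sample data).
raw = b"name,age\nAda,36\nGrace,45\n"

# Layer 1: bytes -> text, which requires knowing the text encoding.
text = raw.decode("utf-8")

# Layer 2: text -> table. The CSV itself carries no type information,
# so every cell arrives as a string and must be converted explicitly.
rows = list(csv.DictReader(io.StringIO(text)))
ages = [int(row["age"]) for row in rows]

print(text)
print(ages)
```

Note that the age column only becomes a number because the code explicitly converts it. A CSV reader has to guess or be told the data type of each column.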
It's more efficient to store and process data in a binary format like, for example, Parquet. It allows you to store metadata along with your data, encode and compress each column individually with the best-suited compression algorithm and read only selected columns of the data without having to read the whole file. Parquet files are typically much smaller than the equivalent plain text files and faster to read. On top of that, the ability to store metadata avoids confusion about the data type of the columns.
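Reading and writing Parquet requires a dedicated library, so the following Python sketch does not use Parquet at all. It only illustrates the underlying idea with the standard library: the same made-up column of integers is stored once as text and once as fixed-width binary values, with general-purpose compression applied on top. The 4-byte little-endian layout is an assumption chosen for this example.

```python
import struct
import zlib

values = [1_000_000 + i for i in range(1000)]  # made-up column of integers

# Text representation: one decimal number per line, as in a one-column CSV.
as_text = "".join(f"{v}\n" for v in values).encode("ascii")

# Binary representation: each value as a 4-byte little-endian signed integer.
as_binary = struct.pack(f"<{len(values)}i", *values)

# General-purpose compression on top of the binary column.
as_compressed = zlib.compress(as_binary)

print(len(as_text), len(as_binary), len(as_compressed))

# Reading the binary column back needs no text parsing at all.
restored = list(struct.unpack(f"<{len(values)}i", as_binary))
```

In this example every value takes 8 bytes as text (seven digits plus a line break) but only 4 bytes in binary form. How much compression adds on top depends heavily on the data.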
When you open a binary file in a text editor, all you see is a seemingly arbitrary mix of letters, digits, punctuation and control characters. This happens because the text editor assumes that the bytes stand for characters and tries to display them. You need a so-called parser to read them: a fast piece of software that knows how to interpret the sequence of bits correctly. The parser can convert it into a data structure native to your programming language, for example a data frame, vector or list in R. To be able to do this, the parser must know what every bit and byte in the file stands for and how to read them correctly: In which order are the bytes of a multi-byte number stored (endianness)? Which bytes contain the name of the column and what text encoding is used for the letters? Which bytes contain the corresponding cell of the first record? How do you recognize the end of a record? Which decompression algorithm needs to be applied to which part of the data? All of this needs to be programmed into the parser in order to get clean, reliable data from a source file.
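To give an impression of what this means in practice, here is a minimal Python sketch of a writer and a parser for a tiny binary format. The format is entirely made up for this illustration: one byte for the length of the column name, the name in UTF-8, a 4-byte little-endian record count, and then one 4-byte little-endian signed integer per record. The function names `write_column` and `parse_column` are hypothetical as well.

```python
import struct

def write_column(name, values):
    """Serialize one integer column in a made-up format (illustration only)."""
    encoded_name = name.encode("utf-8")
    header = struct.pack("<B", len(encoded_name)) + encoded_name
    body = struct.pack(f"<I{len(values)}i", len(values), *values)
    return header + body

def parse_column(data):
    """Parse the made-up format back into (name, list of ints)."""
    if len(data) < 1:
        raise ValueError("file is empty")
    (name_len,) = struct.unpack_from("<B", data, 0)
    offset = 1
    name = data[offset:offset + name_len].decode("utf-8")
    offset += name_len
    if len(data) < offset + 4:
        raise ValueError("file ends before the record count")
    (count,) = struct.unpack_from("<I", data, offset)
    offset += 4
    expected = offset + 4 * count
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, found {len(data)}")
    values = list(struct.unpack_from(f"<{count}i", data, offset))
    return name, values

blob = write_column("temp", [20, -3, 300])
print(blob)                # the bytes as they would be stored
print(parse_column(blob))  # the column recovered by the parser

# Roughly what a text editor does: treat every byte as a character.
as_editor_text = blob.decode("latin-1")

# Byte order matters: the same two bytes give different numbers.
little = int.from_bytes(b"\x01\x00", "little")  # 1
big = int.from_bytes(b"\x01\x00", "big")        # 256
```

Even in this toy format, the parser has to know the byte order, the text encoding, where the name ends and how many records to expect. The length checks are a first step towards failing with a clear message instead of returning corrupt data.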
Parsing[3] binary data yourself is like making your own paper sheets before writing a novel. I hope that you will never encounter a situation where you need to write your own parser for data in a binary file format. While it is a great exercise for computer science students, it is usually a step outside the comfort zone for data scientists. It also requires a lot of knowledge about your programming language to write a parser that is fast enough to read large amounts of data this way. It requires even greater knowledge to write one that fails gracefully upon encountering an error or some unexpected structure within the data.
Fortunately, you will rarely have to manually write your own parser. For almost all data formats that you will encounter, there will usually be libraries or packages with functions for reading and writing them. For example, if you want to read an Excel file into R, you will likely use the 'xlsx' package. Nobody would ever think about writing their own parser for Excel files. And if you do… stop it!
Unfortunately, there are a few exceptions. One of these is data from machines. Sometimes, you have the questionable pleasure of working with a machine that sends data to a computer where a proprietary driver processes this data and presents it in a nice, formatted way. However, this front end may have no export function, or the export needs to be started manually, which makes it hard to integrate into an automated workflow. So, you decide to pull the data directly from the machine, only to discover that it consists of binary data files. Unless you have a way to utilize the existing driver, you'll need to write your own parser based on the documentation the hardware manufacturer has provided for the raw data file, if they have provided any documentation at all. With the 'Internet of Things' becoming a popular trend, this is becoming less and less of a problem as hardware manufacturers start providing APIs to their devices for machine-to-machine communication and standardized data transfer.
Another example is new data structures for which nobody has written a parser yet. What kind of esoteric data might that be? You don't have to look too far. In the so-called 'crypto summer' of 2017, named after the term cryptocurrency, I was interested in analyzing the data from the Bitcoin blockchain. I wanted to use R, and since there was no parser in R out there (at the time), I wrote my own. It was an interesting challenge and I learned a lot, but not being familiar with binary data storage, I was clearly out of my depth as a data scientist. While writing your own parser is an interesting challenge, I strongly recommend avoiding it if possible.
Takeaways:
At all costs, avoid writing your own parser for a binary file format. Use existing parsers whenever you can.
[1] https://en.wikipedia.org/wiki/Integer_(computer_science)
[2] https://en.wikipedia.org/wiki/Floating-point_arithmetic
[3] Parsing binary data is also referred to as deserializing or demarshalling.