Proprietary data formats

Proprietary data formats are file formats that, as a rule, can only be opened with the software that created them. While formats like CSV, XML, and JSON, as well as Parquet, Avro, and ORC, can all be read with open source tools available for many programming languages, some software vendors use their own, self-developed storage formats, often optimized for their specific use case. A typical example is the SAS7BDAT format that the statistics software SAS uses to store data.

Many proprietary data formats use compression and store the data in binary form. You will recognize this when you open such a file in a text editor and see mostly unreadable characters. The internal structure of a proprietary format is often undocumented because the software developer considers it a trade secret. But there are also open data formats that are tied to a specific application. A well-known example is the .rdata format that R uses to store objects for later use. In theory, anybody could study the code and write a parser to read and write .rdata files. But I've never seen any other tool use this format, most likely because the way it stores data is closely tied to the structure of objects in R.
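The text-editor check described above can be approximated in code. The following is a minimal sketch of a heuristic, not a format detector: the function name `looks_binary` and the sample size are assumptions made for illustration, and a file that passes the check is not guaranteed to be in an open format.

```python
import codecs


def looks_binary(path, sample_size=4096):
    """Heuristic guess: does the start of the file look like binary data?"""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)

    if not sample:
        return False  # an empty file contains nothing a text editor would garble
    if b"\x00" in sample:
        return True  # NUL bytes practically never occur in plain text

    # Try to decode the sample as UTF-8. The incremental decoder tolerates a
    # multi-byte character that was cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return True
    return False
```

A file for which `looks_binary` returns `True` is one you would see as unreadable characters in a text editor. The check only assumes UTF-8 text, so a text file in another encoding may be misclassified.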

When you are working with software that provides a proprietary or application-specific storage format, it can be useful to use that format. The SAS7BDAT format has several advantages over CSV if you are working within SAS: it can be read and written quickly, and it preserves column type information, among other benefits. When you use .rdata or .rds files with R, you can write and read objects from your R workspace with a single function call. Even complex objects like models, which are usually represented as nested lists with elements of different types, can be stored and retrieved just as easily as a single variable.
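The point about column type information can be illustrated with a small round trip. The sketch below uses Python's `pickle` module as a stand-in for an application-specific format; it is not SAS7BDAT or .rds, and the example row is made up for illustration.

```python
import csv
import io
import pickle
from datetime import date

# A hypothetical record with three differently typed columns.
row = {"id": 7, "price": 19.99, "sold_on": date(2024, 3, 1)}

# Round trip through CSV: the values are written as text and read back as text.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# Round trip through pickle: the objects are restored with their types.
pickle_row = pickle.loads(pickle.dumps(row))

csv_types = {name: type(value).__name__ for name, value in csv_row.items()}
pickle_types = {name: type(value).__name__ for name, value in pickle_row.items()}

print(csv_types)     # {'id': 'str', 'price': 'str', 'sold_on': 'str'}
print(pickle_types)  # {'id': 'int', 'price': 'float', 'sold_on': 'date'}
```

After the CSV round trip, the reader has to guess or be told which columns are numbers and which are dates. The application-specific format avoids that step, at the price of being readable only from Python.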

The most common use case for proprietary file formats is storing intermediate results. A model that you trained in R can be stored in a .rdata or .rds file and retrieved when it is needed. You may also want to store the results of expensive calculations, such as a sparse term-document matrix built from a corpus of documents or a user-item matrix built from a list of transactions for a collaborative filtering workflow.
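Storing an intermediate result might look like the following sketch. It again uses Python's `pickle` as an analog of an .rds file. The helper names `build_term_counts` and `load_or_compute` are hypothetical, and the term counting is only a cheap stand-in for a genuinely expensive calculation.

```python
import pickle
from pathlib import Path


def build_term_counts(documents):
    """Stand-in for an expensive step: count the terms in each document."""
    counts = []
    for doc in documents:
        doc_counts = {}
        for term in doc.lower().split():
            doc_counts[term] = doc_counts.get(term, 0) + 1
        counts.append(doc_counts)
    return counts


def load_or_compute(cache_path, compute, *args):
    """Return the stored result if the cache file exists, else compute and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = compute(*args)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

A call such as `load_or_compute("term_counts.pkl", build_term_counts, corpus)` runs the calculation once and reads the stored file on every later call. This sketch never invalidates the cache, so the file has to be deleted by hand when the input changes. Pickle files should only be loaded from sources you trust, which is one more reason to keep such files for your own intermediate results.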

Avoid proprietary file formats when you need to hand data over to someone else, unless you know for sure that they use software that can handle it. Ask which data format the person requesting the data would prefer. If you are the person requesting the data, ask for it in an open format.

When you are unlucky enough to have received data in a proprietary format that your analysis software cannot read, there is no quick workaround. Either re-request the data in an open format, or install the corresponding software, read the data in, and export it in an open format yourself. I would discourage the latter approach because an error-free transfer of data between two systems is already a difficult task in and of itself. Adding a third system between the source and the target system makes the transfer even more difficult and error-prone.

Takeaways:

  • Feel free to use proprietary data formats to save intermediate results

  • Do not use proprietary data formats to exchange data

  • When requesting data, ask for it to be stored in an open format
