Inconsistent timestamp formats

Dates and timestamps are the single most annoying data type in any programming language. There are dozens of different ways to write the same timestamp. Every database uses a different format and a different logic for date and time types. Parsing a text field of a given format works differently in almost every programming or scripting language. And don't even get me started on time zones and daylight saving time!

When you are working with string timestamps and time zones aren't an issue, the best format to use is, in my opinion, 'YYYY-MM-DD hh:mm:ss' where

  • YYYY is the year between 0000 and 9999

  • MM is the month number between 01 and 12

  • DD is the day of the month between 01 and 31

  • hh is the hour of the day[1] between 0 and 23

  • mm is the minute[2] between 0 and 59 and

  • ss is the second between 00 and 59.

There are several advantages to this approach. The lexicographical order is identical with the actual order of the dates/times of this string representation. This makes sorting the dates from newest to oldest or from oldest to newest simple. You don't need to convert it to a date type or a numeric type. Also, the constant length of 19 characters means that extracting the date is as simple as taking the substring from position 1 to position 10. Splitting by the space character gives a column with the date and a second column with the time.

There are a few variations of the above format that you should be aware of. Sometimes, you will find a Z after the time, e.g. '1967-07-18 20:49:55Z'. The Z is the time zone indicator for the Coordinated Universal Time UTC[3]. To indicate another time zone, add the offset from UTC directly after the time. The format to use for the offset is '+hh:mm' or '-hh:mm', conversely. For example, the time in Berlin, Germany, during the winter would have an additional '+01:00' at the end. Sometimes, '+0100' is used instead. Example: '1967-07-18T20:49:55+0100'. The T between the date part and the time part of the timestamp is just an alternative character to separate time and date, sometimes replacing the space character.

The disadvantage of string timestamps is pretty obvious: There is no direct way to perform date arithmetic like calculating time differences or adding a constant number of seconds or days to a given timestamp. Since this is often useful, I would advise you to always convert your timestamps to the datetime/timestamp format recommended in your environment. Just make sure to check if your assumptions about the format of the timestamps are correct for all records, not just for the first few that you may have inspected manually. Check the first few records to see which format is used, write a regular expression to capture this format and search for any record where the timestamp string does not match this expression.

Takeaways:

  • The format YYYY-MM-DD hh:mm:ss is useful because it allows you to sort by date without any modifications

  • If you want to use date and time arithmetic, e.g. adding hours, days or months to a given date/timestamp, convert to your environments preferred date or timestamp format

  • Use regular expressions to make sure that all rows use the same date/timestamp format

[1] Sometimes, hh is only the hour of the day from 0 to 12. This can lead to some tricky problems when you convert from timestamp to text. You will not get an error and all the values will look correct. But there will be no times with a hour between 13 and 24. [2] Some languages are case-insensitive. In this case, mm or MM typically stand for the month of the year while mi or MI stand for the minute [3] https://en.wikipedia.org/wiki/Coordinated_Universal_Time

Last updated