📈
beginners-guide-to-clean-data
  • A Beginner's Guide to Clean Data
  • Introduction
    • Foreword
    • The value of data
    • The intangible nature of data
  • Missing data
    • Missing values
    • Missing value patterns
    • Missing value representations
    • Missing observations
    • Truncated exports
    • Handling missing values
  • Data range problems
    • Unexpected values
    • Outliers
    • Freak cases
  • Common CSV problems
    • CSV basics
    • Quotation characters
    • Line breaks in text fields
    • Missing or insufficient headers
    • Trailing line breaks
    • Data export and import
    • Column type violations
    • Guidelines for working with CSV files
  • Text mining problems
    • Text mining basics
    • Encoding in your data and IDE
    • Special characters
    • Character entities
    • Lookalike characters
    • Dummy words
  • Type- and format-related problems
    • Inconsistent timestamp formats
    • Whitespace-padded strings
    • Binary data
    • Semi-structured log files
    • Proprietary data formats
    • Spreadsheets
  • Database-related problems
    • Numeric overflow
    • Duplicate rows
    • Table joins
    • Huge enterprise databases
    • Case sensitivity
    • Separating DDL and DML statements
    • Database performance considerations
    • Naming tables and columns
    • Poorly written SQL
    • Large monolithic SQL scripts
    • SQL orchestration
  • Data inconsistency
    • No single point of truth
    • Non-matching aggregated data
    • Internal inconsistency
  • Data modeling
    • Business concepts
    • Handling complexity
    • Interfaces
    • Generalized data models
    • Reproducibility
    • Feature stores and feature engines
    • Thinking pragmatic
  • Monitoring and testing
    • Automated testing
    • Measuring database load
  • Bonus content
    • Checklist for new data
Powered by GitBook
On this page

Was this helpful?

  1. Type- and format-related problems

Inconsistent timestamp formats

PreviousDummy wordsNextWhitespace-padded strings

Last updated 4 years ago

Was this helpful?

Dates and timestamps are the single most annoying data type in any programming language. There are dozens of different ways to write the same timestamp. Every database uses a different format and a different logic for date and time types. Parsing a text field of a given format works differently in almost every programming or scripting language. And don't even get me started on time zones and daylight saving time!

When you are working with string timestamps and time zones aren't an issue, the best format to use is, in my opinion, 'YYYY-MM-DD hh:mm:ss' where

  • YYYY is the year between 0000 and 9999

  • MM is the month number between 01 and 12

  • DD is the day of the month between 01 and 31

  • hh is the hour of the day[1] between 0 and 23

  • mm is the minute[2] between 0 and 59 and

  • ss is the second between 00 and 59.

There are several advantages to this approach. The lexicographical order is identical with the actual order of the dates/times of this string representation. This makes sorting the dates from newest to oldest or from oldest to newest simple. You don't need to convert it to a date type or a numeric type. Also, the constant length of 19 characters means that extracting the date is as simple as taking the substring from position 1 to position 10. Splitting by the space character gives a column with the date and a second column with the time.

There are a few variations of the above format that you should be aware of. Sometimes, you will find a Z after the time, e.g. '1967-07-18 20:49:55Z'. The Z is the time zone indicator for the Coordinated Universal Time UTC[3]. To indicate another time zone, add the offset from UTC directly after the time. The format to use for the offset is '+hh:mm' or '-hh:mm', conversely. For example, the time in Berlin, Germany, during the winter would have an additional '+01:00' at the end. Sometimes, '+0100' is used instead. Example: '1967-07-18T20:49:55+0100'. The T between the date part and the time part of the timestamp is just an alternative character to separate time and date, sometimes replacing the space character.

The disadvantage of string timestamps is pretty obvious: There is no direct way to perform date arithmetic like calculating time differences or adding a constant number of seconds or days to a given timestamp. Since this is often useful, I would advise you to always convert your timestamps to the datetime/timestamp format recommended in your environment. Just make sure to check if your assumptions about the format of the timestamps are correct for all records, not just for the first few that you may have inspected manually. Check the first few records to see which format is used, write a regular expression to capture this format and search for any record where the timestamp string does not match this expression.

Takeaways:

  • The format YYYY-MM-DD hh:mm:ss is useful because it allows you to sort by date without any modifications

  • If you want to use date and time arithmetic, e.g. adding hours, days or months to a given date/timestamp, convert to your environments preferred date or timestamp format

  • Use regular expressions to make sure that all rows use the same date/timestamp format

[1] Sometimes, hh is only the hour of the day from 0 to 12. This can lead to some tricky problems when you convert from timestamp to text. You will not get an error and all the values will look correct. But there will be no times with a hour between 13 and 24. [2] Some languages are case-insensitive. In this case, mm or MM typically stand for the month of the year while mi or MI stand for the minute [3]

https://en.wikipedia.org/wiki/Coordinated_Universal_Time