The next step is to make sure that every line is a single, complete record and that there are no multi-line logs. First, you look at the first few lines of the file to get a feel for the structure of the lines. It may look like every line begins with a timestamp formatted as YYYY-MM-DD hh:mm:ss
, but keep in mind that you are only looking at the first few hundred lines. If your file has millions of lines, you need to check your assumption that the rest of the lines follow the same pattern. In this case, you would check if all lines match the following regular expression: '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*'
. Admittedly, this regex would also match '9999-99-99 99:99:99'
, which is obviously not a valid timestamp. You can make it more specific, but usually this regex is good enough to detect possible problems. Please also note that the last part, the '.*'
, is not always necessary. Some regular expression tools require this to be able to match the line containing the pattern, others don't.