Semi-structured log files
Log files from software applications are an interesting type of data in that they are semi-structured and usually need to be parsed into tables before they can be analyzed properly. They typically come in flat files, meaning in a plain, human-readable text format with a .log file extension. Almost every software program produces log messages. Typically, each row represents an event that happened during the execution of the software. Some programs write down error messages they encounter during their runtime while others take a note every time a user interacts with the program.
A webserver for example processes requests from users and delivers websites. The webserver logs contain the requests that the users triggered by clicking on stuff on the website, the timestamp of the request and the response status code. The latter tells us if the request was processed correctly or if the webserver failed to deliver the site, for example because the user requested a site that has been removed from the server - the infamous 404 error that most Internet users are familiar with.
Webserver logs basically give you free web tracking. For example, you can use them to
  • Calculate standard web tracking metrics like bounce rate, page views and the number of visitors
  • Identify the most popular sites on your website
  • Build a workflow to visualize customer journeys
  • Analyze crawler traffic from Google, Bing etc.
  • Identify malicious crawlers that steal your content or data
  • Detect broken or incorrect links by monitoring 4xx errors, e.g. 404 'page not found'
  • Identify server problems by looking for 5xx HTTP status codes, e.g. 503 'service unavailable'
This is just a small number of things you can do with the log data from a webserver - data that you get basically for free. Of course, the webserver logs do not completely replace a professional web tracking service. These services use parameterized tracking pixels that contain additional information. They also offer tools to do the tedious data cleaning, bot traffic removal, aggregation and visualization for you. But if you are willing to get your hands dirty and dig into the log files, there is a lot of information to be found.
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
The above example[1] shows a log record from an Apache HTTP Server. You can see the IP address of the client, the HTTP GET request, the HTTP status code and some information on the browser used to send the request, among other things.
In most types of log files, every line represents one logged event. While there is a fixed structure within these lines - each one may start with a timestamp, followed by an IP address, followed by a log level indicator, followed by the log message - it is not as structured as for example a CSV file that you can often directly import into your application. When you write the code to read the data into your program, you need to tell the program where the timestamp ends and the IP address begins. That way, it can put the timestamp into one column in a data table, the IP address into another one and so on. Getting the data into this table structure is often required before you can apply further transformations on it or query it according to your needs. The process of splitting the lines of the log file into different fields is called 'parsing' the log and the part of the code that does it is called 'parser'. It is also sometimes referred to as deserializing or unmarshalling[2] the log file.
It is typically helpful to first import the entire log file into your program as a single text column and start the parsing from there. This also allows you to directly compare the raw log text with the parsed columns and spot mistakes more easily.
The next step is to make sure that every line is a single, complete record and that there are no multi-line logs. First, you look at the first few lines of the file to get a feel for the structure of the lines. It may look like every line begins with a timestamp formatted as YYYY-MM-DD hh:mm:ss, but keep in mind that you are only looking at the first few hundred lines. If your file has millions of lines, you need to check your assumption that the rest of the lines follow the same pattern. In this case, you would check if all lines match the following regular expression: '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*'. Admittedly, this regex would also match '9999-99-99 99:99:99', which is obviously not a valid timestamp. You can make it more specific, but usually this regex is good enough to detect possible problems. Please also note that the last part, the '.*', is not always necessary. Some regular expression tools require this to be able to match the line containing the pattern, others don't.
More often than not, you will find that there are indeed multi-line logs in your log file. Since many programs are written in Java, you will often come across the stack trace, a multi-line error description that is returned when Java encounters an exception. In other instances, your log may contain text entered by the user, for example a SQL query that was logged by the program. User-created text may contain unescaped line breaks showing up as line breaks in your log file.
If your log file contains multi-line logs, you need to identify which rows are the beginning of a log message. Since log messages typically start with a fixed pattern, most often a timestamp, you can easily recognize them. Use the regular expression from above to create a new column IS_START_OF_MESSAGE that is 1 if the row starts with this pattern and 0 if it doesn't.
Next, create another column MESSAGE_ID that is the cumulative sum of IS_START_OF_MESSAGE. In R, you can use the function cumsum() for this purpose. As you can see in the table below, this creates an ID for every log message. It only increases when a new log message starts.
MESSAGE
IS_START_OF_MESSAGE
MESSAGE_ID
2019-08-01 15:21:00 hello world
1
1
2019-08-01 15:22:03 start of
1
2
a multi-line log message
0
2
2019-08-01 15:25:12 and another
1
3
multi-line message
0
3
2019-08-01 15:22:03 ciao world
1
4
In order to get one log message per row, you can now group by MESSAGE_ID and use a concatenate function with newline as the separator to aggregate MESSAGE. This way, you will get a table with a unique MESSAGE_ID in the first and the concatenated multi-line log in the second column.
MESSAGE_ID
CONCATENATED_MESSAGE
1
2019-08-01 15:21:00 hello world
2
2019-08-01 15:22:03 start of\n a multi-line log message
3
2019-08-01 15:25:12 and another\n multi-line message
4
2019-08-01 15:22:03 ciao world
Another important check is to make sure that the log messages follow the same pattern from beginning to end. Try to write a regular expression that captures as much of the structure as possible. Check for log messages that don't match the pattern and modify the pattern in order to match them as well. This is an iterative game that you will need to play for a while. In many cases, it may be impossible to find a pattern that captures 100% of the messages[3], but if you make it to 99% coverage, that's typically enough for most use-cases.
The usual process of parsing the log file involves writing regular expressions to extract the individual fields from the raw text column. Typically, you can use the pattern from above and modify it by inserting capture groups to extract all relevant fields. After extracting the fields and creating a new table with the log message ID and all its attributes, you can - and should - check the data quality using all the methods for structured data that have been described in previous chapters.
Takeaways:
  • Use regular expressions to make sure that all lines start with the same pattern, e.g. a timestamp
  • Check for multi-line log messages With large log files, you may not be able to parse 100% of the lines correctly. Try to get to somewhere around 99% and stop there before wasting too much time
[1] Source of the example: https://httpd.apache.org/docs/2.4/logs.html [2] https://en.wikipedia.org/wiki/Serialization [3] Except for .* of course, but that would hardly qualify as 'capturing as much of the structure as possible'
Copy link