Juha-Matti Santala
Community Builder. Dreamer. Adventurer.

Parsing nginx server logs with regular expressions

Batteries included is a blog series about the Python Standard Library. Each day, I share insights, ideas and examples for different parts of the library. Blaugust is an annual blogging festival in August where the goal is to write a blog post every day of the month.

Python’s regular expressions module re provides an interface to do pattern matching with strings using regular expressions.

A very short intro to regular expressions

Regular expressions, or regex amongst friends, are a way to find patterns in strings. They are very powerful but also rather cryptic and hard to read, even with experience.

Regex patterns are built from tokens, each of which matches a particular kind of text.

Most characters match themselves when you put them into the pattern as-is:

  • a matches a
  • 10 matches 10

Some characters have a special meaning in patterns and need to be escaped with \ to be matched literally:

  • \[ matches [
  • \. matches .

Some special tokens can be created with backslashes:

  • \w means “any word character”
  • \s means “any whitespace”
  • \d means “any digit”

You can match any one character from a set or range with []:

  • [0-9] matches any digit (same as \d)
  • [A-Z] matches any capital letter between A and Z

You can combine them with quantifiers:

  • * means “0 or more” of the previous token
  • + means “1 or more” of the previous token
  • {2} means “exactly 2” of the previous token
  • {1,3} means “at least 1 and at most 3” of the previous token (no space after the comma, otherwise Python’s re treats the braces as literal text)

You can learn more about the available options in Python’s regular expressions documentation.
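To make the tokens above a bit more concrete, here's a small sketch that tries a few of them with re.fullmatch, which only succeeds when the whole string matches the pattern. The sample strings are made up for illustration.

```python
import re

# (pattern, text, expected result) triples; the sample strings are invented
checks = [
    (r"a", "a", True),            # a literal character matches itself
    (r"\[", "[", True),           # an escaped special character
    (r"\d", "7", True),           # any digit
    (r"[A-Z]", "q", False),       # lowercase q is outside the range A-Z
    (r"\d*", "", True),           # * allows zero repetitions
    (r"\d+", "", False),          # + needs at least one repetition
    (r"\w{2}", "ab", True),       # exactly two word characters
    (r"\d{1,3}", "1234", False),  # at most three digits, so four is too many
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in checks]

for (pattern, text, expected), result in zip(checks, results):
    print(f"{pattern!r} against {text!r}: {result} (expected {expected})")
```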

Many ways to achieve the same thing

Regular expressions are very powerful, but they are also hard to read and prone to errors, especially false positives. In my experience, you almost always need to make some sort of a tradeoff between how strict the pattern is and how much time you spend writing it.

If you’re using the regular expression on a one-off script or against an input that you know doesn’t have any surprises there, you can use much looser definitions for patterns compared to running it on an on-going service that may receive any sort of input from users.

Often it’s not worth spending all the extra time to write a perfect pattern that would work in every situation, because it’s really hard to come up with an error-proof pattern. Usually “good enough” is enough, and what “good enough” means needs to be evaluated case by case.

I’ll show an example of this when we get to parsing the IP address from the logs below.

Parsing nginx server logs

I used nginx-log-generator to generate the examples used in this blog post.

You can find a file with generated logs at https://gist.github.com/Hamatti/d7dcda5ac6ee4da4cf4fa3ad39118991.

Let’s take the first log entry from that list and analyse its structure:

106.169.54.184 - - [27/Jan/2024:15:06:45 +0000] "GET /Right-sized/Customer-focused_dedicated_complexity.png HTTP/1.1" 200 1165 "-" "Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/1952-20-02 Firefox/37.0"

For readability, I’ll split it into multiple lines. This will also help us construct the regex pattern.

106.169.54.184
- -
[27/Jan/2024:15:06:45 +0000] 
"GET /Right-sized/Customer-focused_dedicated_complexity.png HTTP/1.1"
200 
1165 
"-" 
"Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/1952-20-02 Firefox/37.0"

IP address and trade-offs

The first line is the IP address in IPv4 format: four series of digits separated by periods. As a regular expression, this can be matched with (\d+\.?){4}. We have \d for any digit (0-9) and a + to say we need 1 or more. Then comes a literal period \. that is made optional with ?. Finally, we want four of these.

I mentioned in the previous section that there are many ways to reach the same practical result, with varying tolerance to false positives.

(\d+\.?){4} matches our desired IP address of 106.169.54.184, but it also matches any number with 4 or more digits (like 10000 or 1234567890) and any sequence of at least 4 digits with up to 4 periods between them (the 4th period can only appear at the end), such as 12.45 or 1.999.999., neither of which is a valid IP address.
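The false positives are easy to see in the REPL. This sketch runs the loose pattern against the address from the log line and against the non-addresses mentioned above:

```python
import re

loose = r"(\d+\.?){4}"

# The first sample is the address from the log line; the others are not IP addresses
samples = ["106.169.54.184", "10000", "1234567890", "12.45", "1.999.999."]

# re.match tries the pattern at the start of each string
matched = {sample: re.match(loose, sample) is not None for sample in samples}

for sample, is_match in matched.items():
    print(f"{sample!r}: {is_match}")
```

Every sample matches, including the four that are not addresses.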

We could look into the spec of IPv4 and make sure that the sequence is exactly a legit IP address. This would look something like this (from Stack Overflow):

((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}

Depending on what we are doing with our parsing, one or the other might be a better option. The latter pattern is much more specific to an IP address but if we can trust that our input is always an IP address or that the downside of a false positive match is not bad, we could choose the earlier one.
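Here's a sketch that runs the stricter pattern against a few inputs. Since this series is about the standard library, it also compares the results with a different technique: the ipaddress module, which is not a regex at all and raises ValueError for anything that isn't a valid address. The helper name is_ipv4 and the sample list are my own choices for this example.

```python
import ipaddress
import re

strict = r"((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}"


def is_ipv4(text):
    """Return True if ipaddress accepts text as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True


# Only the first sample is a valid address
samples = ["106.169.54.184", "10000", "12.45", "1.999.999.", "999.999.999.999"]

strict_results = {s: re.match(strict, s) is not None for s in samples}
stdlib_results = {s: is_ipv4(s) for s in samples}

for s in samples:
    print(f"{s!r}: regex={strict_results[s]} ipaddress={stdlib_results[s]}")
```

For these samples both approaches agree. If you only need to validate a string you already extracted, ipaddress is arguably easier to read than the strict pattern.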

Since the IP address is always at the start of the line, we can add ^ in the beginning to make sure we only match this pattern at the start.

Our pattern at this point:

^(\d+\.?){4}
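The effect of ^ is easiest to see with re.search, which scans the whole string (re.match already only matches at the start of the string). The text below is an invented string where the address is not the first thing on the line:

```python
import re

text = "client 106.169.54.184"  # invented: the address is not at the start

unanchored = re.search(r"(\d+\.?){4}", text)
anchored = re.search(r"^(\d+\.?){4}", text)

print(unanchored.group(0))  # the address found in the middle of the string
print(anchored)             # None, because the string does not start with digits
```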

Capturing groups

Sometimes it’s enough for us to know that something matches a pattern (if we for example want to filter out non-matching lines from some input) but often we want to capture individual parts of the input to be processed further.

In regular expressions, we can create capture groups by surrounding a part of a pattern with (). For example, to capture our entire IP address from the previous pattern, we can do:

^((\d+\.?){4})

Here, the outermost parentheses signal that we want everything the pattern inside them matches to be stored in a group. In this example, we’ll have a lot of things we might want to capture, so it’s a good practice to give these groups names. This can be done by starting the inside of the parentheses with ?P<name> where name is the name you want to give this group:

^(?P<ip>(\d+\.?){4})

A full example in Python would be:

import re

line = '106.169.54.184'

## Without named group
pattern = r'^((\d+\.?){4})'

matches = re.match(pattern, line)

print(matches.groups())
# should print
# ('106.169.54.184', '184')

## With named group
pattern = r'^(?P<ip>(\d+\.?){4})'

matches = re.match(pattern, line)

print(matches.groupdict())
# should print
# { 'ip': '106.169.54.184' }

Non-capture groups

As you can see from the example above, sometimes we need to create groups (to apply quantifiers to them) but we don’t really care about them in the final result.

We can mark these groups with ?: to tell the engine to not capture these groups in the final result.

import re

line = '106.169.54.184'

## Without named group and without capturing the inner group
pattern = r'^((?:\d+\.?){4})'

matches = re.match(pattern, line)

print(matches.groups())
# should print
# ('106.169.54.184',)
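Named and non-capturing groups can be combined. The sketch below also handles the case where a line doesn't match at all, because re.match returns None then and calling .groups() on it would fail. The function name extract_ip and the second input are invented for this example.

```python
import re

# Named outer group for the address, non-capturing inner group for the repetition
IP_PATTERN = re.compile(r"^(?P<ip>(?:\d+\.?){4})")


def extract_ip(line):
    """Return the leading IP-like string from line, or None if there is none."""
    match = IP_PATTERN.match(line)
    if match is None:
        return None
    return match.group("ip")


print(extract_ip("106.169.54.184 - -"))
print(extract_ip("not a log line"))
```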

Parsing the rest

Next, there’s a separator of - - which we can match with literal characters:

^(?P<ip>(?:\d+\.?){4}) - -

Next is a timestamp inside square brackets. Once again, we could match the exact format of valid timestamps, or we can just capture everything inside the square brackets:

\[.*\]

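Here, we escape the square brackets with a backslash (\[ and \]) to tell the regex engine that we want to match the literal characters [ and ]. Inside, we match anything (.) and any number of them (*). Note that .* is greedy and runs to the last ] on the line, which works for our example line because the timestamp contains the only ].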

Alternatively, you can be more specific (adapted from Stack Overflow):

\[(?P<day>[0-9]{2})\/(?P<month>[a-zA-Z]{3})\/(?P<year>[0-9]{4}):(?P<hour>[0-9]{2}):(?P<minute>[0-9]{2}):(?P<second>[0-9]{2})\s+(?P<timezone>[+-][0-9]{4})\]
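A third route, sketched here, is to capture everything inside the brackets and hand it to datetime.strptime, which validates the format and gives back a datetime object. The format string is my reading of the timestamp in the example line, and %b relies on the locale using English month abbreviations, so treat both as assumptions.

```python
import re
from datetime import datetime

line = "[27/Jan/2024:15:06:45 +0000]"

# Lazily capture everything between the first [ and the next ]
match = re.match(r"\[(?P<timestamp>.*?)\]", line)

# Assumed format: day/month abbreviation/year:hour:minute:second offset
parsed = datetime.strptime(match.group("timestamp"), "%d/%b/%Y:%H:%M:%S %z")

print(parsed.isoformat())
```

strptime raises ValueError if the text doesn't fit the format, so the validation happens there instead of in the pattern.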

At least for me, a lot of creating regexes is a very iterative process.

I often use either the Python REPL or online tools like regex101.com to build these complex patterns.

Following the timestamp, we have the request, which consists of its method, path and protocol:

"GET /Right-sized/Customer-focused_dedicated_complexity.png HTTP/1.1"

The type of the request is one of the HTTP request methods. We have multiple options for how to handle this part of the pattern as well.

First, we could list out all the possible values:

(?P<method>GET|HEAD|POST|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)

or we can make a more generic pattern and match any word in capitals:

(?P<method>[A-Z]+)

Since there’s only a small number of valid method names, I’d choose the first option for clarity and simplicity.
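The difference between the two options shows up with a word that looks like a method but isn't one. In this sketch, FETCH is an invented example of such a word:

```python
import re

explicit = r"(?P<method>GET|HEAD|POST|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)"
generic = r"(?P<method>[A-Z]+)"

candidates = ["GET", "PATCH", "FETCH"]  # FETCH is not an HTTP request method

# fullmatch requires the whole candidate to match the pattern
explicit_ok = [bool(re.fullmatch(explicit, c)) for c in candidates]
generic_ok = [bool(re.fullmatch(generic, c)) for c in candidates]

print(explicit_ok)
print(generic_ok)
```

The explicit list rejects FETCH while the generic pattern accepts any run of capital letters.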

The next part of the request is a path. Since it can be almost anything, I’d match anything between the two whitespace characters around it:

\s(?P<path>.*?)\s

Finally, for the request, we match the protocol:

(?P<protocol>.*?)

If we know for sure which protocols we are matching against, we could create a much more specific pattern but to be honest, I have no idea what the spec for this protocol part is.

In both of these previous ones, there’s an extra ? after the * quantifier. This makes the “get everything” pattern of .* “lazy”, which means it will stop as soon as the other parts of the pattern allow. There’s a good explanation of greediness and laziness of regular expressions at https://www.regular-expressions.info/repeat.html.
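Here's a short sketch of the difference, using two quoted values in the style of the end of the log line (the shortened user agent is made up for the example):

```python
import re

text = '"-" "Mozilla/5.0"'

# Greedy: .* grabs as much as it can, so the match ends at the LAST quote
greedy = re.match(r'"(.*)"', text).group(1)

# Lazy: .*? grabs as little as it can, so the match ends at the NEXT quote
lazy = re.match(r'"(.*?)"', text).group(1)

print(repr(greedy))
print(repr(lazy))
```

The greedy version swallows both values and the quotes between them, while the lazy one stops after the first value.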

Before we can move on, we need to wrap this full request pattern inside double quotes:

\"(?P<method>GET|HEAD|POST|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)\s(?P<path>.*?)\s(?P<protocol>.*?)\"

Next, we match the HTTP status code. Status codes tell the client (or user), in numeric form, what happened with the request. The code is always a three-digit number in the range 100-599, and only some of the numbers in that range are actually defined.

We once again need to decide on an error threshold we’re okay with. Here, I decided to go with a pattern that matches any number in that range but isn’t concerned with whether it is a defined code or not.

(?P<status_code>[1-5][0-9][0-9])
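If you later want to know whether a matched code is one that is actually defined, one option is a second check after the regex. This sketch uses the standard library's http.HTTPStatus enum for that. It only knows the codes included in your Python version, and the helper name is_known_status is my own.

```python
import re
from http import HTTPStatus

STATUS_PATTERN = re.compile(r"[1-5][0-9][0-9]")


def is_known_status(code):
    """Return True if http.HTTPStatus defines the given status code."""
    try:
        HTTPStatus(int(code))
    except ValueError:
        return False
    return True


for code in ["200", "404", "299", "600"]:
    in_range = STATUS_PATTERN.fullmatch(code) is not None
    print(code, in_range, in_range and is_known_status(code))
```

299 passes the pattern because it is in range, but it is not a status that HTTPStatus knows. 600 is rejected by the pattern already.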

Next, we have the size of server response in bytes:

(?P<size>\d+)

Next is the HTTP referrer, if there is one, which is once again anything inside double quotes:

\"(?P<referrer>.*?)\"

and after that, the user agent which is the same pattern as referrer:

\"(?P<user_agent>.*?)\"

When we put all of these together, we get:

^(?P<ip>(?:\d+\.?){4}) - - \[(?P<timestamp>.*)\] \"(?P<method>GET|HEAD|POST|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)\s(?P<path>.*?)\s(?P<protocol>.*?)\" (?P<status_code>[1-5][0-9][0-9]) (?P<size>\d+) \"(?P<referrer>.*?)\" \"(?P<user_agent>.*?)\"$

which parses each line and collects the individual elements into a dictionary with named keys.

import re

# The full pattern, built from the pieces above
pattern = r'^(?P<ip>(?:\d+\.?){4}) - - \[(?P<timestamp>.*)\] \"(?P<method>GET|HEAD|POST|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)\s(?P<path>.*?)\s(?P<protocol>.*?)\" (?P<status_code>[1-5][0-9][0-9]) (?P<size>\d+) \"(?P<referrer>.*?)\" \"(?P<user_agent>.*?)\"$'
line = '106.169.54.184 - - [27/Jan/2024:15:06:45 +0000] "GET /Right-sized/Customer-focused_dedicated_complexity.png HTTP/1.1" 200 1165 "-" "Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/1952-20-02 Firefox/37.0"'

# groupdict() returns only the named groups
matches = re.match(pattern, line)
print(matches.groupdict())

prints

{
  'ip': '106.169.54.184',
  'timestamp': '27/Jan/2024:15:06:45 +0000',
  'method': 'GET',
  'path': '/Right-sized/Customer-focused_dedicated_complexity.png',
  'protocol': 'HTTP/1.1',
  'status_code': '200', 
  'size': '1165',
  'referrer': '-', 
  'user_agent': 'Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/1952-20-02 Firefox/37.0'
}
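To run this over many lines, here's a sketch that compiles the pattern once and skips lines that don't match. It writes the pattern with re.VERBOSE, which ignores whitespace and allows comments inside the pattern. Because of that, the literal spaces are written as \s here, which is slightly looser since \s matches any whitespace. The function name parse_lines and the second input line are invented for this example.

```python
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r"""
    ^(?P<ip>(?:\d+\.?){4})            # IPv4 address, loosely matched
    \s-\s-\s                          # the "- -" separator
    \[(?P<timestamp>.*?)\]\s          # timestamp inside square brackets
    "(?P<method>GET|HEAD|POST|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)\s
    (?P<path>.*?)\s                   # request path
    (?P<protocol>.*?)"\s              # protocol
    (?P<status_code>[1-5][0-9]{2})\s  # HTTP status code
    (?P<size>\d+)\s                   # response size in bytes
    "(?P<referrer>.*?)"\s             # referrer
    "(?P<user_agent>.*?)"$            # user agent
    """,
    re.VERBOSE,
)


def parse_lines(lines):
    """Yield a dict of named groups for every line that matches LOG_PATTERN."""
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is not None:
            yield match.groupdict()


lines = [
    '106.169.54.184 - - [27/Jan/2024:15:06:45 +0000] "GET /Right-sized/Customer-focused_dedicated_complexity.png HTTP/1.1" 200 1165 "-" "Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/1952-20-02 Firefox/37.0"',
    "this line is not a log entry",  # invented, to show that bad lines are skipped
]

entries = list(parse_lines(lines))
status_counts = Counter(entry["status_code"] for entry in entries)

print(len(entries))
print(status_counts)
```

With a real log file you could pass the open file object to parse_lines instead of the list, since iterating over a file yields its lines.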

Conclusion

Regular expressions are very powerful but also cryptic looking, hard to read and error prone.

I often find it more readable and easier to modify when I parse strings with Python’s string tools like split(), replace() and string slicing.

Regular expressions are still one of my favorite things in programming and that’s why I wanted to kick off this Batteries included series with them.