Dragonfly Photo

Parsing Apache Logs Into A SQL Database

Background

Parsing Apache HTTPD access logs into a database structure allows you to easily run queries and reports against it to better understand web traffic and detect issues. By using the pip installable apache_log_parser[1] package and the sqlite3 python module[2] that comes as part of the standard python library, we can quickly parse these access logs and insert entries into a sqlite3 database.

Approach Notes

Couple things to note on the approach taken in the code:

  • sqlite3 was selected for its native support and ease of use, but other DBMSes can be similarly used
  • An auto-incremented id column was not included in the schema as sqlite3 does not recommend such a column unless necessary[4]. You will likely want to add one for other database systems
  • apache_log_parser produces a dictionary of apache log entry parts. Given this, the available dictionary keys are dependent on the log pattern of the access.log which supplied to the make_parser() function
  • The columns in the code snippet are just a subset of the total keys/values returned by the parser. You can easily see all keys and values generated from a given line by calling pprint() on it
  • An additional 'date' key and value are added in the code (in a sqlite date function friendly way) to allow for date based queries and criteria

Code

#!/usr/bin/env python3

import sys
import sqlite3
import apache_log_parser

if len(sys.argv) != 2:
    print("Usage:", sys.argv[0], "/path/to/access.log")
    exit(1)

conn = sqlite3.connect('/tmp/logs.db')
cur = conn.cursor()

cur.execute("""
                CREATE TABLE IF NOT EXISTS logs (
                    status INTEGER,
                    request_method TEXT,
                    request_url TEXT,
                    date TEXT
                )
            """)

# Pattern below is from the LogFormat setting in apache2.conf/httpd.conf file
# You will likely need to change this value to the pattern your system uses
parser = apache_log_parser.make_parser(
                   "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\""
         )

log_file = sys.argv[1]

with open(log_file) as f:

    for line in f:

        d = parser(line)

        # Line below adds minimalistic date stamp column 
        # in format that sqlite3 date functions can work with
        d['date'] = d['time_received_datetimeobj'].date().isoformat()

        cur.execute("""
                        INSERT INTO logs ( status,  request_method,  request_url,  date)
                                  VALUES (:status, :request_method, :request_url, :date)
                        """, d)

cur.close()

conn.commit();
conn.close();

Example Run

Sample access.log entries:

# cat /var/log/apache2/access.log
::1 - - [25/Nov/2018:20:13:28 -0500] "GET /index.html HTTP/1.1" 200 3380 "-" "User agent string"
::1 - - [25/Nov/2018:12:15:26 -0500] "GET /notthere HTTP/1.1" 404 498 "-" "User agent string"

Run the code above:

$ ./httpd_log_parser.py /var/log/apache2/access.log

Examine the results:

$ sqlite3 -header -column /tmp/logs.db 'SELECT * FROM logs'
status      request_method  request_url  date      
----------  --------------  -----------  ----------
200         GET             /index.html  2018-11-25
404         GET             /notthere    2018-11-25

Sample Queries

After loading log entries you can start running useful queries on them.

  • Show number of requests by response status:
      SELECT count(1) as nReqs,
             status
        FROM logs
    GROUP BY (status)
    ORDER BY nReqs DESC;
    
    nReqs       status    
    ----------  ----------
    4           200       
    2           404   
    
  • What are the most common 404'd URLs:
      SELECT count(1) as nReqs,
             request_url
        FROM logs
       WHERE status = 404
    GROUP BY (request_url)
    ORDER BY nReqs DESC;
    
    nReqs       request_url
    ----------  -----------
    2           /notthere 
    

Final Thoughts

The above method processes log files after they have been written out by the apache process. This can be particularly useful as log files can be offloaded to a different system and analyzed with your DB of choice without impacting the webserver performance. An alternative to this approach is to use something like mod_log_sql (aka libapache2-mod-log-sql) which is a purpose-built apache module that logs requests directly to a MySQL DB at request time. There will be trade-offs to weigh with either option but hopefully the information provided will give you a head start on your journey. Take care and good luck!

References

  1. GitHub - rory/apache-log-parser - https://github.com/rory/apache-log-parser
  2. [Python] sqlite3 — DB-API 2.0 interface for SQLite databases - https://docs.python.org/3/library/sqlite3.html
  3. [SQLite] Date And Time Functions - https://www.sqlite.org/lang_datefunc.html
  4. SQLite Autoincrement - https://www.sqlite.org/autoinc.html