Tag Archives: httpd

Parsing Apache Logs Into A Database

Background

Parsing Apache HTTPD access logs into a database structure allows you to easily run queries and reports against it to better understand web traffic and detect issues. By using the pip installable apache_log_parser[1] package and the sqlite3 python module[2] that comes as part of the standard python library, we can quickly parse these access logs and insert entries into a sqlite3 database.

Approach Notes

Couple things to note on the approach taken in the code:

  • sqlite3 was selected for its native support and ease of use, but other DBMSes can be similarly used
  • An auto-incremented id column was not included in the schema as sqlite3 does not recommend such a column unless necessary[4]. You will likely want to add one for other database systems
  • apache_log_parser produces a dictionary of apache log entry parts. Given this, the available dictionary keys are dependent on the log pattern of the access.log which supplied to the make_parser() function
  • The columns in the code snippet are just a subset of the total keys/values returned by the parser. You can easily see all keys and values generated from a given line by calling pprint() on it
  • An additional ‘date’ key and value are added in the code (in a sqlite date function friendly way) to allow for date based queries and criteria

Code

#!/usr/bin/env python3

import sys
import sqlite3
import apache_log_parser

if len(sys.argv) != 2:
    print("Usage:", sys.argv[0], "/path/to/access.log")
    exit(1)

conn = sqlite3.connect('/tmp/logs.db')
cur = conn.cursor()

cur.execute("""
                CREATE TABLE IF NOT EXISTS logs (
                    status INTEGER,
                    request_method TEXT,
                    request_url TEXT,
                    date TEXT
                )
            """)

# Pattern below is from the LogFormat setting in apache2.conf/httpd.conf file
# You will likely need to change this value to the pattern your system uses
parser = apache_log_parser.make_parser(
                   "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\""
         )

log_file = sys.argv[1]

with open(log_file) as f:

    for line in f:

        d = parser(line)

        # Line below adds minimalistic date stamp column 
        # in format that sqlite3 date functions can work with
        d['date'] = d['time_received_datetimeobj'].date().isoformat()

        cur.execute("""
                        INSERT INTO logs ( status,  request_method,  request_url,  date)
                                  VALUES (:status, :request_method, :request_url, :date)
                        """, d)

cur.close()

conn.commit();
conn.close();

Example Run

Sample access.log entries:

# cat /var/log/apache2/access.log
::1 - - [25/Nov/2018:20:13:28 -0500] "GET /index.html HTTP/1.1" 200 3380 "-" "User agent string"
::1 - - [25/Nov/2018:12:15:26 -0500] "GET /notthere HTTP/1.1" 404 498 "-" "User agent string"

Run the code above:

$ ./httpd_log_parser.py /var/log/apache2/access.log

Examine the results:

$ sqlite3 -header -column /tmp/logs.db 'SELECT * FROM logs'
status      request_method  request_url  date      
----------  --------------  -----------  ----------
200         GET             /index.html  2018-11-25
404         GET             /notthere    2018-11-25

Sample Queries

After loading log entries you can start running useful queries on them.

  • Show number of requests by response status:
      SELECT count(1) as nReqs,
             status
        FROM logs
    GROUP BY (status)
    ORDER BY nReqs DESC;
    
    nReqs       status    
    ----------  ----------
    4           200       
    2           404   
    
  • What are the most common 404’d URLs:
      SELECT count(1) as nReqs,
             request_url
        FROM logs
       WHERE status = 404
    GROUP BY (request_url)
    ORDER BY nReqs DESC;
    
    nReqs       request_url
    ----------  -----------
    2           /notthere 
    

References

  1. GitHub – rory/apache-log-parser – https://github.com/rory/apache-log-parser
  2. [Python] sqlite3 — DB-API 2.0 interface for SQLite databases – https://docs.python.org/3/library/sqlite3.html
  3. [SQLite] Date And Time Functions – https://www.sqlite.org/lang_datefunc.html
  4. SQLite Autoincrement – https://www.sqlite.org/autoinc.html