Background
Parsing Apache HTTPD access logs into a database structure allows you to easily run queries and reports against it to better understand web traffic and detect issues. By using the pip installable apache_log_parser[1] package and the sqlite3 python module[2] that comes as part of the standard python library, we can quickly parse these access logs and insert entries into a sqlite3 database.
Approach Notes
Couple things to note on the approach taken in the code:
- sqlite3 was selected for its native support and ease of use, but other DBMSes can be similarly used
- An auto-incremented id column was not included in the schema as sqlite3 does not recommend such a column unless necessary[4]. You will likely want to add one for other database systems
- apache_log_parser produces a dictionary of apache log entry parts. Given this, the available dictionary keys are dependent on the log pattern of the access.log which supplied to the make_parser() function
- The columns in the code snippet are just a subset of the total keys/values returned by the parser. You can easily see all keys and values generated from a given line by calling pprint() on it
- An additional 'date' key and value are added in the code (in a sqlite date function friendly way) to allow for date based queries and criteria
Code
#!/usr/bin/env python3 import sys import sqlite3 import apache_log_parser if len(sys.argv) != 2: print("Usage:", sys.argv[0], "/path/to/access.log") exit(1) conn = sqlite3.connect('/tmp/logs.db') cur = conn.cursor() cur.execute(""" CREATE TABLE IF NOT EXISTS logs ( status INTEGER, request_method TEXT, request_url TEXT, date TEXT ) """) # Pattern below is from the LogFormat setting in apache2.conf/httpd.conf file # You will likely need to change this value to the pattern your system uses parser = apache_log_parser.make_parser( "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" ) log_file = sys.argv[1] with open(log_file) as f: for line in f: d = parser(line) # Line below adds minimalistic date stamp column # in format that sqlite3 date functions can work with d['date'] = d['time_received_datetimeobj'].date().isoformat() cur.execute(""" INSERT INTO logs ( status, request_method, request_url, date) VALUES (:status, :request_method, :request_url, :date) """, d) cur.close() conn.commit(); conn.close();
Example Run
Sample access.log entries:
# cat /var/log/apache2/access.log ::1 - - [25/Nov/2018:20:13:28 -0500] "GET /index.html HTTP/1.1" 200 3380 "-" "User agent string" ::1 - - [25/Nov/2018:12:15:26 -0500] "GET /notthere HTTP/1.1" 404 498 "-" "User agent string"
Run the code above:
$ ./httpd_log_parser.py /var/log/apache2/access.log
Examine the results:
$ sqlite3 -header -column /tmp/logs.db 'SELECT * FROM logs' status request_method request_url date ---------- -------------- ----------- ---------- 200 GET /index.html 2018-11-25 404 GET /notthere 2018-11-25
Sample Queries
After loading log entries you can start running useful queries on them.
- Show number of requests by response status:
SELECT count(1) as nReqs, status FROM logs GROUP BY (status) ORDER BY nReqs DESC;
nReqs status ---------- ---------- 4 200 2 404
- What are the most common 404'd URLs:
SELECT count(1) as nReqs, request_url FROM logs WHERE status = 404 GROUP BY (request_url) ORDER BY nReqs DESC;
nReqs request_url ---------- ----------- 2 /notthere
Final Thoughts
The above method processes log files after they have been written out by the apache process. This can be particularly useful as log files can be offloaded to a different system and analyzed with your DB of choice without impacting the webserver performance. An alternative to this approach is to use something like mod_log_sql (aka libapache2-mod-log-sql) which is a purpose-built apache module that logs requests directly to a MySQL DB at request time. There will be trade-offs to weigh with either option but hopefully the information provided will give you a head start on your journey. Take care and good luck!
References
- GitHub - rory/apache-log-parser - https://github.com/rory/apache-log-parser
- [Python] sqlite3 — DB-API 2.0 interface for SQLite databases - https://docs.python.org/3/library/sqlite3.html
- [SQLite] Date And Time Functions - https://www.sqlite.org/lang_datefunc.html
- SQLite Autoincrement - https://www.sqlite.org/autoinc.html