{"id":225,"date":"2018-11-25T22:20:10","date_gmt":"2018-11-26T03:20:10","guid":{"rendered":"http:\/\/bhoey.com\/blog\/?p=225"},"modified":"2020-06-24T21:54:11","modified_gmt":"2020-06-25T01:54:11","slug":"parsing-apache-logs-into-a-database","status":"publish","type":"post","link":"https:\/\/bhoey.com\/blog\/parsing-apache-logs-into-a-database\/","title":{"rendered":"Parsing Apache Logs Into A SQL Database"},"content":{"rendered":"<h3>Background<\/h3>\n<p>Parsing Apache HTTPD access logs into a database structure allows you to easily run queries and reports against it to better understand web traffic and detect issues. By using the pip installable apache_log_parser<sup>[<a href=\"#ref_apache_log_parser\">1<\/a>]<\/sup> package and the sqlite3 python module<sup>[<a href=\"#ref_sqlite3\">2<\/a>]<\/sup> that comes as part of the standard python library, we can quickly parse these access logs and insert entries into a sqlite3 database.<\/p>\n<h3>Approach Notes<\/h3>\n<p>Couple things to note on the approach taken in the code:<\/p>\n<ul>\n<li>sqlite3 was selected for its native support and ease of use, but other DBMSes can be similarly used<\/li>\n<li>An auto-incremented id column was not included in the schema as sqlite3 does not recommend such a column unless necessary<sup>[<a href=\"#ref_sqlite3_autoinc\">4<\/a>]<\/sup>. You will likely want to add one for other database systems<\/li>\n<li>apache_log_parser produces a dictionary of apache log entry parts. Given this, the available dictionary keys are dependent on the log pattern of the access.log which supplied to the make_parser() function<\/li>\n<li>The columns in the code snippet are just a subset of the total keys\/values returned by the parser. You can easily see all keys and values generated from a given line by calling pprint() on it<\/li>\n<li>An additional 'date' key and value are added in the code (in a sqlite date function friendly way) to allow for date based queries and criteria<\/li>\n<\/ul>\n<h3>Code<\/h3>\n<pre class=\"brush: python; notranslate\">#!\/usr\/bin\/env python3\n\nimport sys\nimport sqlite3\nimport apache_log_parser\n\nif len(sys.argv) != 2:\n    print(\"Usage:\", sys.argv[0], \"\/path\/to\/access.log\")\n    exit(1)\n\nconn = sqlite3.connect('\/tmp\/logs.db')\ncur = conn.cursor()\n\ncur.execute(\"\"\"\n                CREATE TABLE IF NOT EXISTS logs (\n                    status INTEGER,\n                    request_method TEXT,\n                    request_url TEXT,\n                    date TEXT\n                )\n            \"\"\")\n\n# Pattern below is from the LogFormat setting in apache2.conf\/httpd.conf file\n# You will likely need to change this value to the pattern your system uses\nparser = apache_log_parser.make_parser(\n                   \"%h %l %u %t \\\"%r\\\" %&gt;s %O \\\"%{Referer}i\\\" \\\"%{User-Agent}i\\\"\"\n         )\n\nlog_file = sys.argv[1]\n\nwith open(log_file) as f:\n\n    for line in f:\n\n        d = parser(line)\n\n        # Line below adds minimalistic date stamp column \n        # in format that sqlite3 date functions can work with\n        d['date'] = d['time_received_datetimeobj'].date().isoformat()\n\n        cur.execute(\"\"\"\n                        INSERT INTO logs ( status,  request_method,  request_url,  date)\n                                  VALUES (:status, :request_method, :request_url, :date)\n                        \"\"\", d)\n\ncur.close()\n\nconn.commit();\nconn.close();\n<\/pre>\n<h3>Example Run<\/h3>\n<p>Sample access.log entries:<\/p>\n<pre class=\"brush: plain; notranslate\"># cat \/var\/log\/apache2\/access.log\n::1 - - [25\/Nov\/2018:20:13:28 -0500] \"GET \/index.html HTTP\/1.1\" 200 3380 \"-\" \"User agent string\"\n::1 - - [25\/Nov\/2018:12:15:26 -0500] \"GET \/notthere HTTP\/1.1\" 404 498 \"-\" \"User agent string\"\n<\/pre>\n<p>Run the code above:<\/p>\n<pre class=\"brush: bash; notranslate\">$ .\/httpd_log_parser.py \/var\/log\/apache2\/access.log\n<\/pre>\n<p>Examine the results:<\/p>\n<pre class=\"brush: plain; notranslate\">$ sqlite3 -header -column \/tmp\/logs.db 'SELECT * FROM logs'\nstatus      request_method  request_url  date      \n----------  --------------  -----------  ----------\n200         GET             \/index.html  2018-11-25\n404         GET             \/notthere    2018-11-25\n<\/pre>\n<h3>Sample Queries<\/h3>\n<p>After loading log entries you can start running useful queries on them.<\/p>\n<ul>\n<li>Show number of requests by response status:\n<pre class=\"brush: sql; notranslate\">  SELECT count(1) as nReqs,\n         status\n    FROM logs\nGROUP BY (status)\nORDER BY nReqs DESC;\n<\/pre>\n<pre class=\"brush: plain; notranslate\">nReqs       status    \n----------  ----------\n4           200       \n2           404   \n<\/pre>\n<\/li>\n<li>What are the most common 404'd URLs:\n<pre class=\"brush: sql; notranslate\">  SELECT count(1) as nReqs,\n         request_url\n    FROM logs\n   WHERE status = 404\nGROUP BY (request_url)\nORDER BY nReqs DESC;\n<\/pre>\n<pre class=\"brush: plain; notranslate\">nReqs       request_url\n----------  -----------\n2           \/notthere \n<\/pre>\n<\/li>\n<\/ul>\n<h3>Final Thoughts<\/h3>\n<p>The above method processes log files after they have been written out by the apache process. This can be particularly useful as log files can be offloaded to a different system and analyzed with your DB of choice without impacting the webserver performance. An alternative to this approach is to use something like <a href=\"http:\/\/www.outoforder.cc\/projects\/httpd\/mod_log_sql\/\">mod_log_sql<\/a> (aka <a href=\"https:\/\/packages.debian.org\/buster\/libapache2-mod-log-sql-mysql\">libapache2-mod-log-sql<\/a>) which is a purpose-built apache module that logs requests directly to a MySQL DB at request time. There will be trade-offs to weigh with either option but hopefully the information provided will give you a head start on your journey. Take care and good luck!<\/p>\n<h3>References<\/h3>\n<ol>\n<li><a id=\"ref_apache_log_parser\"><\/a>GitHub - rory\/apache-log-parser - <a href=\"https:\/\/github.com\/rory\/apache-log-parser\">https:\/\/github.com\/rory\/apache-log-parser<\/a><\/li>\n<li><a id=\"ref_sqlite3\"><\/a>[Python] sqlite3 \u2014 DB-API 2.0 interface for SQLite databases - <a href=\"https:\/\/docs.python.org\/3\/library\/sqlite3.html\">https:\/\/docs.python.org\/3\/library\/sqlite3.html<\/a><\/li>\n<li><a id=\"ref_sqlite3_datefunc\"><\/a>[SQLite] Date And Time Functions - <a href=\"https:\/\/www.sqlite.org\/lang_datefunc.html\">https:\/\/www.sqlite.org\/lang_datefunc.html<\/a><\/li>\n<li><a id=\"ref_sqlite3_autoinc\"><\/a>SQLite Autoincrement - <a href=\"https:\/\/www.sqlite.org\/autoinc.html\">https:\/\/www.sqlite.org\/autoinc.html<\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Parsing Apache HTTPD access logs into a database structure allows you to easily run queries and reports against it to better understand web traffic and detect issues. By using the pip installable apache_log_parser package and the sqlite3 python module that comes as part of the standard python library, we can quickly parse these access logs and insert entries into a sqlite3 database.&nbsp;<a href=\"https:\/\/bhoey.com\/blog\/parsing-apache-logs-into-a-database\/\">[Continue&nbsp;reading...] <span class=\"screen-reader-text\">Parsing Apache Logs Into A SQL Database<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":825,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[23],"tags":[47,86,48,85,49,87],"_links":{"self":[{"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/posts\/225"}],"collection":[{"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/comments?post=225"}],"version-history":[{"count":30,"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/posts\/225\/revisions"}],"predecessor-version":[{"id":828,"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/posts\/225\/revisions\/828"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/media\/825"}],"wp:attachment":[{"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/media?parent=225"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/categories?post=225"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bhoey.com\/blog\/wp-json\/wp\/v2\/tags?post=225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}