
Parsing Apache Logs Into A Database

Background

Parsing Apache HTTPD access logs into a database structure allows you to easily run queries and reports against them to better understand web traffic and detect issues. By using the pip-installable apache_log_parser[1] package and the sqlite3 module[2] that comes as part of the standard Python library, we can quickly parse these access logs and insert entries into a SQLite database.

Approach Notes

A couple of things to note about the approach taken in the code:

  • sqlite3 was selected for its native support and ease of use, but other DBMSes can be similarly used
  • An auto-incremented id column was not included in the schema as sqlite3 does not recommend such a column unless necessary[4]. You will likely want to add one for other database systems
  • apache_log_parser produces a dictionary of Apache log entry parts. Because of this, the available dictionary keys depend on the log pattern of the access.log that is supplied to the make_parser() function
  • The columns in the code snippet are just a subset of the total keys/values returned by the parser. You can easily see all keys and values generated from a given line by calling pprint() on it, as in the sketch after this list
  • An additional ‘date’ key and value are added in the code (in a format SQLite’s date functions can work with[3]) to allow for date-based queries and criteria
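
For example, here is a minimal sketch that dumps everything the parser extracts from one of the sample log lines shown later in this post (assuming the same LogFormat pattern as the main script below):

from pprint import pprint

import apache_log_parser

parser = apache_log_parser.make_parser(
    "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\""
)

line = '::1 - - [25/Nov/2018:20:13:28 -0500] "GET /index.html HTTP/1.1" 200 3380 "-" "User agent string"'

# pprint the full dictionary of parts so you can pick
# which keys to map to database columns
pprint(parser(line))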

Code

#!/usr/bin/env python3

import sys
import sqlite3
import apache_log_parser

if len(sys.argv) != 2:
    print("Usage:", sys.argv[0], "/path/to/access.log")
    sys.exit(1)

conn = sqlite3.connect('/tmp/logs.db')
cur = conn.cursor()

cur.execute("""
                CREATE TABLE IF NOT EXISTS logs (
                    status INTEGER,
                    request_method TEXT,
                    request_url TEXT,
                    date TEXT
                )
            """)

# Pattern below is from the LogFormat setting in apache2.conf/httpd.conf file
# You will likely need to change this value to the pattern your system uses
parser = apache_log_parser.make_parser(
    "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\""
)

log_file = sys.argv[1]

with open(log_file) as f:

    for line in f:

        d = parser(line)

        # Line below adds minimalistic date stamp column 
        # in format that sqlite3 date functions can work with
        d['date'] = d['time_received_datetimeobj'].date().isoformat()

        cur.execute("""
                        INSERT INTO logs ( status,  request_method,  request_url,  date)
                                  VALUES (:status, :request_method, :request_url, :date)
                        """, d)

cur.close()

conn.commit()
conn.close()

Example Run

Sample access.log entries:

# cat /var/log/apache2/access.log
::1 - - [25/Nov/2018:20:13:28 -0500] "GET /index.html HTTP/1.1" 200 3380 "-" "User agent string"
::1 - - [25/Nov/2018:12:15:26 -0500] "GET /notthere HTTP/1.1" 404 498 "-" "User agent string"

Run the code above:

$ ./httpd_log_parser.py /var/log/apache2/access.log

Examine the results:

$ sqlite3 -header -column /tmp/logs.db 'SELECT * FROM logs'
status      request_method  request_url  date      
----------  --------------  -----------  ----------
200         GET             /index.html  2018-11-25
404         GET             /notthere    2018-11-25

Sample Queries

After loading log entries you can start running useful queries on them.

  • Show number of requests by response status:
      SELECT count(1) as nReqs,
             status
        FROM logs
    GROUP BY (status)
    ORDER BY nReqs DESC;
    
    nReqs       status    
    ----------  ----------
    4           200       
    2           404   
    
  • Show the most common 404’d URLs:
      SELECT count(1) as nReqs,
             request_url
        FROM logs
       WHERE status = 404
    GROUP BY (request_url)
    ORDER BY nReqs DESC;
    
    nReqs       request_url
    ----------  -----------
    2           /notthere 
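
  • Show request counts per day (a sketch using the added ‘date’ column together with SQLite’s date functions[3]; the seven-day window is an arbitrary choice):
      SELECT date,
             count(1) as nReqs
        FROM logs
       WHERE date >= date('now', '-7 days')
    GROUP BY date
    ORDER BY date;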
    

References

  1. GitHub – rory/apache-log-parser – https://github.com/rory/apache-log-parser
  2. Python – sqlite3 — DB-API 2.0 interface for SQLite databases – https://docs.python.org/3/library/sqlite3.html
  3. SQLite – Date And Time Functions – https://www.sqlite.org/lang_datefunc.html
  4. SQLite – Autoincrement – https://www.sqlite.org/autoinc.html

Extracting RAW Photo Exif Data With Python

Background

I’ve always enjoyed taking pictures. Before smartphones put a camera in everyone’s pocket, I would use whatever (usually cheap) camera was handy. More recently I decided to get more into taking “real photographs” instead of just pictures. As anyone who has taken the dive into photography will tell you, there is much more complexity than you would normally expect, particularly compared to point-and-shoot or smartphone cameras. As part of the journey, you will inevitably need to work with RAW image files. While RAW files provide much more information to work with, you may find support for different manufacturers’ RAW formats lacking in tools and libraries you have used in the past. This post briefly covers extracting Exif metadata from most RAW file formats with Python.

pyexiv2

We will be using pyexiv2/py3exiv2 for this example. These are Python wrappers around the exiv2 C++ library, which does the actual image file processing.

  • pyexiv2 is the python 2.x version of the library
  • py3exiv2 is the python 3.x version

The code snippet below will work with either pyexiv2 or py3exiv2.

Setup

pyexiv2

For Python 2.x, the current Debian stable as of this writing (Debian 9 Stretch) already includes pyexiv2 in its package repository, so you only need to run:

apt-get install python-pyexiv2

py3exiv2

For Python 3.x, since py3exiv2 isn’t available from apt, you’ll need to build it locally. To do this, first install some prerequisites:

apt-get install \
        build-essential \
        python-all-dev \
        libboost-python-dev \
        libexiv2-dev

Then use pip3 to download, build and install py3exiv2:

pip3 install py3exiv2
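
To confirm the build succeeded, you can try importing the module (note that py3exiv2 installs under the module name pyexiv2); this exits silently if everything compiled cleanly:

python3 -c 'import pyexiv2'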

exiv2_demo.py

The following code snippet will:

  • Read a file given on the command line
  • Print all Exif tags found in the file
  • Print a specific tag value

Code

#!/usr/bin/env python3

# usage: exiv2_demo.py <file>

import sys
import pyexiv2

if len(sys.argv) != 2:
    print("Usage:", sys.argv[0], "<file>")
    sys.exit(1)

md = pyexiv2.ImageMetadata(sys.argv[1])
md.read()

# print all exif tags in file
for m in md.exif_keys:
    print(m + "=" + str(md[m]))

# print a specific tag, converting its Rational value to a float
aperture = float(md['Exif.Photo.FNumber'].value)
print("Aperture: F{}".format(aperture))

Note 1: One thing to keep in mind is that different Exif keys return different value types. For example, Exif.Photo.FNumber returns a Rational value, which was converted to a float above to print the aperture the way it is commonly expressed. The ‘Exif metadata reference tables’ link in the References section has a list of tags and their respective types.
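
As another Rational example, here is a small sketch (reusing the md object from the script above) that prints Exif.Photo.ExposureTime as a conventional shutter speed; the tag’s value is a fractional type, so formatting it directly renders forms like 1/125:

# print exposure time as a conventional shutter speed, e.g. "1/125s"
exposure = md['Exif.Photo.ExposureTime'].value
print("Shutter speed: {}s".format(exposure))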

Note 2: Different RAW file formats (.ARW, .CR2, .NEF, etc.) will often have different sets of Exif tags, particularly for less common ones (for example Exif.Photo.LensModel).
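
Given that, it’s safer to check that a tag exists before reading it. A minimal sketch, again reusing md from the script above:

# guard against tags that a particular RAW format may not carry
if 'Exif.Photo.LensModel' in md.exif_keys:
    print("Lens: {}".format(md['Exif.Photo.LensModel'].value))
else:
    print("No lens model tag in this file")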

Example Run

$ ./exiv2_demo.py test.sony.ARW

Exif.Photo.ExposureTime=<Exif.Photo.ExposureTime [Rational] = 1/125>
Exif.Photo.FNumber=<Exif.Photo.FNumber [Rational] = 40/10>
...
Exif.Photo.ISOSpeedRatings=<Exif.Photo.ISOSpeedRatings [Short] = 1600>
Aperture: F4.0

Dockerfile

If you’d like to experiment in a Docker sandbox, the following Dockerfile builds an image with py3exiv2 installed and runs the script against a sample RAW file:

FROM python:3.6-stretch

RUN apt-get update && \
    apt-get install -y \
        build-essential \
        python-all-dev \
        libboost-python-dev \
        libexiv2-dev 

COPY \
    exiv2_demo.py \
    test.sony.ARW \
    ./

RUN pip3 install py3exiv2

CMD [ "python3", "./exiv2_demo.py", "test.sony.ARW" ]

References

  1. Exiv2 project: http://www.exiv2.org/
  2. py3exiv2 project: https://launchpad.net/py3exiv2
  3. EXIF metadata reference tables: http://www.exiv2.org/tags.html
  4. EXIF tag value chart: https://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html