
Lumberjack – lumberjack.nginx (version 0.1.0)

As I posted last time, lumberjack is my start of a log line analyzer/visualizer project in Clojure. This write up will cover the version 0.1.0 lumberjack.nginx namespace.

Since this is version 0.1.0, and to get it out the door, I am only parsing Nginx log lines in the following format, as all of the log lines I have needed to parse so far match it.

173.252.110.27 - - [18/Mar/2013:15:20:10 -0500] "PUT /logon" 404 1178 "http://shop.github.com/products/octopint-set-of-2" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"

The function nginx-logs takes a sequence of Nginx log filenames and converts them into a sequence of hashes, one hash per log line, by calling process-logfile on each file.

(defn nginx-logs [filenames]
  (mapcat process-logfile filenames))
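
As a quick usage sketch (the filenames here are made up), calling it with a vector of log filenames yields one hash per log line across all of the files:

;; Hypothetical filenames; the result is a single flattened sequence of
;; log line hashes, since mapcat concatenates the per-file sequences.
(nginx-logs ["access.log" "access.log.1"])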

The function process-logfile takes a single filename, reads the file in with slurp and splits it into lines, and then maps parse-line over each of those lines.

(defn- logfile-lines [filename]
  (string/split-lines (slurp filename)))

(defn process-logfile [filename]
  (map parse-line (logfile-lines filename)))

At this point this is sufficient for what I need, but I have created an issue on the GitHub project to address large log files, with the ability to lazily read in the lines so that the whole log file does not have to reside in memory.
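
As a rough idea of the direction that issue could take, something like line-seq over a reader would hand back the lines lazily instead of slurping the whole file; this is just a sketch, not what is in 0.1.0:

(require '[clojure.java.io :as io])

;; Sketch only: line-seq returns a lazy sequence, so lines are read as they
;; are consumed rather than the whole file being slurped into memory.
;; Note the reader is not closed here; a real version would need to manage
;; closing it once the sequence has been fully consumed.
(defn- lazy-logfile-lines [filename]
  (line-seq (io/reader filename)))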

The function parse-line holds a regex and matches each line against that pattern with re-find. It then associates each piece of the match into a hash, using the vector parts, which holds a keyword for each part of the log entry that the regex captures. This is done by reducing over the indexed keywords, starting from an empty hash, and looking up each keyword's index in match, the result of the re-find.

(def parts [:original
            :ip
            :timestamp
            :request-method
            :request-uri
            :status-code
            :response-size
            :referrer])

(defn parse-line [line]
  (let [parsed-line {}
        pattern #"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})? - - \[(.*)\] \"(\w+) ([^\"]*)\" (\d{3}) (\d+) \"([^\"]*)\".*"
        match (re-find pattern line)]
    (reduce (fn [memo [idx part]]
              (assoc memo part (nth match idx)))
            parsed-line (map-indexed vector parts))))
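
For the sample log line at the top of this post (bound here to a hypothetical sample-line), parse-line produces a hash along these lines, with the values falling straight out of the regex's capture groups and :original holding the full match (abbreviated below):

(parse-line sample-line)
;; => {:original       "173.252.110.27 - - [18/Mar/2013:15:20:10 -0500] \"PUT /logon\" 404 1178 ..."
;;     :ip             "173.252.110.27"
;;     :timestamp      "18/Mar/2013:15:20:10 -0500"
;;     :request-method "PUT"
;;     :request-uri    "/logon"
;;     :status-code    "404"
;;     :response-size  "1178"
;;     :referrer       "http://shop.github.com/products/octopint-set-of-2"}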

Looking at this again a few days later, I went and created an issue to pull the definition of pattern out into its own definition, outside of the let and even outside the parse-line function. I also want to go back and remove parsed-line from the let, as it does not need to be declared there; I can just pass the empty hash directly to the reduce. It was set up that way before I refactored to a reduce, when I was associating keys one at a time against the indices of match as I added parts of the log entry.
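
For the curious, that refactor would end up looking something like the sketch below; the name nginx-line-pattern is just a placeholder, and this is the direction of the issue rather than what is in 0.1.0:

;; Sketch of the refactor: the pattern lives in its own def, and the empty
;; hash is passed straight to reduce instead of being bound in the let.
(def nginx-line-pattern
  #"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})? - - \[(.*)\] \"(\w+) ([^\"]*)\" (\d{3}) (\d+) \"([^\"]*)\".*")

(defn parse-line [line]
  (let [match (re-find nginx-line-pattern line)]
    (reduce (fn [memo [idx part]]
              (assoc memo part (nth match idx)))
            {} (map-indexed vector parts))))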

Any comments on this are welcome, and I will be posting details on the other files soon as well.

Thanks,
–Proctor

Lumberjack – Log file parsing and analysis for Clojure

I have just pushed a 0.1.0 version of a new project called Lumberjack. The goal is to be a library of functions to help parse and analyze log files in Clojure.

At work I occasionally have to pull down log files from our Nginx web servers and do some visualization of them. I decided that this could be a useful project to play with to help me on my journey with Clojure and open source software.

This library reads in a sequence of Nginx log files and parses them into a data structure that can be analyzed. It also provides functionality to visualize the data as a set of time series graphs using Incanter, as that is the only Clojure graphing library I have come across so far.
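
To give an idea of the data flow, here is a rough sketch of feeding the parsed entries straight into Incanter's time-series-plot; the filenames are made up, and this bypasses whatever graphing helpers end up in the library itself:

(require '[lumberjack.nginx :as nginx]
         '[incanter.core :as incanter]
         '[incanter.charts :as charts])
(import '[java.text SimpleDateFormat]
        '[java.util Locale])

;; Hypothetical filenames; each entry is a hash with a :timestamp string.
(def entries (nginx/nginx-logs ["access.log" "access.log.1"]))

;; Parse the Nginx timestamp format into milliseconds since the epoch.
(def timestamp-format (SimpleDateFormat. "dd/MMM/yyyy:HH:mm:ss Z" Locale/US))

(defn timestamp-millis [entry]
  (.getTime (.parse timestamp-format (:timestamp entry))))

;; Bucket requests by minute and view the counts as an Incanter time series.
(let [minute-ms (* 60 1000)
      by-minute (frequencies (map #(* minute-ms (quot (timestamp-millis %) minute-ms)) entries))
      xs        (sort (keys by-minute))
      ys        (map by-minute xs)]
  (incanter/view
    (charts/time-series-plot xs ys
                             :x-label "Time"
                             :y-label "Requests per minute")))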

A short, and not at all comprehensive, list of things I would like to support in the future that quickly come to mind:

  • Support using a BufferedReader for very large log files, so the whole file does not have to reside in memory before parsing and the library can take advantage of laziness.
  • The ability to construct records with only a subset of the parsed data, such as request type and timestamp.
  • The ability to parse log lines of different types, e.g. Apache, IIS, or other formats.
  • Additional graphs other than time series, e.g. bar graphs to show the number of hits by IP address (see the sketch after this list).
  • Possibility of using futures, or another concurrency mechanism, to do some of the parsing and transformation of log lines into the data structures when working on large log files.
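
To show the shape of that bar graph idea, a sketch using Incanter's bar-chart together with frequencies could look like the following; nothing like this is in the library yet, and entries is assumed to be the sequence of parsed log hashes from nginx-logs as above:

(require '[incanter.core :as incanter]
         '[incanter.charts :as charts])

;; Count hits per IP address and show them as a bar chart.
(let [hits-by-ip (frequencies (map :ip entries))
      ips        (keys hits-by-ip)
      counts     (map hits-by-ip ips)]
  (incanter/view
    (charts/bar-chart ips counts
                      :x-label "IP address"
                      :y-label "Hits")))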

The above are just some of my thoughts on things that might fit well as updates to this as I start to use it more and flesh out more use cases.

I would love comments on my code, and any other feedback that you may have. This is still early but I wanted to put something out there that might be of some use to others as well.

You can find Lumberjack on my Github account at https://github.com/stevenproctor/lumberjack.

Thanks for your comments and support.
–Proctor