2013-11-20 14:59:40 UTC

After demonstrating a simple script that extracts the total number of unique IPs in an Apache server log, I decided that today I will make something a bit more complicated to show more of Haskell's power. The following literate Haskell program is an IP microscope. This microscope will zoom in on the unique IPs in an Apache log and show exactly which pages each IP has visited and the number of visits the IP made to each page.

Let's get started by naming our module and importing our libraries.

> module Main where
> import qualified Data.Map.Lazy as M
> import Control.Arrow

Parsed is a data type that holds the information extracted from each line that our program parses. '_visitorIp' is the IP address of a user in String form, and '_visitedPage' is the page that the IP address visited.

> data Parsed = Parsed {
>     _visitorIp :: String,
>     _visitedPage :: String
>   } deriving (Show, Eq, Ord)

I'll admit this parser is pretty weak and could fail in many ways, but it does the job for this script. If you wanted to make this parser more robust, you would definitely want to avoid using 'head' and (!!), because both are partial functions that throw an exception when the input is too short (exceptions can cause some major headaches if you aren't used to them). The parser works by splitting the line into words, taking the 'head' of that list (which should be the IP address), then taking the word at index 10, which is the eleventh word because (!!) counts from zero (this should be the web URL). The parser then packs the IP and URL into the "Parsed" data type. Remember, this parser is weaksauce and your mileage may vary if you choose to use it.

> parse :: String -> Parsed
> parse x = Parsed (parseIP x) (parsePage x)
>   where
>     parseIP = head . words 
>     parsePage = (!! 10) . words
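
For the curious, here is a minimal sketch of what a total version of the parser could look like. It is not part of the script above; 'parseSafe' and 'parseAll' are names I made up for illustration, and the Parsed type is repeated only so that the snippet stands on its own. It assumes the same word positions as 'parse'.

```haskell
import Data.Maybe (listToMaybe, mapMaybe)

-- Copy of the post's Parsed type, repeated so the sketch is self-contained.
data Parsed = Parsed
  { _visitorIp   :: String
  , _visitedPage :: String
  } deriving (Show, Eq, Ord)

-- Hypothetical total variant of 'parse': returns Nothing instead of
-- throwing when a line has fewer than eleven words.
parseSafe :: String -> Maybe Parsed
parseSafe line = Parsed <$> ip <*> page
  where
    fields = words line
    ip     = listToMaybe fields            -- first word, if any
    page   = listToMaybe (drop 10 fields)  -- word at index 10, if any

-- Parse a whole log, silently skipping lines that do not fit.
parseAll :: String -> [Parsed]
parseAll = mapMaybe parseSafe . lines
```

Whether silently dropping malformed lines is the right call depends on the log; collecting them for a warning would be just as reasonable.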

This 'countParsed' function is pretty tricky. It zips a list of "Parsed" data with the integer 1, then lazily builds a 'Map' (from Data.Map.Lazy) with the zipped list. While building the 'Map', it will automatically add up all the duplicate 'Parsed' entries before spitting out the 'Map'. You will end up with a Map of the Parsed data with an Integer count of how many times the Parsed data was duplicated across your entire input.

> countParsed :: [Parsed] -> M.Map Parsed Integer
> countParsed = M.fromListWith (+) . flip zip [1,1..]
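
If the counting trick is hard to picture, this small standalone sketch shows the same idea on any list of orderable values. 'countOccurrences' is a name invented for this example, and 'repeat 1' is just another way to spell the infinite list of ones.

```haskell
import qualified Data.Map.Lazy as M

-- Pair every element with 1, then let fromListWith add up the ones
-- whenever the same key shows up again.
countOccurrences :: Ord a => [a] -> M.Map a Integer
countOccurrences xs = M.fromListWith (+) (zip xs (repeat 1))

-- Example: each distinct character with the number of times it appears.
example :: [(Char, Integer)]
example = M.toList (countOccurrences "abca")
-- example == [('a',2),('b',1),('c',1)]
```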

To prevent myself from becoming confused, I wrapped up the previous functions into a 'calculate' function that composes them together and finally converts the data from Map back into a normal list. This is probably not the most optimized way to go about this, but it is pretty readable so I opted to keep it. Sometimes readability is more important than optimization, especially with a script that already runs within a reasonable amount of time for the given data set.

> calculate :: String -> [(Parsed, Integer)]
> calculate = M.toList . countParsed . map parse . lines
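
One possible alternative to the round trip through a list, sketched here under my own assumptions and not used anywhere in this script, is to build a nested Map keyed first by IP and then by page. The names 'Visits' and 'tally' are hypothetical, and I swapped in Data.Map.Strict so the counts are evaluated as they are added.

```haskell
import qualified Data.Map.Strict as M

-- Outer key: IP address. Inner key: page. Inner value: view count.
type Visits = M.Map String (M.Map String Integer)

-- Turn (ip, page) pairs into a nested tally in a single pass.
-- Each pair becomes a one-entry inner map, and duplicate IPs are
-- merged by adding the counts of matching pages.
tally :: [(String, String)] -> Visits
tally pairs =
  M.fromListWith (M.unionWith (+))
    [ (ip, M.singleton page 1) | (ip, page) <- pairs ]
```

This trades the step-by-step readability of the script for fewer intermediate structures, so it is a matter of taste which one to prefer.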

Using Control.Arrow's (&&&) fanout function, I further caress my data into a type that is more suitable for my needs. 'visitorIp' extracts the IP address from the Parsed data type, and 'pageViews' pairs the visited URL with its total page views inside a one-element list of tuples, so that these lists can be appended together later. This might feel a bit convoluted at the moment, but it will make for some easier function composition down the line. The 'Control.Lens' library would likely provide a simpler and more concise version of this function, but I chose not to use Lens for this script.

> formatData :: (Parsed, Integer) -> (String, [(String, Integer)])
> formatData = visitorIp &&& pageViews
> visitorIp :: (Parsed, Integer) -> String
> visitorIp (parsed,_) = _visitorIp parsed
> pageViews :: (Parsed, Integer) -> [(String, Integer)]
> pageViews (parsed, num) = [(_visitedPage parsed, num)]
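
If (&&&) is new to you: for plain functions it feeds one input to two functions and returns both results as a tuple, so (f &&& g) x is (f x, g x). Here is a tiny standalone sketch; 'lengthAndSum' and 'formatPair' are names made up for this example, and 'formatPair' uses a plain tuple in place of the Parsed record.

```haskell
import Control.Arrow ((&&&))

-- Run two functions over the same list and keep both answers.
lengthAndSum :: [Integer] -> (Int, Integer)
lengthAndSum = length &&& sum

-- The same shape as formatData, with ((ip, page), count) as input.
formatPair :: ((String, String), Integer) -> (String, [(String, Integer)])
formatPair = (fst . fst) &&& (\((_, page), n) -> [(page, n)])
```

For instance, lengthAndSum [1,2,3] gives (3,6).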

This 'converge' function is the climax of our script. It appends all of our page view data together and consolidates it under the parent IP. I am again abusing the Data.Map library to correctly move all the data into their assigned seats before preparing for output.

> converge :: [(String, [a])] -> M.Map String [a]
> converge = M.fromListWith (++)
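
Here is a small standalone sketch of the grouping step on made-up data ('groupPages' is just a local name for the same one-liner). One detail worth knowing: fromListWith calls the combining function with the newer value first, so with (++) the entries that appear later in the input end up at the front of each list.

```haskell
import qualified Data.Map.Lazy as M

-- Same idea as 'converge', specialised to page/count pairs.
groupPages :: [(String, [(String, Integer)])]
           -> M.Map String [(String, Integer)]
groupPages = M.fromListWith (++)

-- Made-up sample input: two entries for ip1, one for ip2.
sample :: [(String, [(String, Integer)])]
sample =
  [ ("ip1", [("/a", 2)])
  , ("ip2", [("/c", 1)])
  , ("ip1", [("/b", 1)])
  ]

grouped :: [(String, [(String, Integer)])]
grouped = M.toList (groupPages sample)
-- grouped == [("ip1",[("/b",1),("/a",2)]),("ip2",[("/c",1)])]
```

If the order of pages under each IP matters, 'flip (++)' would be one way to keep them in input order.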

Pretty printing is pretty easy for this kind of data. You just concat a bunch of stuff together and use 'putStrLn' to output it.

> prettyPrint (ip, pages) = putStrLn $ 
>     concat [ ip, " has visited these pages:\n"
>            , concatMap prettyPages pages
>            ]
>   where
>     prettyPages (page, views) = 
>           concat [ "    ["
>                  , show views
>                  , pluralView views
>                  , page
>                  , "\n"
>                  ]
>     pluralView 1 = " View] "
>     pluralView _ = " Views] "
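
If you would rather test the formatting without touching IO, one option is to split the string building from the printing. This is only a sketch of that idea, not what the script does; 'renderVisitor' is a hypothetical name and the layout is copied from 'prettyPrint'.

```haskell
-- Pure counterpart of prettyPrint: builds the text, prints nothing.
renderVisitor :: (String, [(String, Integer)]) -> String
renderVisitor (ip, pages) =
  concat [ ip, " has visited these pages:\n"
         , concatMap renderPage pages
         ]
  where
    renderPage (page, views) =
      concat [ "    [", show views, pluralView views, page, "\n" ]
    -- Singular label for exactly one view, plural otherwise.
    pluralView :: Integer -> String
    pluralView 1 = " View] "
    pluralView _ = " Views] "
```

With this in place, printing is just 'putStrLn . renderVisitor', and the rendering can be checked against literal strings.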

We are here at the end of our Haskell script. In the 'main' function, 'getContents' will grab all the data sent into the program via stdin, and 'calculations' will do all of the pure calculating and data manipulation. The last line of our 'main' function will prettyPrint each entry of our calculations. This function is pretty self-explanatory if you have made it this far into this blog post already.

> main = do
>   contents <- getContents
>   let calculations = M.toList . converge . map formatData . calculate $ contents
>   mapM_ prettyPrint calculations

So what happens when we actually run this script? Let's give it a whirl. I'll run this script by grepping for "" in the access_log and piping it into the runhaskell binary, which will interpret my script. (Note: For publishing online, I have censored the last digits of each IP number, and have replaced a great many results with a cool ellipsis).

[latermuse httpd]# grep "" access_log | runhaskell microscope.hs
117.32.153.XXX has visited these pages:

12.69.21.XXX has visited these pages:
    [1 View] ""
    [1 View] ""


So there you have it. A slightly more complicated command-line tool written in Haskell.

You can get the full source code of this script by clicking here.


Thanks /u/dave4420 for your code review!