RSS Feed
2013-12-10 06:27:49 UTC
Ron

Suppose you want to find the most occuring element within a data stream. A simple brute force algorithm would require counting the occurances of each element, then calculating which one had the highest count. This could cause some non-trivial memory problems down the line if the data stream is large. What if your data stream is infinite? It is not possible to do a complete pass over an infinite stream. Because of this, it is also impossible to find "the most occuring element" in an infinite data stream while using a brute force algorithm.

Since it is not possible to find the "most occuring element" in an infinite data stream, the best we can do is approximate it. Luckily there is a wonderful algorithm called space-saving that can approximate the top-k elements in an infinite stream.

Let's run through the algorithm and test it out on some data. I chose Data.Map.Lazy for this example for its ubiquity and ease of use. If you are looking for the best performance, then you would likely need to use a different data structure. Start by naming the module and importing Data.Map.Lazy and a few helper functions from Data.Char and Data.List.

> module Main where
> import qualified Data.Map.Lazy as M
> import Data.Char (toLower)
> import Data.List (minimumBy, maximumBy)

Now we can define the type for our spaceSaving function. The function first requires an Int called k, which defines how many elements the algorithm will keep track of. Next it requires a stream of data of the polymorphic a type; make sure your a has an instance of Ord. Finally the algorithm outputs a Map of data, containing the top-k elements, each paired with a score.

> spaceSaving :: (Ord a) 
>             => Int              -- The amount of results to keep
>             -> [a]              -- The stream 
>             -> M.Map a Integer   -- A map of the top elements as the key, and their count

Our entry into the space-saving function requires a k value and a stream of data. Upon being set into motion, the algorithm begins analyzing the data stream. It keeps track of a maximum k elements and stores them in a map. If there are less than k elements in the map, then it will add the current element into the map and give it a score of 1. If it sees a duplicate element that already exists in the map, then it will increment the score for that specific element. It will continue doing this until the map has k elements in it. Once the map is full, updating of duplicate elements continues as before, however it then runs the "updateLowestKey" function upon detection of an element not currently in the map.

> spaceSaving k = spaceSave M.empty 
>   where
>     spaceSave m [] = m
>     spaceSave m (s:ss) 
>       | M.member s m = update $ M.adjust (+ 1) s m 
>       | M.size m < k = update $ M.insert s 1 m 
>       | otherwise    = update $ updateLowestKey m s 
>      where
>       update f = spaceSave f ss

The function updateLowestKey takes the current lazy map, and the current element, then returns an updated map.

> updateLowestKey :: (Ord k) 
>                  => M.Map k Integer -- Input map
>                  -> k              -- Input Element
>                  -> M.Map k Integer -- Updated map 

The updateLowestKey function takes note of the element with the current lowest score in the map. It then replaces the lowest scored element with the element passed to the updateLowestKey function. It will increment this element by 1 before passing it back to the spaceSave function.

> updateLowestKey m e = M.insert e (lowestValue + 1) lowestKeyRemoved
>   where
>     (lowestKey, lowestValue) = minimum' m
>     lowestKeyRemoved = M.delete lowestKey m

There are two utility functions for this algorithm to work correctly with Data.Map.Lazy. We need a function to calculate the minimum and maximum values from a map then return the found element as a singleton. I found this easiest by converting the map to a list, then using compare to find the minimum or maximum value. If you are using a large value for k, then you might run into a bottleneck with these utility functions. Such a bottleneck would likely require ditching Data.Map.Lazy and either choosing a new data structure or rolling your own.

> minimum' :: (Ord a) => M.Map k a -> (k, a)
> minimum' = minimumBy comp . M.toList
>
> maximum' :: (Ord a) => M.Map k a -> (k, a)
> maximum' = maximumBy comp . M.toList
>
> comp (_, x) (_, y) = compare x y

Our main function uses getContents to read lazily from stdin, then splits the input into words while making everything lower case. We make everything lower case because spaceSaving is case-sensitive. Depending on your input data, you will want to write your own pFile function. After the calculation is complete, we print the most occuring element to stdout.

> main :: IO ()
> main = do
>   contents <- getContents
>   let pFile = words . map toLower $ contents
>   print . maximum' $ spaceSaving 100 pFile

Now lets test it! You can compile the code like this:

[ron ss]# ghc -O2 -o spacesaving Spacesaving.hs

And we can test the code by piping in some text, in this scenario I will use the "Chinese Classics".

[ron ss]# curl latermuse.com/classic.txt | ./spacesaving

The program should spit this out:

fromList [("the",1172)]

This means the word "the" is probably the most occuring word in our text. If you want to see the entire map, you can modify the print line in the main function and remove the maximum' function. Experimenting with k can have different results and it is worthwhile to test the algorithm using different values for k.

For anyone interested in working more with this algorithm, there is a way to record and calculate the error ratio of chosen elements. You could also implement the algorithm into a state monad, or find a way to query the data simultaneously as it is being computed. You might also consider writing an IO version of spaceSaving that writes your current maximum element into stdout or into a file. Enjoy!

Grab this code on Github.

View Comments on Reddit.

Thanks to winterkoninkje for awesome refactoring

Archives