Monday, 24 November 2014

Generate a subset of pseudo-random permutations in non mega-huge time

One of the neat things about sending packages via sendle is that you don't have to print out a label or even know the address of your recipient (this is done so the recipient can keep their address private if they want). Instead, we give the sender a pair of unique code-names to put onto the package. So instead of J Random Shop-assistant at EvilCorp getting your full address, instead you write something like "From: Golden Lion To: Red Tiger".

It's fun, as well as effective - like secret-agent names.

but it does present us with an interesting problem - how do we generate unique codenames for potentially many, many packages.

Codenames are assigned for every package, not just for every person - that way your humble shop assistant at EvilCorp can't get any info on your shopping habits based on your codename cropping up time and again either. So that means we need lots and lots of codenames, and they must be pretty randomised too.

As mentioned, our codenames are basically of the form "<adjective> <noun>", and we have a medium-sized stock of adjectives and nouns to build from.

With 100 adjectives and 50 nouns, that's a total of 5000 possible codephrases... but the codephrase generated could be used as either the "from" or "to" (as long as it's not reused for a while), so choosing a pair of such codephrases from a set of 5000 (without repetition) gives us 5000*4999 (or 24995000) possible permutations.

That's a lot.

It takes up a fair bit of db space, and it would take quite some time to generate all the possibilities and randomise them too.

So we decided not to do that... but instead, to generate a much smaller sub-set, on the fly (ie when we got close to running out).

Which leaves us with a different problem... how to do that effectively.

The solution that was in place before I arrived, was done in SQL (because it's always faster to let the db do it, right?), and involved doing a four-way cross-join on the adjectives+nouns... and then ordering them randomly... and using a limit to only actually keep a set amount... and then only inserting into our db those that weren't already there (done using an insert where select = 0)

Sadly, when performance testing, this was discovered to be exceedingly slow... ie when it was triggered, it would take 15min or so to run, causing all other "create a package" processes to block waiting for a new code to be generated... which was unacceptable. Sure, we could run this overnight instead, but hey, what if a Good Problem occurred and we had so many orders one day that we ran out in between overnight-runs?

So instead I came up with a solution that would allow us to quickly generate a small subset of new, randomised permutations without killing throughput as badly. The solution takes about 4 seconds to generate 2000 new permutations, which is enough to get by until it's needed again.

In order to explain what it is, I first need to explain a few non-solutions we hit upon to begin with and why they didn't work for us

Solution 1: random indexing into the arrays

It seems like such a simple problem - we have a small set of adjectives and a small set of nouns, just randomly pick one of each and see if it's not already one that we have... create it if not. so the first solution was just to use "rand" on the size of each array to pick one of each by index.

This works reasonably well when the data are sparse... but when you start filling in those gaps, a lot of time is spent thrashing about randomly trying to find a pair that hasn't already been chosen. And there was no way of simply saying "you've already picked that pair" so it would pick the same pairs over and over and over and keep trying them against the set that had already been created.

This became obviously ridiculous in the specs where I chose two nouns and two adjectives... and it spent roughly 16min thrashing back and forth between "blue dog" and "red cat" trying to randomly guess the last remaining pair ("red dog").

I didn't want that to happen while angry users were waiting to post their parcel in the middle of the xmas rush... we needed a solution that would guarantee that it would not endlessly repeat useless, already-tried pairs.

Solution 2: random ordering of the arrays

The next obvious idea was: shuffle the two arrays so they are (pseudo)randomly sorted.... then just loop through them grabbing the next one in line to try and find_or_create.

eg, using a simple nested loop like:

   adjectives.shuffle.each do |adj|
      nouns.shuffle.each do |noun|
         Code.find_or_create(:code => "#{adj} #{noun}")

This solution works. It guarantees that you only try each option once, and it's very quick. The downside is that the randomness isn't random enough.

Now lets say that after shuffling, your first adjective is "grey"... you now loop through every noun and attach the adjective "grey" to it... and end up with a run of very similar code-phrases like: "grey dog", "grey cat", "grey bird", "grey elephant", "grey tiger", "grey wolf", "grey hippo"... not really very random. Plus, don't forget I'm using pairs of code-phrases. So we'd be getting pairs of code-phrases such as "from: grey dog to: grey cat"

Now right now, we don't have anything set up where the user could potentially abuse this obviously guessable sequence to get ahold of somebody else's package... however someday we may...

but what we were more concerned about was that if all the packages have code words that are very similar... there is grand scope for confusion and mis-labelling - which has happened to some of our customers before, and causes everyone no end of heartache and pain.

What we want is not only for individual code-phrases to be random, but for the sequence to also be kind of random too...

Solution 3: rolling loops

So I came up with a solution I've called "rolling loops".[*]

Let i and j be the loop counters. Normally, you'd match i against every value of j, before then incrementing i again.... but we want both i and j to change every single iteration of j... but if we increment both i and j for each step of the nested loop, how do we roll on to the next loop?

Well, what we *really* want is to loop through every possible permutation of the two arrays... so how about I expand out some of those and show you how to get to the solution.

Lets suppose you have two arrays that are just the numbers 0..2 The normal way to loop gives you a complete set of permutations:

[[0,0] [0,1] [0,2]] [[1,0] [1,1] [1,2]] [[2,0] [2,1] [2,2]]

If we used these pairs of numbers as indexes into the two arrays of words.. we'd get every single pairing of words, but as you can see, in the first set, every word-pair would begin with the first adjective, and in the second set, every code-pair would begin with the section adjective and so on - just as described as a problem above.

But there's another way of grouping the permutations:

[[0,0] [1,1] [2,2]] [[0,1] [1,2] [2,0]] [[0,2] [1,0] [2,1]]

In the first subset, we have numbers that are the same as each other, in the second set, we have numbers that have a difference of 1 (mod the size of the array), and the third set are of diff 2 (mod size)

This guarantees to roll through every permutation, and for every step in the loop it changes both i and j - it's only the *difference between i and j* that changes over the longer timeframe.

That may seem complex when I describe it, but it's really not. If we draw the permutations as a square:

   0,0  1,1  2,2
   0,1  1,2  2,0
   0,2  1,0  2,1

The usual simple nested loop starts at the top left of the square and moves down the column before hopping onto the next column... but the rolling loops simply goes across the rows first instead.

Now at first glance, this may not seem very much more random... but in practice (with the nouns and adjectives all randomised before beginning) it is much better. It will eventually loop around again and the adjectives will start to come in the same order - but we have around 100 adjectives - and it's extremely rare for us to have one customer send 100 parcels all at once.

Benefits of this solution:

  • it's quick (ie it doesn't first enumerate all permutations before randomising
  • it touches every permutation eventually (rather than random-search and maybe it never finds the final one)
  • you can stop the loop at any time (eg when you have populated your next 1000 code-pairs) and restart by reshuffling the arrays into a new random arrangement
  • each code-pair is only checked *once* for whether or not it already exists - rather than randomly finding the same codepair over and over forever)
  • it produces a different i and j for every single iteration even of the inner loop

There's only one more tweak that my colleague suggested he said that to create a product of each of the adj/noun arrays is actually relatively quick - so in *our* project the final rolling-loop is only applied to two copies of the full product of all nouns and adjectives (randomised separately)

Here's my implementation which will keep generating pairs of code-phrases until we have the specified number.

   def populate_codes_until(final_count)
    current_count = self.count

    # generate every permutation of nouns and adjs 
    # then shuffle and duplicate
    froms = adjs.product(nouns).shuffle
    tos = froms.shuffle
    max_val = froms.size - 1
      # here we are iterating the *step size* 
      0.upto(max_val) do |step_size|
        # here we simply step through both arrays
        0.upto(max_val) do |i|

          # figure out j based on step-size and mod
          j = (i + step_size) % (max_val + 1)

          self.find_or_create_by(:from => froms[i].join(" "), :to => tos[j].join(" ")) do
            current_count += 1 # actually created a new one

          return if current_count == final_count


yes, it's quite possible that this solution already exists somewhere and has a snazzier, and more widely-accepted, name... but I couldn't find it when I went googling, therefore it didn't exist :)

If you do happen to find it, let me know!

No comments: