Thursday, February 6, 2014

Redis autocomplete

Autocomplete is a very useful tool in a webapp, so when a use case for it came up, I began looking for autocomplete solutions in Redis.

There are some very neat posts about how to build one, and there's also a gem: seatgeek/soulmate. It's nice and feature-complete, but it's overkill for what I'm trying to do - it's practically a data store in itself. It also generates lots and lots of keys, which I don't really want: I'd like to keep all of my data in the same store, and that kind of key explosion would make it very difficult.

If you read over the various options for Redis autocomplete, you'll find quite a few descriptions of indexing via hash keys, and that approach is pretty straightforward if you think about it. But I was convinced there was a way of doing autocomplete that relied on sorted sets.

And eventually I figured it out. There are two basic requirements to get this to work. First, you have to have a unique description for every value you want to index. Second, you have to invent a mapping from each autocomplete word to an integer that preserves alphabetical order, so that a < aa < ab < b ... and so forth.

The mapping function is fun, because it's basic computer science. Instead of a hexadecimal number system with sixteen values for each digit, I'm building a number system with 27 values for each digit: one for each letter, plus a padding value that sorts below 'a'. Then I right-pad (or truncate) each value I want to index so that they're all the same length. This provides the property listed above where a < aa < ab < b - and now I can store those values as scores in a sorted set.
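
In Python, the mapping might look something like this - a sketch that assumes lowercase a-z terms, an implicit padding value of zero, and a fixed length of 10 characters (the names are just for illustration):

```python
LENGTH = 10   # pad or truncate every term to this many characters
BASE = 27     # 26 letters plus one padding value that sorts below 'a'

def score_for(term):
    """Encode a term as a base-27 integer so that a < aa < ab < b."""
    term = term.lower()[:LENGTH]                # truncate to the fixed length
    score = 0
    for ch in term:
        score = score * BASE + (ord(ch) - ord('a') + 1)   # 'a' -> 1 ... 'z' -> 26
    score *= BASE ** (LENGTH - len(term))       # right-pad with the zero value
    return score
```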

This is why we need a unique description for each value - the description becomes the member stored in the sorted set. Take an example: we want to index the value 'foo' and associate it with a post id, 123. We run our autocomplete value 'foo' through the function above, yielding some (big) integer. We take that integer as the score, and for the member we store the tuple ('foo', 123). This allows us to build an index that maps multiple values to multiple ids.
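
Indexing is then a single ZADD. Here's a sketch using redis-py (with the 3.x-style zadd signature); the 'autocomplete' key name and the 'term:id' member format are just illustrative choices:

```python
import redis

r = redis.Redis()

def index_term(term, post_id):
    # The member is the unique (term, id) description; the score is the encoded term.
    r.zadd("autocomplete", {f"{term}:{post_id}": score_for(term)})

index_term("foo", 123)
```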

When searching for a value, the process is very similar. Let's say the user starts by typing an 'f', since he's searching for the 'foo' value we stored above. The system first computes the value for 'f', which given our function will be less than the value for 'foo'. Next the system increments the last character of the search string and computes the value for that - in this case, the value for 'g'. Again, given our function, this is guaranteed to be greater than the value we're looking for.

Now we simply ask Redis for any set members with a score between those two values. Limiting the number of results is a good idea, but users are probably only interested in the first ten results anyway. They can always type more characters to refine the autocomplete.
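
Putting the two bounds together with ZRANGEBYSCORE and a LIMIT looks something like this - a sketch that reuses score_for and the 'autocomplete' key from above; the upper bound is computed as the prefix's score plus one increment at its last position, which is the same as bumping 'f' to 'g' but also copes with a prefix ending in 'z':

```python
def autocomplete(prefix, limit=10):
    prefix = prefix.lower()[:LENGTH]
    lo = score_for(prefix)                      # everything starting with the prefix scores >= lo
    hi = lo + BASE ** (LENGTH - len(prefix))    # ...and strictly below hi
    members = r.zrangebyscore("autocomplete", lo, hi - 1, start=0, num=limit)
    # Each member is "term:id"; split it back into the stored pair.
    return [m.decode().rsplit(":", 1) for m in members]

print(autocomplete("f"))   # e.g. [['foo', '123']]
```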

So what's the performance? In Redis, ZRANGEBYSCORE is O(log N + M) for N elements in the set and M results returned, which is effectively O(log N) with a constant LIMIT. ZADD is O(log N) as well. For a search algorithm, that's about as fast as you're going to be able to get.

There are a couple of weaknesses that keep this scheme from being perfect for everyone. It won't work if you need to weight your terms; that's a good use case for something like soulmate, which supports weighting.

Also, this algorithm is length-limited. Redis stores sorted set scores as double-precision floats, which only represent integers exactly up to 2^53, so with a 37-character set you'll start losing precision if you go over 10 characters. For my use case, that's a lot of characters. I've got my length set to 10 characters, and I can't imagine I'll exceed that.
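
A quick back-of-the-envelope check of that limit (assuming a 37-value alphabet and exact integer scores up to 2^53):

```python
values_per_digit = 37      # e.g. 26 letters + 10 digits + one padding value
exact_limit = 2 ** 53      # the largest range of integers a double represents exactly

length = 0
while values_per_digit ** (length + 1) <= exact_limit:
    length += 1
print(length)              # -> 10: terms up to 10 characters keep exact scores
```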
