Thursday, February 6, 2014

Redis autocomplete

Autocomplete is a very useful tool in a webapp. So when I found the use case, I began looking for autocomplete solutions in Redis.

There are some very neat posts about how to build one, and there's also a gem: seatgeek/soulmate. It's very nice and feature complete, but it's pretty much overkill for what I'm trying to do. It's pretty much a data store in itself. And it'll generate lots and lots of keys, which I don't really want to do - I'd like to keep all of my data in the same store, and the key explosion will make that very difficult.

If you read over the various options for redis autocomplete, you'll find quite a few descriptions of indexing via hashkey. And it's pretty straightforward if you think about it. But I was convinced that there was a way of doing autocomplete that relied on sorted sets.

And eventually I figured it out. There are two basic requirements to get this to work. First, you have to have a unique description for every key you want to index. Second, you have to invent a mapping for each autocomplete word to the set of integers such that strings have a weight that works suchly: a < aa < ab < b ... and so forth.

The mapping function is fun, because it's basic computer science. Instead of a hexadecimal number set that contains sixteen values for each digit, I'm building a number set that contains 27 values for each digit. Then I take and right-pad (or truncate) each value I want to index so that they're all the same length. This provides the property listed above where a < aa < ab < b - and now I can store those values as scores in a sorted set.

This is why we need a unique description for each value - the description becomes the key for the sorted set. Take for example - we want to index the value 'foo' and associate it to a post id: 123. We run our autocomplete value 'foo' through the function above, yielding some (big) integer. We take that integer as the score, and for the value we store a tuple ('foo', 123). This allows us to build an index that contains multiple values to multiple ids.

When searching for a value, the process is very similar. Let's say the user starts by typing an 'f', since he's searching for the 'foo' value we stored above. The system will first compute the value for 'f', which given our function will be less than the value for 'foo'. Next the system will compute the next value for the last character of the search string. In this case, the system will compute the value for 'g'. Again, given our function, this is guaranteed to be greater than the value we're looking for.

Now we simply get from redis any set members with a score between those two values. Limiting the number of results is a good idea, but in most cases users are probably only interested in the first ten results in any case. They can always type more characters to refine the autocomplete.

So what is the performance? In Redis, ZRANGEBYSCORE (with a limit) is listed as O(log N) for N elements in the set. ZADD is O(log N) as well. For a searching algorithm, that's about as fast as you're going to be able to get.

There are a couple of weaknesses to this scheme that don't make it perfect for everyone. This scheme won't work if you need to weight your terms. That's a good use case for something like soulmate, which supports that.

Also, this algorithm is length-limited. Given a 64-bit (signed) long value as the score and a 37-character set, by my calculations you'll overflow the score if you go over 12 characters. For my use case, that's a lot of characters. I've got my length set to 10 characters, and I can't imagine I'll exceed that.

Monday, December 9, 2013

Java

One: "So we've got this strongly and statically typed, compiled language here. We call it 'Java'."

Two: "Sounds neat! How do I configure components in it?"

One: "We prefer that you do all of your configuration via deeply nested, fragile XML that is not validated in any way."

Two: "Why don't we just configure the components in ... code?"

One: "Because then you can't change the configuration after compile time."

Two: "We need to dynamically change which factory class creates the database pool?"

One: "JUST GO TYPE! MANAGEMENT LIKES TO HEAR LOTS OF TYPING!"

Two: ...

Wednesday, December 4, 2013

Next: Scala

I just finished reading this great article about Play / Scala and async.

http://engineering.linkedin.com/play/play-framework-async-io-without-thread-pool-and-callback-hell

For most of the applications I build, I don't need the sort of performance that requires a async framework. And from what I've seen of node.js, I'm not going to mess with it unless I need something that fast.

But the things in that article are pretty. Those comprehensions are nice.

As much as I love Ruby, there are a few things that it doesn't do well. It is designed for code writing speed, not execution speed. And it does lack some of the safety features of a compiled language.

After I finish with Twitarr, it'll be time to start messing with some Scala. And probably Play. Looks like fun.

Monday, December 2, 2013

Ember.js views

In the view code (i.e. the code that extends Ember.view) you have to prefix any controller property accesses with 'controller.' - i.e. if you're getting the foo property of the the controller, you have to use @get('controller.foo') - view properties you can get without any prefix, as one would expect.

This kind of makes sense, although I'd point out that there are many other places where Ember automatically looks to parent elements for data.

What confuses me is that in the template that is used by the view you have to prefix view accesses with 'view.' - controller properties are bound without a prefix though.

Yes, the template is EXACTLY opposite the view code.

...?

Friday, September 27, 2013

Thread safety

(I wrote this some days ago but didn't post it. I realized that it reinforced my Ruby camps post so I'm putting it up now)

A few days ago I found this great online book by Avdi Grimm (author of Ruby Tapas) named Objects on Rails. The link goes to the free (as in beer) version of the book, but I highly recommend spending the five bucks and supporting the author. (Sadly the downloadable version does not contain an HTML version, which would be really useful for referencing)

I spent a couple of days deep-diving into the book. Mostly it reinforces a bunch of patterns that I've been using in my own Rails projects, like the Draper gem. (Avdi uses exhibits, which are very similar to what Draper provides as discussed by the authors here)  

Something was bothering me about it though. As I started implementing some of his patterns in my own code, I realized what the problem was. The top-level blog class is not thread-safe.

This is not a problem in MRI, since it uses green threads instead of system threads. However in JRuby, using a single global object like this has the potential for race conditions. It's very unlikely that it could cause a problem in this little blog app - there's very little state in the Blog class and it's mostly initialized when the class is initialized.

But it would be easy for a new developer to see this class and add something like a caching layer. Any mutable state on the classes is going to be a problem, and it's unlikely to be caught on a development system, either.

I've ended up refactoring the top-level blog class out of my app. Instead of leveraging a root object to facilitate unit testing, I'm using DCI patterns to help. It's working well with Redis objects, which are very similar to basic Ruby types. I'll post more about that later.

A tale of two Ruby communities

I've lately begun to realize that there are two Ruby communities.

The first runs MRI - they use Mac OS and occasionally Linux if forced to. They have green threads so they don't worry about ugly things like threading or mutexes or deadlocks. They sometimes write libraries in Ruby, and sometimes in C or C++.

The other community runs JRuby on whatever platform they feel comfortable in (including *gasp* WINDOWS). They have OS-level threads so they avoid things like global variables and mutable class state. Most of the time they write libraries in Ruby, but occasionally they'll include dependencies on Java libraries.

(As a side note, I have no experience with Rubinius. I think it's in the second camp but not sure.)

When you find a new gem that you'd like to use, it's important to figure out what camp it belongs in. Some are very easy to figure out - if it uses Java, you're in the JRuby camp. If it uses C or C++ code, it's the first. Other gems are harder to figure out. Gems that are written in Ruby are theoretically compatible with either implementation. But they aren't always 100% compatible.

Personally I belong in the second camp. I write multi-threaded code. When I run across a nifty gem that uses class variables to hold database connections, I am a sad programmer.

My primary development box runs Windows. That's a certain special level of hell. Many times a gem that otherwise works perfectly fine simply won't run unit tests. Or gems that load their configuration from the local database install. (actually can't blame that one on MRI - that project is entirely JRuby based.)

(There's a certain attitude that Ruby developers get when you say you're running Windows. Comments such as "get yourself a real development machine" are commonplace.)

Ruby's a great language. But it's very difficult to migrate developers from other languages when the barriers to entry are so specific. Ruby developers are proud that they don't have to use a IDE (and right to be proud, I believe), but how much more of a barrier to entry is having to use a MacBook?

And why the hell are people ignoring OS threads?

Won't someone please think about the OS threads?!

Tuesday, September 24, 2013

Which user?

"As someone who builds software, it's important to keep in mind the users."

Most of the time when someone uses that phrase they are referring to the end users of software - the people typing and clicking and being annoyed by bugs. But there are other users as well - I see at least three kinds.

The most overlooked group are the administrators. It's a fuzzy collection (sometimes in more than one way - at least half the sysadmins I've ever met don't hardly shave) of people, but it can include people who install our software, people who maintain the software and hardware underneath, and people who manage end user experience.

I say it's an overlooked group because often the installation is the last thing a developer thinks of. It's natural to want to get the idea down and to see it running (on our dev boxes of course), and sometimes that gets away from us. The funny part is that we're often the group that complains the loudest when something is hard to work with - then we go and pester the systems team to install it for us.

Software that is easy to install and maintain is a treasure.

  • Apache Tomcat is dead simple to get working, works on just about any OS, and is generally obvious what's wrong with it when something's broken. Configuration is XML, which I think is it's largest downside.
  • Redis doesn't run in Windows, but I've already gone on about how easy it is to install.
  • I love the way Chrome / Chromium patches itself unobtrusively.

It's important to have a deployment plan for software, and for it to be easy to follow. Limit dependencies as much as possible. Have a good configuration scheme (e.g. not XML). Have a simple logging strategy.

(I have the hardest time finding the correct logs in Hadoop. This may be due to the fact that it's been set up by developers, but I'd say that goes back to having a good configuration scheme.)

The second (occasionally) overlooked group are developers. Remembering this group is generally a function of experience - usually the experience of trying to fix someone else's spaghetti code. "If it was hard to write, it should be hard to read" is thankfully a mantra that has fallen by the wayside, and I hope it stays dead. Code that is well constructed and readable is also maintainable, and that helps everybody.

Ruby developers have the edge over Java developers here. Java code can be (and usually is) well constructed, but the Java language limits how readable the code is. Ruby code can be a thing of beauty, and there are discussions about how to make it so. Such conversations would be anathema to Java developers, and I think that is a mistake. I've never gotten much out of poetry, but well-written prose pleases me on an aesthetic level, and well-written code is just the same. Seeing a function that solves a problem in a smart manner - feeling that "a-ha" when you capture the essence of a block in your head - provides an emotional satisfaction to our occupation that is under-valued.

Last are the end users. These are arguably the most important users. If we don't have end users, then our software really isn't doing much. There's no shortage of blogs and discussion of the whys and the hows for user experience, and I don't have much to add that couldn't be found elsewhere. End user experience is something that developers sometimes overlook. I've met a number of developers who are happy to push those responsibilities off onto the designers, as though it wasn't a part of their job.

All developers should be able to involve themselves in UX. There's really no excuse for not being able to do so. If we have a strong design team and product owner(s) then we may not have to do much in this area, but developers need to be cognizant of the techniques. After all, we are closest to the computer - we are often able to see things that the computer can do that others can't.

Another interesting point is that the groups can become muddled depending on what kind of software we're working on. If we're writing a library, then the end users are really the developers using the library - they're the ones who have the experience of using the interfaces we're providing. (And library interfaces are an art form all of their own)  If writing infrastructure software then the end users become the admins who will be deploying other packages on top.

It's important to keep in mind the people using your software, no matter how they're involved.