Sunday, November 11, 2012

Kaggle Titanic Tutorial

I've been playing with the Kaggle Titanic tutorial for the last couple of days, and while I'm having fun with it, I was a little put off. The tutorial has a walkthrough of how to do the analysis with Python, and I had trouble following its examples. At first I thought it was just the Python language, but I eventually came to realize it was something else.


The Kaggle tutorial is horrible code. Honestly, I think that's on purpose. Kaggle probably assumed that most people wouldn't understand enough object-oriented programming to write the code properly. The NumPy library also doesn't seem to care much about OOP - it operates entirely on multi-dimensional arrays. That makes sense in a scientific setting, but the simple solution there is to return arrays from your domain objects, so that your math functions operate over a view of your data. That would be incredibly extensible.
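A minimal sketch of that idea, assuming a hypothetical Passenger class (the class, its field names, and both helper functions are made up for illustration and are not part of the Kaggle tutorial). The domain objects hand NumPy a plain array of ages, and the math runs on that array. Note that np.array builds a copy here rather than a true NumPy view.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class Passenger:
    """Hypothetical domain object; the field names are illustrative."""
    pclass: int
    age: Optional[float]   # None when the CSV cell was empty
    fare: Optional[float]


def known_ages(passengers: List[Passenger]) -> np.ndarray:
    """Return the non-missing ages as a NumPy array for the math to run on."""
    return np.array([p.age for p in passengers if p.age is not None], dtype=float)


def fill_missing_ages(passengers: List[Passenger]) -> None:
    """Replace each missing age with the median of the known ages."""
    median_age = float(np.median(known_ages(passengers)))
    for p in passengers:
        if p.age is None:
            p.age = median_age


passengers = [
    Passenger(pclass=1, age=30.0, fare=80.0),
    Passenger(pclass=3, age=None, fare=7.5),
    Passenger(pclass=3, age=20.0, fare=8.0),
]
fill_missing_ages(passengers)
print([p.age for p in passengers])
```

The numeric work still happens in NumPy, but the calling code talks about ages instead of column 3.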

Like I said, I'm pretty sure that was a conscious decision by the Kaggle guys. I hope it was, at least. I really don't think it does anyone any favors, though. Take for example this code from the Kaggle example:

test_data[test_data[0::,3] == '',3] = np.median(test_data[test_data[0::,3] != '',3].astype(np.float))
test_data[test_data[0::,9] == '',9] = np.round(np.mean(test_data[test_data[0::,9] != '',9].astype(np.float)))
for i in xrange(np.size(test_data[0::,0])):
  if test_data[i,7] == '':
    test_data[i,7] = np.median(test_data[(test_data[0::,7] != '') & (test_data[0::,0] == test_data[i,0]) ,7].astype(np.float))

Granted, the source has comments in it which make this ... understandable, if not readable. But compare this to my ruby version:

require 'active_support/core_ext/object/blank'
known_ages = test_data.map { |x| x.age }.reject { |age| age.blank? }
average_age = known_ages.reduce(:+).to_f / known_ages.size
test_data.each { |x| x.age = average_age if x.age.blank? }
common_embarkation = test_data.each_with_object(Hash.new(0)) { |x, counts| counts[x.embarkation] += 1 unless x.embarkation.blank? }.max_by { |_, count| count }.first
test_data.each { |x| x.embarkation = common_embarkation if x.embarkation.blank? }
average_class_prices = test_data.group_by { |x| x.clazz }.each_with_object({}) { |(clazz, passengers), totals| fares = passengers.map { |x| x.fare }.reject { |fare| fare.blank? }; totals[clazz] = fares.reduce(:+).to_f / fares.size }

This is using things like .age and .blank? - which make the code cleaner, even if you don't know exactly what is going on. Any person with even rudimentary coding experience can read the above and understand it - without comments.

(Disclaimer - I hacked together the above in about ten minutes and I have not tested it. I can see at least two possible bugs. Don't take it as gospel.)

I also think that Python is less object-oriented than Ruby. I found this little explanation of why there's a len function in Python, and while some of it makes sense, I'm still not convinced that it's the best way of approaching the problem. It's true that conventions diverge over time, but I prefer Ruby's Array#count to Python's len() function.
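For reference, len() is less of a free-standing function than it looks: calling it on an object invokes that object's __len__ method. A small sketch, using a hypothetical PassengerList class that exists only to show the protocol:

```python
class PassengerList:
    """Hypothetical container, used only to illustrate the len() protocol."""

    def __init__(self, names):
        self._names = list(names)

    def __len__(self):
        # len(obj) ends up calling this method
        return len(self._names)


manifest = PassengerList(["Allen", "Braund", "Cumings"])
print(len(manifest))        # function-style call
print(manifest.__len__())   # the method that len() relies on
```

So the object still decides what its length is; the difference from Ruby is mostly in how the call is spelled.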

I think that design choice also encourages Python code like the example above. I don't think that's unusual Python code, whereas most people would consider it horrible Ruby code if it were written similarly in Ruby. If your language itself violates object-oriented principles, then you're going to end up with developers who don't write object-oriented code and think it's okay.

To be fair, I know Ruby far better than I know Python. I'm planning to rewrite the tutorial in Ruby so that I feel I have a complete grasp of what I want to do, and then rewrite it in Python the way I think it should be. I think that will give me a good sense of which language is better to work with going forward.

I will say that it seems (from my initial web searches at least) like the Python libraries for analysis outstrip the Ruby versions by a big margin. In order to do the Ruby version I'm thinking of, I either need to find a Random Forest gem (which does seem to exist) or figure out some interop with Python, probably just by outputting an intermediate file from the Ruby code and then doing the last step in Python.
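The intermediate-file route could look roughly like this on the Python side. This is only a sketch under assumptions: the CSV layout and column names are invented, an in-memory string stands in for the file the Ruby step would write, and load_features is a hypothetical helper.

```python
import csv
import io

# Stand-in for the file the Ruby step would write; the column names are assumptions.
intermediate_csv = io.StringIO(
    "survived,pclass,age,fare\n"
    "1,1,38.0,71.28\n"
    "0,3,22.0,7.25\n"
)


def load_features(handle):
    """Split the cleaned rows into feature vectors and survival labels."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        labels.append(int(row["survived"]))
        features.append([float(row["pclass"]), float(row["age"]), float(row["fare"])])
    return features, labels


features, labels = load_features(intermediate_csv)
print(features, labels)
# From here the two lists could be handed to a random forest implementation,
# for example scikit-learn's RandomForestClassifier (not run in this sketch).
```

With this split, all of the data cleaning stays in Ruby and Python only does the model fitting.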
