Saturday, December 29, 2012

Kaggle FTW

After more than a month of playing with random forests, I ditched them and built a decision tree for the Titanic problem. My first attempt scored higher than the tutorial score, which was my initial goal for this exploration into data science. It wasn't much higher, but it meant a few more correct predictions, and I'm calling that a win.
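For flavor, a hand-rolled decision tree in this spirit can be a few lines of plain code. This is not my actual tree; the split rules below (sex, then passenger class and fare) are illustrative guesses at Titanic-style features:

```python
def predict_survival(passenger):
    """Return 1 (survived) or 0 (did not) for a dict of features.

    Illustrative rules only -- not the tree described in this post.
    """
    if passenger["sex"] == "female":
        # Most female passengers survived; carve out an exception
        # for expensive third-class tickets as an example of a split.
        if passenger["pclass"] == 3 and passenger["fare"] > 20:
            return 0
        return 1
    # Male passengers mostly did not survive.
    return 0

print(predict_survival({"sex": "female", "pclass": 1, "fare": 80}))  # -> 1
```

The appeal over a random forest is exactly this: every prediction can be traced through a handful of readable branches.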

I'm reasonably sure the random forest algorithm was overfitting the data, which is why my scores never rose above the tutorial score. I want to explore some other algorithms first (and I have an idea for a new feature to extract), but I may eventually implement the random forest in Ruby just so that I know what it does.
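The core of what I'd be implementing is small. A minimal sketch, assuming any tree learner with a fit-style interface (the `train_tree` callable here is a stand-in, not a real library API): bootstrap-sample the rows, train one tree per sample, and take a majority vote at prediction time.

```python
import random
from collections import Counter

def train_forest(rows, labels, train_tree, n_trees=10, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the data.

    train_tree(rows, labels) must return a callable tree: row -> label.
    """
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample rows with replacement. This resampling is
        # one source of run-to-run variance when no seed is fixed.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(train_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx]))
    return forest

def predict_forest(forest, row):
    """Majority vote across the trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]
```

Real implementations also subsample the candidate features at each split, which is a second source of randomness on top of the bootstrap.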

Another thing that bothered me about the random forest was the inconsistency of its results. I worked up 10-fold cross validation for my algorithms, and the random forest's results varied by as much as five percent between runs over the same features. I'm not sure whether this is due to my misunderstanding the random forest or is intrinsic to the algorithm. The documentation for the Python implementation is basically non-existent (repeating the name of a parameter in its description does NOT help a user understand what the parameter DOES), and it's entirely possible I'm using it incorrectly.
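One plausible culprit is an unpinned random seed: the forest redraws its bootstrap samples and feature subsets on every run, so scores drift unless the seed is fixed. A sketch of 10-fold cross validation with the seed pinned, assuming scikit-learn is the Python implementation in question (the data here is synthetic, not the Titanic set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data standing in for the real feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# random_state pins the bootstrap draws and feature subsampling, so
# repeated runs over the same features produce identical fold scores.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
folds = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=folds)
print(scores.mean())
```

If scores still swing several percent with everything seeded, the variance is coming from the folds themselves (small data, unstable splits) rather than the forest.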

In any case I've succeeded in my primary goal of beating the tutorial. Next goal? Higher.
