Tuesday, March 24, 2009

Progress Without Perfection

Initially, I thought that given the nice detail in the Gravity team's papers, I would be able to implement these and get very close scores. That has turned out not to be the case. As the models get bigger and more complex, my results have drifted farther from theirs: not terribly far, but far enough. This could be due to details they didn't include in the papers, details I missed in the papers, or my misinterpretations (or programming errors) in implementing them.

This was irritating me last week, because I didn't really want to move on to the next level of the algorithm until I had gotten the previous one working. But now I've given up on that. Even though I am not as close to their results as I would like, I am making progress, and with my implementation of their 250-feature, retrained model BRISMF#250UM I have a quiz score of 0.8967.

The idea behind the retraining is that because everything is trained simultaneously, the users at the end of the list are adjusted against different movie features than the users at the top of the list were. The retraining, as initially conceived by the Gravity team, was to reset the user features and lock in the movie features, so that when training again the users are adjusted against fixed movie features, hopefully for better results. But they found better results by not just retraining the now-reset user features: they also continued to train the movie features from the original run. That gave much better results, and is how I got the 0.8967.
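To make that concrete, here's a minimal sketch of the retraining flow as I understand it (the helpers train and fillRandom are hypothetical names standing in for my actual training and initialization code):

/** Retraining flow per the paper, as I read it (helper names are hypothetical). */
void trainWithRetrain(double[][] userFeatures, double[][] movieFeatures) {
    train(userFeatures, movieFeatures);        // pass 1: users and movies together
    fillRandom(userFeatures, -0.01, 0.01);     // reset ONLY the user features
    train(userFeatures, movieFeatures);        // pass 2: users relearn against the
                                               // trained movies, which keep updating
}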

What I was doing today was adding the neighborhood-based correction to the predictions. I spent a long time looking at the formula, because I didn't really get it at first, but I think this is the idea.

In general, a neighborhood-based (NB) approach uses the ratings of other users to determine what you would rate something. Since some users are more like you than others, their contributions are weighted by the similarity between the two of you, which can be determined in various ways.
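For example, a user-based version might look something like this (my own toy illustration, not Gravity's formula; the dense ratings matrix and the precomputed sim array are just for clarity, since the real data has to be stored much more compactly):

/** Toy user-based NB prediction: a similarity-weighted average of
    other users' ratings of the movie. A rating of 0 means "no rating". */
static double predictNB(int user, int movie, byte[][] ratings, double[][] sim) {
    double weightedSum = 0.0, weightTotal = 0.0;
    for (int other = 0; other < ratings.length; other++) {
        byte r = ratings[other][movie];
        if (other == user || r == 0) continue;
        weightedSum += sim[user][other] * r;
        weightTotal += Math.abs(sim[user][other]);
    }
    // fall back to a middle-of-the-road rating if nobody similar rated it
    return weightTotal > 0 ? weightedSum / weightTotal : 3.0;
}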

Specific to Gravity's NB correction: some portion of the predicted error is added onto the regular MF prediction, where the predicted error is found via an NB algorithm. Essentially, you look in the training set (the ratings you have the answers for) and find the error of each of that user's other ratings. These are combined with similarity weights again, so that (in the case of Gravity's paper) similar items contribute more to the projected error.
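In code, the correction might look roughly like this. This is my reading of the idea, not the paper's exact formula, and gamma (the weight on the correction) is one of the parameters I still have to work out:

/** MF prediction plus a similarity-weighted NB error correction.
    ratedMovies/errors hold the user's training ratings and the MF
    model's error on each one (actual - predicted). */
static double correctedPrediction(double mfPrediction, int movie,
                                  int[] ratedMovies, double[] errors,
                                  double[][] movieSim, double gamma) {
    double weightedError = 0.0, weightTotal = 0.0;
    for (int i = 0; i < ratedMovies.length; i++) {
        double w = movieSim[movie][ratedMovies[i]];
        weightedError += w * errors[i];
        weightTotal += Math.abs(w);
    }
    if (weightTotal == 0.0) return mfPrediction;
    return mfPrediction + gamma * (weightedError / weightTotal);
}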

I'm still working on how to determine the parameters for the correction, but it looks like it could be promising.

Tuesday, March 17, 2009

Don't Rush Your Math

I've been trying to implement Gravity's BRISMF#250 model but have been getting much higher RMSEs. I had no idea what was wrong, so I decided to move on and start implementing the code to retrain features after the initial training. To test it, I had to rerun my BRISMF#1 model, which resulted in an RMSE 0.01 higher than before, which made no sense. I looked through my code for changes I'd made since I initially ran it and came across an error.

Most algorithms require different ranges of random values to fill the feature arrays. But the way Java's random double generator is set up, it only returns a value between 0.0 (inclusive) and 1.0 (exclusive). Since that's not going to be helpful, I wrote a quick equation to modify the range. Unfortunately, I rushed through what should have been a simple equation and came up with the following.

value = randomDouble * (max - min) - min;

The problem with this is that if you wanted a range between -0.01 and 0.01, you'd end up with one between 0.01 and 0.03. It should have been:

value = randomDouble * (max - min) + min;
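Wrapped up as a little helper (the class and method names here are mine), with a quick sanity check:

import java.util.Random;

public class RandomRange {
    private static final Random RAND = new Random();

    /** Returns a uniform random value in [min, max). */
    public static double nextDouble(double min, double max) {
        return RAND.nextDouble() * (max - min) + min;
    }

    public static void main(String[] args) {
        // should all print values between -0.01 and 0.01 now
        for (int i = 0; i < 5; i++) {
            System.out.println(nextDouble(-0.01, 0.01));
        }
    }
}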

Now that I've fixed this I can try BRISMF#250 again.  Lesson learned: don't rush your math (even really simple math).

Friday, March 13, 2009

Success (Relatively)

So now that the data is sorted by date as the Gravity team describes in their paper, everything worked according to plan. It ran in 13 epochs instead of 50-ish and got a 0.9217 RMSE instead of 1.0041.

Now I can finally start playing around with the other things they've tried.  Maybe at some point I can try something of my own.

Matrix Factorization

The idea behind matrix factorization (which is similar, if not identical, to what the Netflix community is calling SVD, singular value decomposition) is that you can estimate an I x J matrix R (the ratings matrix) by multiplying two smaller matrices: an I x K and a K x J. Each of those K rows/columns is known as a feature, and the matrix factorization (MF) algorithm will estimate the scores for each user and movie.

What this allows you to do is essentially generate scores for how much action, comedy, romance, etc. is in a movie, and also for how much a certain user likes those aspects.

So I am using the papers written by the Gravity team to try this out. They've outlined it really well, down to actually giving you the algorithm in pseudocode and listing the different parameters they've tried.
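To give you a feel for it, here's roughly what one training epoch boils down to. This is a bare-bones regularized MF sketch, not Gravity's exact BRISMF (theirs adds bias features, among other things), and lrate and reg are example parameters, not their published values:

/** One stochastic-gradient epoch over all training ratings.
    users[r], movies[r], ratings[r] describe the r-th training rating. */
static void trainEpoch(int[] users, int[] movies, byte[] ratings,
                       double[][] userF, double[][] movieF,
                       double lrate, double reg) {
    int k = userF[0].length; // number of features
    for (int r = 0; r < ratings.length; r++) {
        double[] p = userF[users[r]];
        double[] q = movieF[movies[r]];
        // predicted rating is the dot product of user and movie features
        double predicted = 0.0;
        for (int f = 0; f < k; f++) predicted += p[f] * q[f];
        double err = ratings[r] - predicted;
        // nudge both feature vectors toward reducing the error
        for (int f = 0; f < k; f++) {
            double pf = p[f], qf = q[f];
            p[f] += lrate * (err * qf - reg * pf);
            q[f] += lrate * (err * pf - reg * qf);
        }
    }
}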

So I started working on this and eventually got it to run. But the algorithm that they say takes 13 epochs (loop cycles, essentially) took mine about 50 with the same parameters. So now I'm re-sorting the data by date as they did to see if that will help. But since that took me all day yesterday, I decided to submit a prediction from the features I had already built. It got a 1.0041 RMSE on the quiz set, which is not good.

Just to give you an idea: Cinematch (Netflix's algorithm) gets a 0.9514, and the leader (Pragmatic Theory, as of today) has a 0.8597.

Things I've Learned About Java

Doing a project with this scale of data (100 million ratings by 480,189 users on 17,770 movies) shows a person just how much they don't know about programming yet.

Here's a few things I've learned so far:

1. Objects take up a crap ton of memory.  

You don't realize this when you're just making your little Person objects in class, but when you try to fit 100 million ratings in memory, objects are not your friend. It's all about arrays of primitive types.

2. Integers take up way too much space.  

Even the relatively small, programmer-abused int type will make you cry when you try to do this. I had to use short for movie ids and byte for ratings.

3. Doubles are stupid.

I never knew this, but double math (really IEEE 754 floating point, so it's not just Java) is imprecise. I don't have much choice (both speed- and memory-wise) but to use them, but adding simple decimals comes out really goofy: 0.1 + 0.2 gives 0.30000000000000004. Google "java double arithmetic" and see the madness.

4. Java does not pass by reference.

It passes references by value. (I had heard this before but never ran into a problem with it until now.) You may say, "What the heck does that mean?" Well, take a Dog object called fido and pass it to a method that takes a Dog parameter; say it's called goofy inside the method. Right now fido and goofy point to the same thing. If you change something in the goofy object it will change the same thing in the fido object, because they're the same object. But if you say goofy = new Dog(); or goofy = null; they no longer point to the same thing. fido still points where it did, but now goofy points to something else.

That doesn't seem like too much of a problem, but... I had an array of 48,000 PrintWriter objects as I was trying to reorganize the data. At the end of each run I used a for-each loop (for (PrintWriter print : outputArray)) to set them all to null, so they could be reused in the next run. But since each element in that array was only copied into the newly declared print variable, setting print = null didn't set outputArray[i] = null. So: a wasted hour or two of running the sorting program.
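Here's a tiny demo of the gotcha (general Java behavior, boiled down from my sorting code, not the code itself):

import java.io.PrintWriter;

public class ForEachGotcha {
    public static void main(String[] args) {
        PrintWriter[] writers = new PrintWriter[3];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new PrintWriter(System.out);
        }
        // Looks like it clears the array, but only reassigns the local copy:
        for (PrintWriter print : writers) {
            print = null;
        }
        System.out.println(writers[0]); // still a PrintWriter, not null
        // What actually works: index into the array itself.
        for (int i = 0; i < writers.length; i++) {
            writers[i] = null;
        }
        System.out.println(writers[0]); // null
    }
}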

That's all I have for now, but I'm sure I'll be posting about this again.

My First Attempts

I'll preface this by saying I have no hopes, expectations, or anything else of actually competing in this prize.  I came into this hoping to learn about recommender systems.

So I started by finding some research papers on recommender systems and collaborative filtering. I read those and assumed I could start implementing a nearest neighbor algorithm. So I pulled out my trusty old Java and wrote a quick app to import the training set into a MySQL database. Once the import actually started working, it took 10 hours to finish. At that point some reality set in as to how massive this was compared to anything I'd worked on before. Little did I know, I still hadn't learned.

Then I started to implement my nearest neighbor algorithm. After working on that off and on for a week or so, I ran it. Five hours later it was still calculating similarity scores for the first prediction. With another 1.5 million ratings to predict that it hadn't even started on, I realized this was pointless. So, a trip to the Netflix forum, and I realized that nobody in their right mind is running their algorithms from a database. After putzing around on my own for a while trying to figure it out, I found a Java framework, which I proceeded to modify to my heart's content to get it all running in memory.

Now that I felt like this could start working, I modified my nearest neighbor algorithm to work with the memory backend, but yet again it was taking way too long: 8,000 out of 1.5 million predictions in 12 hours. Another trip to the forum showed that my assumption, drawn from the papers I had read, that nearest neighbor was the best algorithm to start with was highly faulty. On to matrix factorization...