Data Mining Hackathon on (20 mb) Best Buy mobile web site – ACM SF Bay Area Chapter

I recently participated in and won first place (of 97 teams) in the data mining hackathon sponsored by the ACM SF Bay Area Chapter. My solution is described here, along with my code.

My solution relies mainly on several string comparison techniques and on the timestamp information (the overall mappings of the data fluctuate over time). I do a query comparison using the entire query string and also a word comparison where I split the query into words. I create a score for each comparison, query and word, and merge them together along with some customer history to get my results. I first use the results for the entire training set and then add in changes for the particular week in which the query was made. More info below.

For my solution, I created the following tables:

1) query -> sku array mapping for the entire training set time window

2) query -> sku array mapping for each day

3) word -> sku array mapping for the entire training set time window

4) word -> sku array mapping for each day

5) item -> item array mapping (if a customer bought n items within a day, those items would be associated)

6) customer queries within a day of each other

To match queries, I removed all non-alphanumeric characters including spaces.

To match words, I also removed all non-alphanumeric characters.
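A minimal sketch of the two normalization steps described above. It is an illustration, not the code from the attached archive: the class and method names are hypothetical, and lowercasing before stripping characters is an assumption the post does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class QueryNormalizer {
    // Whole-query key: lowercase (assumption), then drop everything that is
    // not a letter or digit, spaces included.
    static String normalizeQuery(String raw) {
        return raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Word keys: split on whitespace first, then clean each token the same way.
    static List<String> normalizeWords(String raw) {
        List<String> words = new ArrayList<>();
        for (String token : raw.trim().split("\\s+")) {
            String cleaned = normalizeQuery(token);
            if (!cleaned.isEmpty()) {
                words.add(cleaned);
            }
        }
        return words;
    }
}
```

With this sketch, "Call of Duty: MW3!" becomes the query key "callofdutymw3" and the word keys "call", "of", "duty", "mw3".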

When building the query and word tables, I first add all the entries from the product catalog (I treat the title as a query and remove the xbox 360 at the end of each title).

Query Matching

When adding the training data, I compare each query with the ones added from the product catalog and merge any misspellings within a Levenshtein distance threshold. The code is conservative in its matching. If any numbers are present, the words or queries being compared must have exactly the same numbers (to prevent queries for different versions of a game from being merged together). The goal was to merge only very similar queries and words (overly aggressive matching hurt the results). If the query doesn’t merge with a query from the product catalog, a new entry is added. Queries from the training set are not merged with each other unless they are a 100% match.
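The conservative merge rule can be sketched roughly as follows. This is not the code from the attached archive: the names are hypothetical, the threshold is left as a parameter because the post does not give its value, and comparing the concatenated digits of both strings is a simplification of the "same numbers" check.

```java
final class ConservativeMatcher {
    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Digits in order of appearance, e.g. "battlefield3" -> "3".
    static String digitsOf(String s) {
        return s.replaceAll("[^0-9]", "");
    }

    // Merge only when the digits agree exactly AND the strings are close.
    static boolean shouldMerge(String query, String catalogKey, int maxDistance) {
        if (!digitsOf(query).equals(digitsOf(catalogKey))) {
            return false;
        }
        return levenshtein(query, catalogKey) <= maxDistance;
    }
}
```

Under this rule "batlefield3" merges with "battlefield3" at a threshold of 2, while "battlefield2" does not, even though it is only one edit away.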

When comparing the test data queries with the tables generated from the product catalog and training data, I first look for an exact match and, if one is found, return that row of skus. Otherwise, I take the query and search for the largest word in the word table that exists in the query, then look for word matches in the remaining string of the query until I’ve searched the entire string. I then take all the words found within the query, add up their sku rows into an answer, and return it. I also pull the rows from the query-per-day tables, using a week’s worth of data (3 days before, current day, 3 days after), and apply a Gaussian to weight the current day the most and farther days less.
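The fallback path (longest known word first, then the rest of the string) might look like the sketch below. It is an illustration under assumptions, not the attached code: the names are hypothetical, and the word table is modeled as a map from word to a row of per-sku scores.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GreedySegmenter {
    // Returns the known words found in the (space-free) query, longest first.
    static List<String> segment(String normalizedQuery, Collection<String> knownWords) {
        List<String> byLength = new ArrayList<>(knownWords);
        // Longest words first; ties broken alphabetically for determinism.
        byLength.sort((x, y) -> x.length() != y.length()
                ? Integer.compare(y.length(), x.length())
                : x.compareTo(y));
        String remaining = normalizedQuery;
        List<String> found = new ArrayList<>();
        for (String word : byLength) {
            if (word.isEmpty()) {
                continue;
            }
            int at = remaining.indexOf(word);
            if (at >= 0) {
                found.add(word);
                // Blank out the matched span so shorter words cannot reuse it.
                remaining = remaining.substring(0, at) + " " + remaining.substring(at + word.length());
            }
        }
        return found;
    }

    // Adds up the sku rows of every word that was found.
    static Map<String, Double> sumSkuRows(List<String> words, Map<String, Map<String, Double>> wordTable) {
        Map<String, Double> total = new HashMap<>();
        for (String word : words) {
            Map<String, Double> row = wordTable.get(word);
            if (row == null) {
                continue;
            }
            for (Map.Entry<String, Double> e : row.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return total;
    }
}
```

For the query "callofdutyblackops" and a word table containing "callofduty", "blackops", "black", "ops", "call" and "duty", the sketch picks "callofduty" and then "blackops", and the shorter words never match because their characters have already been consumed.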

Word Matching

For word matching, I compare the words within a query to the words in my word table. If a word is not found, I use a combination of several word similarity algorithms (Levenshtein, cosine, Jaccard, Euclidean, Jaro-Winkler, Soundex, figuring that combining their scores would smooth out each one’s weaknesses) to come up with a similarity score and select the best match. Once this is done, I add the sku rows for all the words. I also pull the rows from the word-per-day tables, using a week’s worth of data (3 days before, current day, 3 days after), and apply a Gaussian to weight the current day the most and farther days less.
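The week-window weighting used for both the query-per-day and word-per-day tables can be sketched as below. The post does not give the width of the Gaussian, so `sigma` here is an assumed parameter, and the class and method names are hypothetical.

```java
final class DayWindowWeights {
    // Unnormalized Gaussian weight for a day `offset` days from the query day.
    static double weight(int offset, double sigma) {
        return Math.exp(-(offset * offset) / (2.0 * sigma * sigma));
    }

    // counts[i] is one sku's count on day (queryDay - half + i);
    // for a 7-day window, half = 3 and i runs from 0 to 6.
    static double weightedCount(double[] counts, double sigma) {
        int half = counts.length / 2;
        double total = 0.0;
        for (int i = 0; i < counts.length; i++) {
            total += weight(i - half, sigma) * counts[i];
        }
        return total;
    }
}
```

The current day gets weight 1.0 and the weights fall off symmetrically toward the edges of the window; a smaller `sigma` concentrates more of the score on the query day itself.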

Adding results

I’ve found that combining the scores from the query matching and word matching provides a much better result than using either one alone. Each method has its own strengths and weaknesses, but combined they work well. I also combine data using the customer’s history and similar items (though I weight this much lower than the query and word scores because it is not very reliable). In addition, I add in any other queries that were made within the last day by the customer from the test set to handle situations where a customer is looking for multiple items and doesn’t necessarily buy them in the order they were queried.
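One simple way to combine the score sources is a weighted sum per sku, sketched below. The weights in the usage note are placeholders, since the post only says the history score is weighted much lower than the other two, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ScoreMerger {
    // Adds `part`, scaled by `weight`, into the running `total`.
    static void addWeighted(Map<String, Double> total, Map<String, Double> part, double weight) {
        for (Map.Entry<String, Double> e : part.entrySet()) {
            total.merge(e.getKey(), e.getValue() * weight, Double::sum);
        }
    }

    // Returns the k highest-scoring skus, best first.
    static List<String> topK(Map<String, Double> scores, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            out.add(entries.get(i).getKey());
        }
        return out;
    }
}
```

A caller would start from an empty map, call `addWeighted` once for the query scores, once for the word scores, and once for the history scores with a much smaller weight (for example 1.0, 1.0 and 0.1), then take `topK` of the result.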

Misc

I also added a check to see if a numerical sku was entered in the query and if so make sure that the sku is part of my answer. I also never include a sku that a customer has already purchased.
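These two rules could be applied as a final pass over the ranked list, as in the sketch below. It is not the attached code: the names are hypothetical, and putting a sku typed into the query at the front of the answer is an assumption (the post only says the sku must be part of the answer).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SkuRules {
    private static final Pattern DIGIT_RUN = Pattern.compile("\\d+");

    // Digit runs in the raw query that are actual skus from the catalog.
    static List<String> skusMentioned(String rawQuery, Set<String> knownSkus) {
        List<String> hits = new ArrayList<>();
        Matcher m = DIGIT_RUN.matcher(rawQuery);
        while (m.find()) {
            String candidate = m.group();
            if (knownSkus.contains(candidate) && !hits.contains(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    // Mentioned skus first (assumption), then the ranked skus; anything the
    // customer already purchased is dropped; at most k skus are returned.
    static List<String> finalAnswer(List<String> ranked, String rawQuery,
                                    Set<String> knownSkus, Set<String> purchased, int k) {
        List<String> answer = new ArrayList<>();
        for (String sku : skusMentioned(rawQuery, knownSkus)) {
            if (!purchased.contains(sku) && answer.size() < k) {
                answer.add(sku);
            }
        }
        for (String sku : ranked) {
            if (answer.size() >= k) {
                break;
            }
            if (!purchased.contains(sku) && !answer.contains(sku)) {
                answer.add(sku);
            }
        }
        return answer;
    }
}
```

Checking digit runs against the catalog keeps a query like "halo 3" from being treated as a sku lookup.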

I’ve attached my code. Inside is a small README.txt describing the code and how to run it. It is written in Java. I’ll try to answer any questions. It also includes the file I submitted for my final score.
BestBuy.7z

October 4th, 2012

Tom’s Hardware Battlefield 3 Performance Results

A pretty thorough article covering a lot of AMD and Nvidia cards is here. It matches what I’ve been seeing with my Radeon 6870: about 50 FPS on high settings.


October 31st, 2011

QX6800 + Radeon 6870 HD running Battlefield 3

I made some upgrades to my $500 gaming machine built in 2008 and am able to run at the default HIGH settings in Battlefield 3 with 40-60FPS. Here are my settings in case someone else has a similar system. It’s good to know that older hardware on an LGA775 board can still handle decent games. The QX6800 processor may be a few years old but still performs pretty well.

CPU: Intel Core 2 Quad Extreme QX6800 @ 3.2Ghz OC $75.00 (2011 on craigslist)
CPU Cooler: Cooler Master HyperTX2 $24.99 (2007)
MB: Gigabyte GA-P35-DS3L $89.99 (2007)
RAM: Wintec Ampo DDR2 PC-6400, 2Gbx2 $35.99×2 (2007 and 2009)
Graphics: Radeon HD 6870 $169.99 – $30.00 rebate (2011)
Hard Drive: Seagate Barracuda 7200.1 320GB 16Mb $79.00 (2007)
PSU: 850 Watt

3DMark11: P4070

More info on my machine here and here:

***Update: after Catalyst’s preview driver version 3, I am able to run on full ULTRA settings at around 30FPS, and if I significantly lower AA I can run in the 50FPS range, so that’s where I plan to run. Really happy with ULTRA settings on some older hardware.


October 25th, 2011

Upgrading from E2160 to QX6800 for Battlefield 3

In my previous post, I discussed upgrading my XFX 9600GT to a Radeon HD 6870 to prepare for BF3. Though this update did a lot to improve the video quality of BF3 beta (and BFBC2), I wasn’t getting very high FPS. In BFBC2, my system can run around 30FPS with the highest video settings, and BF3 beta ran at HIGH (not ULTRA) with 20-30FPS. My current processor is an E2160 1.8 (OC @ 3.0). Since I have an LGA775 motherboard, I was limited in what upgrades I could make. BF3 recommends using a quad core processor. The problem was that even though these processors came out as much as 3 years ago, the prices are still quite high. Even on Craigslist, a used QX6800 could be $200-300, and a new one $1000! I didn’t really want to put much money into my aging system and wasn’t quite ready to upgrade to a better socket, so I went on Craigslist and luckily found a QX6800 with 2 sticks of RAM (which I am not using) for $75. The processor has never been overclocked and hasn’t been used in 6 months. This processor seems to be more expensive than the Q6600s because of its unlocked multiplier. Though it is not as good as modern processors, I’m hoping to get enough out of it that my machine is no longer CPU-limited. I will post results when I have them.

*Update: The processor works fine but is already running at 65C at stock speeds with an aftermarket cooler. I pushed the voltage down some (1.25V) and was able to overclock to 3.2Ghz; that’s not much, but it maxes out at 65C. I’m pulling 60+ FPS on the highest BFBC2 settings and am thinking BF3 will be similar. Looks like my machine was CPU bottlenecked.

*Update 2: Played BF3 at midnight (10/25) and am getting between 40-60FPS with the default HIGH settings. The game looks great and I’m glad the new video card and CPU can handle this game. This is definitely my favorite of all the BF games (and I liked BF2 and BFBC2 a lot).

CPU comparison
Processor | # cores | stock speed | oc’d   | cache | FSB     | BFBC2 (HIGH) | 3DMark11
----------|---------|-------------|--------|-------|---------|--------------|---------
E2160     | 2       | 1.80Ghz     | 3.0Ghz | MB    | 800Mhz  | 25-35FPS     | P2783
QX6800    | 4       | 2.93Ghz     | 3.2Ghz | 8MB   | 1066Mhz | 60-70FPS     | P4070


October 18th, 2011

Upgrading for Battlefield 3 (Radeon HD 6870)

So I’ve decided to upgrade my video card for the upcoming BF3 release. My original setup, built in April 2008 (modeled after Tom’s Hardware $500 build and discussed here):
CPU: Intel Pentium E2160 Dual-Core $69.99
CPU Cooler: Cooler Master HyperTX2 $24.99
MB: Gigabyte GA-P35-DS3L $89.99
RAM: Wintec Ampo DDR2 PC-6400, 2Gb $35.99
Graphics: XFX 9600GT (Tom’s Hardware used a 8800GS) $169.99 – $30.00 rebate
Hard Drive: Seagate Barracuda 7200.1 320GB 16Mb $79.00
Case and PSU: Antec NSK4480B with Earthwatts 380W PSU $69.99
New Graphics: SAPPHIRE Radeon HD 6870 1GB

Since then, I’ve added 2GB of RAM and moved to Windows 7 (64bit). Since I now have 3 kids, I can’t justify to myself spending $500 on a video card, and I would probably be bottlenecked by my older system anyway, so I have decided to go the cheaper route. Based on Tom’s Hardware $500 build, I decided to go with the “SAPPHIRE 100314-3L Radeon HD 6870 1GB 256-bit GDDR5 PCI Express 2.1 x16 HDCP”. On New Egg, it was $169.99 – $20.00 rebate, making it nearly identical to the price of the card I bought 3 years ago for my original build. This card has PCI Express 2.1 while my motherboard only supports 1.0, so my system will be limiting the capabilities a bit (or maybe not). Also, this card (from what I could read) will need a maximum of 150 Watts (my current card uses around 100 Watts). I was a little concerned that I wouldn’t have enough headroom with my 380 Watt PSU. To make sure I would, I bought a Watt meter to measure the actual usage of my machine. Without doing anything too intensive, my machine only draws around 120 Watts; when running BFBC2 it goes up to around 170 Watts. So, it seems like adding an additional 50 Watts shouldn’t be an issue.

Lost Overclocking, Found Dust
Something else I wanted to do was perform some benchmarks to see what kind of difference the new card makes. In the process of preparing for this, I realized that all my overclocking settings had been reset at some point (I think when I added new memory). I spent a bit of time trying to get them reinstated and found that I was having problems with the settings sticking (boot up would fail and revert to standard settings). I found that if I unplugged 2 external hard drives from my machine at power up, the system would boot fine. I don’t know why this makes a difference, but it does, and I tested it multiple times to confirm. Another thing I found when I restored the overclocking was that my CPU was getting way too hot. It was reaching 70C, and I have my motherboard set to shut down when this occurs. The culprit turned out to be a lot of dust. The entire radiator of my CPU heatsink was clogged, and there was quite a bit of dust on the input/output areas of the case and on the fan. After removing the heatsink and running the vacuum attachment over it, I put everything back in the system and was getting highs of around 59C using Prime95 (the same numbers as when I originally built the system).

Benchmarking
Normally I would run something like Futuremark to get some benchmarks, but I decided (out of sheer laziness) to just capture the FPS in BFBC2 using Fraps and informally compare the differences. With mid-high settings, my non-oc system was averaging around 12 FPS. After figuring out that I wasn’t overclocked anymore and fixing it, I noticed a huge difference in playability: my oc’ed system ran around 35 FPS. I’ve run BF3 beta as well and saw a similar pattern, though my oc’ed system was averaging in the low 20s.

*Update: when I upgraded my card to the 6870, my default video settings shot up to HIGH (I didn’t check what my previous card was using, but it must have been extremely low because the game looks great). I found, though, that the game was running in the 20-25FPS range. When I lowered the anti-aliasing (2X) and the anisotropic filter (2X), I got a consistent 30FPS. Though the game looks good and is definitely playable, I was hoping for a little bit more of a boost from the new card. It’s possible that my older CPU, even overclocked, is bottlenecking the performance a bit. I’m going to continue tweaking the settings to get the best display possible.


Battlefield 3 beta default settings


October 5th, 2011

Google Software Engineer Interviews

There’s some mystery surrounding the Google interview process, and I spent a lot of time looking up other people’s experiences, so I figured I’d throw mine out there as well. All in all, I didn’t think the process was as mystifying as other posters have suggested, though it is definitely not typical. I had a non-technical phone call, a technical phone interview, and 4 in-person interviews. Below are my observations.

July 19th, 2011

Loading OBJ and 3DS Models In Java World Wind

I’ve been following this thread for adding OBJ models in Java World Wind. Some people came up with some pretty good code on how to handle it here: http://forum.worldwindcentral.com/showthread.php?t=15222

December 15th, 2008

Download Pictures off your Cell Phone

I have Verizon Wireless and a Motorola MOTOKRZR K1m and found a way to get pictures off of it, so I thought I would share it. First, you need a USB cable that supports the mini-USB interface that the phone has. You can probably get it at just about any computer store (Radio Shack would have it). Then, download the USB driver from the Verizon website. After the driver is installed, you can plug in the phone and it will charge, but you cannot access the files. To get to the files, there is a program called BitPim. This program will let you download just about any data off the phone. It supports many more phones besides mine, so if you can find a way to connect your cell phone to your PC, you should be able to use this program as well. That’s it.

October 15th, 2008

Overclocking my E2160

Since my previous gaming machine died, I have been looking to replace it. I built my previous machine before I was married or had kids, and now it is hard for me to justify spending 2-3K on a computer. Luckily, I found a post on Tom’s Hardware laying out a $500 gaming machine. It seemed like a pretty good deal so I decided to go for it. Here are the parts (I upgraded the video card and added more disk space to the hard drive).

CPU: Intel Pentium E2160 Dual-Core $69.99
CPU Cooler: Cooler Master HyperTX2 $24.99
MB: Gigabyte GA-P35-DS3L $89.99
RAM: Wintec Ampo DDR2 PC-6400, 2Gb $35.99
Graphics: XFX 9600GT (Tom’s Hardware used a 8800GS) $169.99 – $30.00 rebate
Hard Drive: Seagate Barracuda 7200.1 320GB 16Mb $79.00

Case and PSU: Antec NSK4480B with Earthwatts 380W PSU $69.99

Case purchased from Buy.com, everything else was newegg.com
Total: $555.14 – $30 Rebate = $525.14

On Tom’s Hardware and various other sites, people have been able to overclock the 1.8Ghz processor to between 2.8 and 3.2 Ghz without too much effort. I’m hoping to, at a minimum, get 2.8 and maybe 3.2 if all goes well. Most people agree that leaving your memory at a 1:1 ratio with your CPU is the way to go so here are my goal settings:

3200Mhz = 400Mhz FSB, 8 Multiplier with Memory 1:1 *
2800Mhz = 400Mhz FSB, 7 Multiplier with Memory 1:1

3000Mhz = 333Mhz FSB, 9 Multiplier with Memory 1:1 (667Mhz)

2800Mhz = 311Mhz FSB, 9 Multiplier with Memory 1:1 (622Mhz)

*Tom’s Hardware achieved this.
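The arithmetic behind these targets is simply core clock = FSB × multiplier, and at a 1:1 ratio the effective DDR2 rate is twice the FSB. A small sketch with hypothetical helper names, in case you want to check other combinations:

```java
final class ClockMath {
    // Core clock in Mhz for a given FSB and CPU multiplier.
    static int coreMhz(int fsbMhz, int multiplier) {
        return fsbMhz * multiplier;
    }

    // Effective DDR2 data rate when memory runs 1:1 with the FSB.
    static int ddrEffectiveMhz(int fsbMhz) {
        return fsbMhz * 2;
    }

    // Percent gain of an overclocked speed over the stock speed.
    static double percentGain(double stockMhz, double overclockedMhz) {
        return (overclockedMhz - stockMhz) / stockMhz * 100.0;
    }
}
```

With whole-number inputs, 333 × 9 comes out to 2997Mhz and 311 × 9 to 2799Mhz, which round to the 3000 and 2800 targets; the 667Mhz memory figure corresponds to an FSB of 333.3Mhz. Going from the stock 1800Mhz to 3000Mhz is a gain of about 67%.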

I will probably be more conservative than Tom’s Hardware. I’m trying to keep my voltage below 1.4V because I want to keep this machine for a while and because this is my first time overclocking something like this. I plan on posting the results in the next few days as I overclock my machine.

So, with no overclocking, here is where my system is:

temps.PNG

*In the below picture, it is showing a 6x multiplier because it was in an idle state. The motherboard automatically lowers the speed under low load. It should be 200Mhz at 9x.
cpuz1.PNG

cpuz2.PNG

So, the temperatures are staying pretty cool because of the CPU fan I purchased. I am hoping to keep my max OC temp below 65C and at no overclock and idle I am getting 37C so I should be in good shape.

First Test 2.8Ghz

For my first test, I upped the FSB to 311 with a 9x multiplier to get 2.8Ghz. I left the voltages on auto for now. I might lower them later to help with heat, but I was only showing around 56C while running prime95 so it doesn’t seem to be an issue right now. I am not messing with the memory yet. 2.8Ghz didn’t cause any problems, so I decided to move up to 3.0Ghz.

Second Test 3.0Ghz

For this test, I changed the FSB to 333Mhz with a 9x multiplier to get 3.0Ghz. I also changed the memory multiplier to 2.4 to get it close to its peak 800Mhz. I still left all the voltages on auto and booted the system with no problems. I’m going to let prime95 run overnight, but right now it looks like the max CPU temp I am getting is around 58-60C, and that’s using the default voltage that the motherboard chooses (1.376 seems to be where it is running; I think the default is 1.325). So right now, I’m pretty happy with where I am. I haven’t done any kind of optimization with the memory yet though. I might try getting to 3.2Ghz in the next few days, but for now I’m happy with my 67% improvement.

Screenshots at 3Ghz while running prime95

temp

cpu1 at 3 Ghz

cpu2 at 3 ghz

I also overclocked my video card using ntune. I’ve heard that 10% is a pretty safe increase, so I upped the core from 650 to 715 and the memory from 900 to 990. For fun, I decided to run this configuration through the 3DMark06 test and compare it to the $500 Tom’s Hardware system. The main differences are that they overclocked their processor to 3.2Ghz and tweaked the memory some, while I ran at 3.0Ghz but upped the video card from an 8800GS to a 9600GT. I was happy that I was able to beat their overclocked system (though I only did one test and they had many different versions), and I also beat another low cost machine they built for $837 (though their overclocked version beat mine).

myscore.PNG

Some links I used when building this:

http://www.tomshardware.com/forum/244382-29-overclocking-e2160-gigabyte-965g

http://www.tomshardware.com/forum/246050-29-overclocking-e2160-0ghz

Overclocking Guide

$500 PC

http://www.hardforum.com/showthread.php?t=1278434

http://www.tomshardware.com/forum/246669-29-9600gt

http://www.overclockersclub.com/reviews/xfx_9600_gt/

http://www.legionhardware.com/document.php?id=729&p=1

Update

I went in and manually lowered my CPU voltage to 1.365 and bumped my memory up to 5-5-5-12 and everything is looking stable. I also OC’d my 9600GT to 727/1000 and reran 3DMark06. This time it went up a little to 10611. I could probably break 11000 if I OC’d to 3.2Ghz but I think I’m going to stick with what I have for a while.  oc.PNG

Update 2010-07-14
I upgraded from Windows XP SP 3 to Vista Home Premium 64bit with the above settings and received a 10321 score. So, moving to Windows 7 didn’t really seem to affect performance. I do notice that games look a lot nicer now that my PC can take advantage of Direct X10.

April 23rd, 2008

Netflix Prize

After reading about it for a month, I’ve decided to enter the Netflix Prize contest. The challenge is to come up with an algorithm that accurately predicts whether a user will like a movie. The goal is to get <0.85 RMSE on a qualifying data set that Netflix provides. Before starting I got some baseline numbers.

Using overall mean = 1.1296 RMSE

Removing movie mean = 1.05 RMSE

Removing movie and user mean = 0.9841 RMSE

So, without doing anything fancy, you can get to .98 just by removing global means.
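For anyone who wants to reproduce this kind of baseline, here is a minimal sketch of the three predictors on in-memory arrays. The names and the toy data in the usage note are hypothetical, and "removing the user mean" is interpreted here as adding each user's average offset from the movie mean; the numbers above were of course computed on the real Netflix data, not like this.

```java
import java.util.HashMap;
import java.util.Map;

final class BaselineRmse {
    static double rmse(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double err = predicted[i] - actual[i];
            sum += err * err;
        }
        return Math.sqrt(sum / actual.length);
    }

    // Mean of `values` grouped by id (a movie id or a user id).
    static Map<Integer, Double> groupMeans(int[] ids, double[] values) {
        Map<Integer, double[]> acc = new HashMap<>(); // id -> {sum, count}
        for (int i = 0; i < ids.length; i++) {
            double[] a = acc.computeIfAbsent(ids[i], k -> new double[2]);
            a[0] += values[i];
            a[1] += 1;
        }
        Map<Integer, Double> means = new HashMap<>();
        for (Map.Entry<Integer, double[]> e : acc.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    // Baseline 1: predict the overall mean rating everywhere.
    static double[] globalMeanBaseline(double[] ratings) {
        double sum = 0.0;
        for (double r : ratings) {
            sum += r;
        }
        double[] out = new double[ratings.length];
        java.util.Arrays.fill(out, sum / ratings.length);
        return out;
    }

    // Baseline 2: predict each movie's mean rating.
    static double[] movieMeanBaseline(int[] movies, double[] ratings) {
        Map<Integer, Double> movieMean = groupMeans(movies, ratings);
        double[] out = new double[ratings.length];
        for (int i = 0; i < ratings.length; i++) {
            out[i] = movieMean.get(movies[i]);
        }
        return out;
    }

    // Baseline 3: movie mean plus the user's average offset from it.
    static double[] movieUserBaseline(int[] movies, int[] users, double[] ratings) {
        double[] moviePred = movieMeanBaseline(movies, ratings);
        double[] residual = new double[ratings.length];
        for (int i = 0; i < ratings.length; i++) {
            residual[i] = ratings[i] - moviePred[i];
        }
        Map<Integer, Double> userOffset = groupMeans(users, residual);
        double[] out = new double[ratings.length];
        for (int i = 0; i < ratings.length; i++) {
            out[i] = moviePred[i] + userOffset.get(users[i]);
        }
        return out;
    }
}
```

On a toy set of four ratings (movies 0,0,1,1; users 0,1,0,1; ratings 1,3,4,4) the RMSE drops from about 1.22 with the global mean to about 0.71 with movie means and 0.5 with movie and user means. This sketch scores the same ratings it learned the means from, so it only illustrates the mechanics, not a held-out evaluation.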

I plan on using two different approaches to solving this problem: clustering and KNN using movie-to-movie neighbors. I'm hoping to beat 0.95, since that was the original Cinematch score on which Netflix wanted a 10% improvement.

-- Update

Using the clustering algorithm with 64 clusters and 10 iterations and removing movie and user means, I was able to get a 0.9466.  So, I've already beaten the 0.95 Cinematch score.  I'm going to try some different combinations of parameters for the clustering algorithm to see if I can get closer to 0.9 and then begin working on the KNN algorithm.

Netflix leaderboard

*I am ThisIsTooHardForMe on the leaderboard

April 16th, 2008
