MovieRecs

Final touches

4/27/2019

Having the presentation out of the way feels like a weight off my shoulders, but I'm still not quite satisfied with my project.

I spent a lot of time throughout the semester trying to research the "right" way to create a recommender, and not enough time actually creating one. I learned very quickly during the second half of the semester's work period that simply diving in is the best way to find opportunities to optimize the system.

I now have a functional infrastructure: the user interface allows user actions to invoke processes in the database and with the recommendation service, and the recommendation service returns meaningful data about movies the user should try out. That was really my base goal. But now I'd really like to optimize the recommender.

When a user's rating predictions are calculated, there tend to be about 2,000-4,000 predictions made, and there are a lot of ties among the highest ratings. That's a lot of potential choices, so there's no way it's as accurate as a user would like.

The first way I'm going to optimize is through dimension reduction. This method uses a threshold to determine whether or not an item belongs in the pool of recommendable items. Essentially, if an item has fewer than about half of the total possible number of ratings, it is removed from the item-rating matrix before similarities are calculated. This reduces the time cost and also surfaces movies that are more relevant.
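Roughly, the filter could look like this (a sketch, assuming the item-rating matrix is held as a pandas DataFrame with users as rows and movies as columns; the real code may store it differently):

```python
import pandas as pd

def filter_sparse_items(item_ratings: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop movies rated by fewer than `threshold` of all users.

    `item_ratings` is assumed to be a users-by-movies DataFrame with NaN
    where a user has not rated a movie.
    """
    min_ratings = threshold * len(item_ratings.index)  # about half of the possible ratings
    rating_counts = item_ratings.count(axis=0)          # non-NaN ratings per movie column
    keep = rating_counts[rating_counts >= min_ratings].index
    return item_ratings[keep]                            # reduced matrix for the similarity step
```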

The second way I'm going to optimize is by adding a layer of comparison to the nearest-neighbors portion of the prediction process. I can take genre into account when providing the K top predictions by ranking the genres for a given user and using that ranking as an extra criterion when KNN selects the recommendations.
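A sketch of what that extra layer might look like (the data layouts and function names here are illustrative, not my final code):

```python
from collections import defaultdict

def rank_genres(user_ratings, movie_genres):
    """Score genres by the user's average rating of movies in each genre.

    user_ratings: {movie_index: rating}; movie_genres: {movie_index: [genre, ...]}.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for movie, rating in user_ratings.items():
        for genre in movie_genres.get(movie, []):
            totals[genre] += rating
            counts[genre] += 1
    return {g: totals[g] / counts[g] for g in totals}

def top_k_with_genres(predictions, movie_genres, genre_scores, k=5):
    """Pick the K highest predictions, breaking ties with the user's genre ranking."""
    def key(item):
        movie, predicted = item
        genre_bonus = max((genre_scores.get(g, 0.0) for g in movie_genres.get(movie, [])), default=0.0)
        return (predicted, genre_bonus)
    return sorted(predictions.items(), key=key, reverse=True)[:k]
```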

I think that implementing both of these ideas will result in a clear improvement upon the current iteration. I'd also like to use some type of accuracy function that compares predicted ratings with ratings the user has already given - but I'm not sure I'll have that done before my final defense.
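If I do get to it, the simplest option would probably be root-mean-square error over ratings the user has already given (holding some of them out of the training data). A quick sketch:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and known ratings.

    Both arguments are {movie_index: rating} dicts; only movies present in both count.
    """
    common = set(predicted) & set(actual)
    if not common:
        return None
    return math.sqrt(sum((predicted[m] - actual[m]) ** 2 for m in common) / len(common))
```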

Big status update

4/21/2019

It's been far too long since my last post - so hopefully this long post gives a good glimpse into what I've been working on, before presentation week begins.

1. The first thing I should update you on is the machine learning component of my project. About three and a half weeks (and several blog posts) ago, I discovered a standard approach to item recommendation called collaborative filtering. There are two primary types of CF: user-based (UBCF) and item-based (IBCF). User-based CF compares users based on how they've rated items, and then recommends items purchased by similar users to those who haven't purchased them yet. Item-based CF compares items based on how they've been rated by users, and then uses those similarity ratios in conjunction with a user's past ratings to predict how the user will rate items that he/she has not yet rated. I decided to go with the item-based method because its heaviest computation can happen "offline," which speeds up the user experience - at least that was my understanding of the upside to IBCF. I then planned to use a basic k-nearest neighbors selection to choose the 5 highest rating predictions for a user and provide those movies as recommendations. I was able to implement this in Python. Wanting to leave time to build the user interface and make sure I could successfully map a dataset to a user-facing application and database model, I left the rating system as a basic one-dimensional 1-5 scale. This is probably the biggest downfall of my project, but other than that things have turned out quite well. Here is a nice article that guided me through most of my IBCF implementation:

https://medium.com/@wwwbbb8510/comparison-of-user-based-and-item-based-collaborative-filtering-f58a1c8a3f1d

The IBCF implementation is a class (IBCF) that takes in a Python dictionary of User objects and a list of Movie objects. After the actual user and movie records are retrieved from the database, and before they are passed into the IBCF constructor, they are mapped to Python objects (User and Movie, respectively) so that they carry a decimal index rather than the hex ID assigned by MongoDB upon insertion. The User object contains its MongoDB ID for later recognition, the decimal index, and a dictionary of ratings (<movie-index>:<rating>). The Movie object contains its MongoDB ID and the decimal index. Upon instantiation of the IBCF object, the users and items are mapped into an item-rating matrix, which makes it easier to calculate item similarities. A class function then generates an item-item similarity matrix from the item-rating matrix, and finally another function uses this similarity matrix to predict the ratings users will give to movies they have not yet rated.
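To make that flow concrete, here is a heavily condensed sketch of the shape of the class - cosine similarity and simplified attribute names, not my exact code:

```python
import math

class IBCF:
    """Condensed sketch of the item-based CF flow described above.

    users:  {user_index: User}, where User.ratings is {movie_index: rating}
    movies: list of Movie objects, each with a decimal .index
    (attribute names here are illustrative)
    """

    def __init__(self, users, movies):
        self.users = users
        self.movies = movies
        # item-rating matrix: {movie_index: {user_index: rating}}
        self.item_ratings = {m.index: {} for m in movies}
        for u_idx, user in users.items():
            for m_idx, rating in user.ratings.items():
                self.item_ratings.setdefault(m_idx, {})[u_idx] = rating

    def build_similarities(self):
        """Item-item similarity matrix: cosine similarity over the rating vectors."""
        self.sims = {}
        for a, ra in self.item_ratings.items():
            for b, rb in self.item_ratings.items():
                common = set(ra) & set(rb)
                num = sum(ra[u] * rb[u] for u in common)
                den = math.sqrt(sum(v * v for v in ra.values())) * \
                      math.sqrt(sum(v * v for v in rb.values()))
                self.sims[(a, b)] = num / den if den else 0.0

    def predict(self, user):
        """Predict ratings for unrated movies as a similarity-weighted average of the user's ratings."""
        predictions = {}
        for movie in self.movies:
            if movie.index in user.ratings:
                continue
            num = sum(self.sims.get((movie.index, m), 0.0) * r for m, r in user.ratings.items())
            den = sum(abs(self.sims.get((movie.index, m), 0.0)) for m in user.ratings)
            if den:
                predictions[movie.index] = num / den
        return predictions
```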

2. The next big piece of the project is the interface that allows users to interact with a catalog of movies and make their ratings. I stuck fully with the Node.js application and the Express.js framework for routing requests. The EJS templating language worked great for making my views dynamic and presenting data to the user. Node applications lend themselves well to a Model-View-Controller design, and that's what I followed. Models are represented by Mongoose models; Mongoose is an object-modeling framework built on top of MongoDB's Node.js driver. The views are EJS files, which contain HTML markup plus special tags that let JavaScript be embedded directly in the file to output object data when the page is rendered. The controllers are written in JavaScript and handle all of the back-end functionality once an HTTP/HTTPS request is received. Overall, the application ended up being extremely modular.

3. Another crucial aspect of the project is the dataset I chose to use. My last post talked about the datasets provided by MovieLens, which is maintained by GroupLens, a research group at the University of Minnesota. I was originally thinking of using the largest dataset available (27 million ratings), but upon further inspection and an attempt to write a mapping script, it was not formatted consistently and would have taken eons to parse successfully. I chose to go with their 100k-rating dataset, which they suggest is best for education and development. The files included with this set are formatted consistently, which made writing a mapping script a lot easier. I still have to finalize the initial population of my database with this dataset, and I am trying to decide how best to demo my application.
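The heart of the mapping script is just parsing the ratings file into per-user dictionaries before creating documents. A sketch, assuming the ml-latest-small layout (the 100k set MovieLens recommends for education and development); the older ml-100k release ships a tab-separated u.data file instead, so the reader would need adjusting for that set:

```python
import csv

def load_ratings(path="ratings.csv"):
    """Build {user_id: {movie_id: rating}} from a MovieLens ratings file.

    Assumes a CSV with a userId,movieId,rating,timestamp header, per the
    MovieLens README; verify against the actual files before relying on it.
    """
    ratings = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            ratings.setdefault(int(row["userId"]), {})[int(row["movieId"])] = float(row["rating"])
    return ratings
```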

4. Connecting all of these components was another challenge. I could not run the recommender computations on the same machine as the web app because they would bog down the rendering of the user-facing application. Thus I hosted the user-facing application on one server and the IBCF service on another through DigitalOcean. DigitalOcean provides virtual machines, which I connected to custom domain names; both are running Node.js servers. The database is hosted by MongoDB Atlas, a cloud database hosting service. Both servers can connect to and perform CRUD operations on the same DB - a choice I made so that computations could run asynchronously on the IBCF server and then store the most recent set of recommendations on the user records in the DB. The user-facing application can then dynamically load new recommendations as they change according to the data available at the time. The user application communicates with the IBCF service like a RESTful API, over HTTP requests. Currently, the IBCF service is set to re-run the computation for all users whenever a new rating is provided, because one of my requirements is that a user's new ratings improve their recommendations in the future. Technically, the recommendations should improve with greater volumes of data, but the improvement may be marginal unless I implement some sort of accuracy test that could tweak the way movies are compared. I may have to run this computation less frequently depending on the time it takes over a large set of data.
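The storage side of that hand-off is simple: after the Python computation finishes, the latest recommendations get written onto the user documents. Something along these lines, with placeholder collection and field names ('users', 'recommendations') - the real schema may name things differently:

```python
from pymongo import MongoClient

def store_recommendations(uri, user_id, movie_ids):
    """Write the latest recommended movie IDs onto the user's document.

    `uri` is the MongoDB Atlas connection string (assumed to name a database);
    the collection and field names here are placeholders for illustration.
    """
    client = MongoClient(uri)
    db = client.get_default_database()
    db.users.update_one({"_id": user_id}, {"$set": {"recommendations": movie_ids}})
```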

This is what I've been up to since my last post and between now and Wednesday (my presentation day) I will be conducting as much testing as I can and trying to determine the best way to demo my application so that people can see what is actually going on.

27,000,000

3/7/2019

Through reading a Medium blog post about Movix.ai, I learned about another dataset that I will end up using. The dataset comes from MovieLens, which is supported by GroupLens, a research group at the University of Minnesota.

The latest update to their dataset was in September of 2018, so it is fairly recent. This does not mean that my application needs to be limited to movies released before then, though. I can use their set of movies, tags, and their rating system, and extend it with newer releases from IMDb's datasets, since each movie in the MovieLens dataset also carries its corresponding IMDb ID. It contains 27,000,000 ratings, by the way.
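The file that makes that extension possible is links.csv, which ships with the dataset; reading it is trivial (column names per the MovieLens README - worth double-checking against the actual download):

```python
import csv

def imdb_links(path="links.csv"):
    """Map MovieLens movieId -> IMDb ID using the dataset's links.csv file.

    The imdbId column is a zero-padded number, so prefixing "tt" gives the
    full IMDb identifier (e.g. tt0114709).
    """
    with open(path, newline="") as f:
        return {int(row["movieId"]): "tt" + row["imdbId"] for row in csv.DictReader(f)}
```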

I've also now implemented a barebones set of functions for running a nearest-neighbors classification in Python using only standard library modules. I'll need to extend this quite a bit to handle high-dimensional data points. It can easily be invoked within a Node.js route by spawning a child process - meaning a user updating their feedback can cause new recommendations to be made in some way, shape, or form.
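For the curious, a barebones classifier of that flavor fits in about a dozen lines of standard-library Python (this is a sketch of the idea, not my exact functions):

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k=5):
    """Tiny k-nearest-neighbors classifier using only the standard library.

    labeled_points: list of (feature_vector, label) pairs; query: feature vector.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(labeled_points, key=lambda p: distance(query, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```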

I want to think about my project a little differently from here on out. I have a functional understanding of how the application will run and call services, and of how my views will be rendered. Now I need to see the core service, the recommendation engine, as a function with inputs and outputs. The inputs are user feedback or updated data from an external dataset. The output is a new set of predictions/classifications.

Yup, that's pretty much where I'm at.

Registration is now in session

2/25/2019

So I reached my goal for the week of getting the user registration system working. Awesome.

Rather than a screenshot of the Express project directory - here is a listing of its structure and each sub-directory's function:
  • client
    • helpers
      • Contains all client-side scripts
    • styles
      • Contains all client-side CSS files
  • models
    • Will contain all Mongoose Object Model definitions
  • node_modules
    • Contains all npm dependency packages
  • routes
    • Contains all route handlers for the Express app (get, post, etc.)
  • views
    • pages
      • Contains all .ejs files that represent whole pages of the application
    • partials
      • Contains all .ejs files that represent components of a page
  • app.js --> The entry point of the application
  • db-connect.js --> Connects the application with the database

Per McVey's advice and proper planning, I need to start working on the workhorse of the application: the machine learning backend component. In order to start thinking about that, I need to understand the data that I have available for movies.

I was quite disappointed when I looked into IMDb's datasets. They do not include any tags or descriptors for the movies, and I don't think they even contain a synopsis I could parse. I may try to find an alternate dataset. I am considering the Open Movie Database (OMDb) because they provide a free API - all you have to do is get an API key.

This would require a bit more compute than simply parsing IMDb's datasets, since I'd have to make an API call for each title in the IMDb title file. There are several other data sources I'm looking at too.
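The per-title lookup itself would be straightforward: OMDb takes an API key plus an IMDb ID (or a title) as query parameters. A sketch using the requests library:

```python
import requests

OMDB_URL = "http://www.omdbapi.com/"

def fetch_movie(imdb_id, api_key):
    """Look up one title in OMDb by its IMDb ID (the 'i' query parameter).

    The returned JSON includes fields like Title, Genre, and Plot; see OMDb's
    docs for the full list. api_key is the free key you register for.
    """
    resp = requests.get(OMDB_URL, params={"apikey": api_key, "i": imdb_id, "plot": "full"})
    resp.raise_for_status()
    return resp.json()
```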


Setting up a Cloud-hosted database

2/18/2019

What I've accomplished since my last blog post:
  • Gained some really solid insight from our whiteboarding sessions
    • Ways to organize data so that recommendations can be as effective as possible
    • What the rating system might look like
    • Active vs passive users
  • Learned about the Router object that comes with ExpressJS
    • It allows you - with a simple syntax - to define multiple routes very easily, especially if they are going to be accessed through similar parts of the site
  • Created a MongoDB Atlas free tier account
    • This will be managed by MongoDB and hosted on the Google Cloud Platform for free

What I'm working on this week:
  • Tonight I hope to get my app connecting to the DB; I was having trouble this weekend. Mongoose, the object modeling framework for MongoDB in Node, gets passed a URI that holds username/password authentication for the DB, but for some reason it is not being parsed correctly. Troubleshooting to ensue after this post
  • Also tonight, I am going to embed a functional PERT-type chart on my project page so that our PMs (McVey and DCP) can actually gain some visibility into my project. Also on this front, I hope to get some screenshots of the Express app directory structure in my next blog post
  • This week, now that I more thoroughly understand the way that Express routes its requests, I want to nail down what the areas of the application will be. This means pages, links, interfaces, and forms all laid out. I don't need to have them looking fancy yet, but I want a barebones version of MY project so that I can start building my data model. I'm going to work on this Thursday and Friday nights and finish what is left on Saturday morning. My next blog post will include a top-level map of the application.
What is to come:
  • I need to determine which data points from the IMDb datasets I will need to pull into my app and how I will automate the renewal of that data each night.
  • I desperately need to determine how my second machine learning algorithm will work in conjunction with the first and start building a testable implementation of this. I foresee this portion of my project taking the most time to test and ultimately perfect.

"Do... or do not. There is no try." ~ Yoda
Update: My app has made contact with my database. I repeat, contact has been made.

A more solidified plan

2/11/2019

After doing some experimentation and research on NodeJS, ExpressJS, and MongoDB I've come to a few decisions.

1. NodeJS will be a good back-end because it's geared towards performance, and there will be lots of asynchronous processing going on in the background with the machine learning algorithms and the back-and-forth with the database.
2. ExpressJS is a really lightweight NodeJS routing framework: it lets you use any other technologies you'd like and doesn't hide any of the low-level NodeJS functionality from you in case you need to dive deep (probably won't be necessary for this project unless the machine learning bogs things down, in which case I could implement it in something like C or C++).
3. EJS stands for Embedded JavaScript templates, which work similarly to PHP code islands in an HTML file. This will allow me to build dynamic client-side components. Express knows what these are and doesn't even require a file extension when you render them.
4. MongoDB will serve as my database and will be hosted by MongoDB themselves. They offer up to 512MB of free storage in the cloud. This will allow me to move my app source code around from my local VM to the Comp Sci server and still be able to effectively access the database. Also, I won't have to worry about figuring out a way to back things up.
5. Bootstrap will serve as a CSS framework to make everything about the user interface fully responsive and groovy.

It will be a service-based application, meaning that the routing on the backend will essentially be a collection of different "services" that each perform a very specific function. This should keep individual requests lightweight, which means lower latency and a better user experience.

The last remaining proof of concept that I need to execute on is the two machine learning algorithms. 

Those will be k-nearest neighbors and one other which I have not yet decided on.


Down the runway

2/4/2019

Since the official first week of the semester (the week of Jan. 21st) did not see actual capstone assignments until mid-week, I'm going to consider this week 2 of the capstone project for blogging purposes.

I've now laid out the general workflow for the project and have tentatively determined the stack that I'll be using to build the service. To see an up-to-date scoping breakdown, click here. That being said, I have a couple goals in mind for completion by next Sunday (Feb. 10th).

Tentative goals for the week:
  1. Research Node.js implementation and determine a software design pattern/architecture that will work well with MovieRecs
  2. Research Express and make sure it is a good fit as a JS server-side framework for the chosen architecture
  3. Determine exactly where the data will come from and how it will be accessed
  4. Determine - based on the dataset format - what the data model might look like and how to implement MongoDB
  5. Research React Native and begin building some basic client-side components
  6. Draft a diagram of the system and how it will work at a high-level (for starters)

For starters...

1/30/2019

For the first week since being assigned my project, I've not really started any sort of implementation. Due to some job opportunities, I have been fairly busy studying for assessments provided by those companies. That said, I do have a general vision of how this project will unfold.

I have been tasked with building a movie recommendation engine. Based on a user's ever-changing feedback, a set of preferences will be persisted that can be used to query fitting movies from a set of data. That data will probably come from a movie database like IMDb. In fact, I found out that IMDb provides nightly text files that contain their entire catalogue in tab-delimited format - that is one source of current information, which I could update nightly with a scheduled scrape of those files.
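Pulling those files down should be simple enough. A sketch of what the nightly job might do - the URL is the main title file listed on IMDb's dataset page as of this writing, and is worth verifying before scheduling anything:

```python
import csv
import gzip
import io
import urllib.request

# IMDb publishes gzipped TSV files; title.basics.tsv.gz is the main catalogue file.
URL = "https://datasets.imdbws.com/title.basics.tsv.gz"

def fetch_titles():
    """Download the title file and yield one dict per row (tconst, primaryTitle, genres, ...)."""
    with urllib.request.urlopen(URL) as resp:
        with gzip.open(io.BytesIO(resp.read()), mode="rt", encoding="utf-8") as f:
            yield from csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
```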

I am looking to run a Node.js server on Ubuntu with the Express framework. One machine learning algorithm I will use is k-nearest neighbors, and the other is yet to be determined. Wherever the data ends up coming from, I want it stored in a non-relational DB such as MongoDB, because the data model is pretty liable to change throughout the development process. The client side will be done with React Native so that it will run on iOS.

Tasks to come:
  1. Figure out how node.js operates and determine how I will break up the different services involved with movie recommendations
  2. Determine which other machine learning algorithm I will use and find out if there is a helpful JS library that could be used for implementing it
  3. Learn how React Native works on mobile and start thinking about the various front-end components and logical hierarchy for organizing those
  4. Decide on a good source for data on movies