DATA Capstone Blog

Philosophy Statement

Data has the power to do incredible things when represented effectively. By properly utilizing raw data, we can uncover patterns, promote smarter decisions, and create useful applications. The only way to anticipate and improve our future is by looking at the past. Data serves as the truest compass to guide us.

Weeks 1 & 2

Planning and First Steps

I am creating a movie recommendation model that takes input from a user prompt. I have acquired a key for the TMDB API, which provides all of the necessary data; this data will be loaded into an SQL database (probably a few thousand movies). The prompt will be fed into a Named Entity Recognition (NER) model that will recognize different filtering mechanisms, such as genre, year, and name.

An example prompt could be “80s sci-fi movies.” The results will be filtered for the determined attributes (1980s year, sci-fi genre) and sorted by the movie's rating. This is the essential function of the project, but I plan to expand the model to recognize more obscure suggestions (examples: “Murder mysteries in an isolated setting,” “Movies with characters that lose their minds”).

This will be accomplished using a pre-trained NLP model (likely SBERT) that will pick up on semantic similarities between the prompt and the movie's description, alongside keywords. I will need to determine sorting logic, factoring in both relevance and rating.
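To illustrate the idea, here is a minimal sketch of prompt-to-overview similarity with the sentence-transformers library; the model checkpoint and example texts are placeholders, not my final choices:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder checkpoint; any SBERT model works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

prompt = "Murder mysteries in an isolated setting"
overviews = [
    "A detective investigates a killing at a remote mountain lodge.",
    "A romantic comedy about two chefs in New York City.",
]

# Encode the prompt and each overview into dense vectors, then
# score overviews by cosine similarity to the prompt.
prompt_vec = model.encode(prompt, convert_to_tensor=True)
overview_vecs = model.encode(overviews, convert_to_tensor=True)
scores = util.cos_sim(prompt_vec, overview_vecs)[0]
for text, score in zip(overviews, scores):
    print(f"{score.item():.3f}  {text}")
```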

Aside from planning, I have created the MySQL database along with the three tables I will need. My next step will be populating the DB using data from the TMDB API. This will require a MySQL connector for Python, which I have installed in my virtual environment.
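For reference, the connection itself is only a few lines with the mysql-connector-python package (credentials and the table name here are placeholders):

```python
import mysql.connector

# Placeholder credentials; in practice these belong in environment variables.
conn = mysql.connector.connect(
    host="localhost",
    user="capstone",
    password="secret",
    database="movies",
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM movie")  # hypothetical table name
print(cur.fetchone()[0])
conn.close()
```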

Week 3 (2/10-2/14)

Populating the Database

I have started writing the functions that will allow me to fetch data from TMDB. So far, I have functions built to populate the movie table using either a TMDB ID or an IMDb ID. I have located the top 1,000 IMDb IDs by rating to get me started with populating the database.
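A rough sketch of what these fetch functions look like with the requests library; error handling and field selection are simplified, and the key is a placeholder:

```python
import requests

API_KEY = "YOUR_TMDB_KEY"  # placeholder
BASE = "https://api.themoviedb.org/3"

def fetch_by_tmdb_id(tmdb_id):
    """Fetch a movie record directly by its TMDB ID."""
    resp = requests.get(f"{BASE}/movie/{tmdb_id}", params={"api_key": API_KEY})
    resp.raise_for_status()
    return resp.json()

def fetch_by_imdb_id(imdb_id):
    """Resolve an IMDb ID (e.g., 'tt0111161') to its TMDB movie record."""
    resp = requests.get(
        f"{BASE}/find/{imdb_id}",
        params={"api_key": API_KEY, "external_source": "imdb_id"},
    )
    resp.raise_for_status()
    results = resp.json()["movie_results"]
    return results[0] if results else None
```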

My next step will be to write the functions for populating the people table, which will likely be more complex.

Week 4 (2/17-2/21)

Database Change and Determining Movies to Include

A major part of my project is semantic similarity matching with SBERT, which finds matches by comparing vector embeddings of the text data. I have decided to switch to PostgreSQL because of its pgvector extension, which supports efficient storage and querying of vector embeddings. MySQL doesn't have built-in support for vector embeddings, meaning I'd have to retrieve and compare them manually in Python. PostgreSQL is the clear choice, especially if the project ever needs to be scaled.
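For context, the pgvector workflow I'm moving to looks roughly like this from Python (credentials, table, and column names are illustrative, and 384 dimensions is just a common SBERT output size):

```python
import psycopg2

conn = psycopg2.connect(dbname="movies", user="postgres")  # placeholder credentials
cur = conn.cursor()

# Enable the extension and add a vector column sized to the embedding model.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("ALTER TABLE movie ADD COLUMN IF NOT EXISTS overview_embedding vector(384)")
conn.commit()

# Nearest neighbors by cosine distance (pgvector's <=> operator).
embedding = [0.1] * 384  # stand-in for a real prompt embedding
literal = "[" + ",".join(str(v) for v in embedding) + "]"
cur.execute(
    "SELECT title FROM movie ORDER BY overview_embedding <=> %s::vector LIMIT 10",
    (literal,),
)
print(cur.fetchall())
```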

I have completed all of the functions to fetch the data for each table. I am still deciding how many movies to include, but I think I will start with 10,000. The movies I collect will be selected by vote_count, a measure of popularity that is not biased by recency.
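A short sketch of how that selection can be pulled, assuming TMDB's discover endpoint (pagination kept minimal, no rate limiting):

```python
import requests

API_KEY = "YOUR_TMDB_KEY"  # placeholder

def movies_by_vote_count(pages=500):
    """Yield movie records in descending vote_count order (20 per page)."""
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://api.themoviedb.org/3/discover/movie",
            params={"api_key": API_KEY, "sort_by": "vote_count.desc", "page": page},
        )
        resp.raise_for_status()
        yield from resp.json()["results"]
```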

My next step will be training the NER stage using spaCy in Python.

Setup link (pgvector): https://github.com/pgvector/pgvector?tab=readme-ov-file#installation
Info (pgvector): https://www.timescale.com/learn/postgresql-extensions-pgvector

Week 5 (2/24-2/28)

NER Model Setup and Training

This week was focused on developing the NER component of my project. After playing around with the base spaCy model, I realized that it does a great job of recognizing names, runtimes, and years. However, it does not pick up signifiers like "after", "before", or "under", and the only entity type it is missing entirely is genre. After some more research, I found that spaCy allows you to merge a custom pipeline with the base model. Because of this, I am choosing to train a custom model to recognize 'GENRE' and 'SIGNIFIER' entities. This pipeline will then be combined with the base model, which will pick up 'DATE' (years) and 'TIME' (runtime). One thing to consider is that the model isn't compressed when you restrict it to specific entities (DATE and TIME); it simply ignores all other entities. Fortunately, spaCy is extremely fast, but I will keep this in mind if I need to reduce inefficiencies later on.
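To make the training approach concrete, here is a minimal sketch of custom NER training in spaCy 3; the labels match mine, but the two examples, epoch count, and output path are toy stand-ins for the real training setup:

```python
import spacy
from spacy.training import Example

# Toy training examples; the real set is larger and hand-labeled.
TRAIN_DATA = [
    ("80s sci-fi movies", {"entities": [(4, 10, "GENRE")]}),
    ("comedies under 90 minutes", {"entities": [(0, 8, "GENRE"), (9, 14, "SIGNIFIER")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _start, _end, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _ in range(20):  # epoch count is arbitrary here; tuned in practice
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer)

nlp.to_disk("./genre_signifier_model")  # hypothetical output path
```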

My next step will be setting up the filtering mechanism using the extracted entities from the NER stage.

Week 6 (3/3-3/7)

New Computer and Cleaning Data

My previous laptop was several years old and lacked computational power. It was time for an upgrade regardless, but this project provided extra motivation. The new laptop I purchased is a significant upgrade, especially with regard to the GPU, which will be important for handling the vector operations needed for semantic matching. After purchasing the laptop, I had to reinstall all of the necessary software (PostgreSQL, Python, Jupyter Notebook, etc.) and set up my virtual environment. Additionally, I had to recreate my database and transfer over the project files. So far, I am noticing a major improvement in speed!

The other focus of this week was actually populating the database. I had previously written all of the necessary functions to fetch the data from TMDB, but I was waiting for my new computer before populating. After trying to populate the movie table from a CSV file, I quickly learned that I had underestimated the amount of cleaning this data requires. There are quite a few stray characters within the text variables of some movies. To populate the tables from CSV files, the encoding needs to be consistent (UTF-8), which was not the case throughout the file. I attempted to convert the entire file, but some characters remained incompatible. It seems I will need to remove these manually using Find & Select in Excel. Using the error output from the populate-via-CSV command, I can locate problematic rows and fix them. Fortunately, the stray characters are additions, meaning the data is not compromised by their deletion.
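As an alternative to hunting rows down in Excel, a short script can flag every line that fails UTF-8 decoding (a sketch; the filename is a placeholder):

```python
# Scan the raw CSV for lines that are not valid UTF-8 so the
# offending rows can be located and repaired.
bad_rows = []
with open("movies.csv", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad_rows.append(lineno)
print(f"{len(bad_rows)} problematic rows: {bad_rows[:10]}")
```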

Week 7 (3/10-3/14)

Merging Spacy Pipelines

My database has been cleaned and populated with 10,000 movies. Next, I am merging my custom spaCy pipeline with the base pipeline. There is not much room to improve these models' speed, but fortunately they are already very fast. I will need to isolate the relevant entities from each model and handle overlapping entities. After some testing, it seems that spaCy prioritizes the base model's decisions when two entities overlap (because it is listed first when the ent objects are combined). This is ideal for me because the custom model is much more prone to false positives. I have noticed (and recorded) a number of testing inaccuracies, but I will address these later; right now I am focused on building the foundation.
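A sketch of how the merge can work, assuming the custom model is saved at a hypothetical path; spaCy's filter_spans prefers longer spans and, for identical spans, the first listed, which matches the base-model-first behavior I observed:

```python
import spacy
from spacy.util import filter_spans

base_nlp = spacy.load("en_core_web_sm")
custom_nlp = spacy.load("./genre_signifier_model")  # hypothetical path

def combined_entities(text):
    """Run both pipelines and merge entities, listing the base model's
    spans first so they win when two identical spans conflict."""
    doc = base_nlp(text)
    base_ents = [e for e in doc.ents if e.label_ in ("DATE", "TIME")]

    # Re-anchor the custom model's spans onto the base doc so every
    # span belongs to a single Doc object.
    custom_ents = []
    for e in custom_nlp(text).ents:
        span = doc.char_span(e.start_char, e.end_char, label=e.label_)
        if span is not None:
            custom_ents.append(span)

    # filter_spans drops overlaps, preferring longer spans (and the
    # first-listed span when they are identical).
    doc.ents = filter_spans(base_ents + custom_ents)
    return [(e.text, e.label_) for e in doc.ents]
```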

I also spent a lot of time working on the industry capstone project. I cannot go into detail about the specific tasks I am doing (for confidentiality reasons), but I can provide a general overview. This week was spent on EDA for our different tables, as well as EDA based on certain conditions that are relevant to the project.

Spring Break (3/15-3/23)

Building a Dynamic Query from NER

After some research, it seems that the best approach for adding parameters to the WHERE clauses will be a combination of hardcoding and fuzzy matching. For the structured filtering aspect of my project, there are only a small number of realistic formats the extracted NER strings can take. Obviously, I would love something intuitive that could map any possible entity to its final output (even something ridiculous like 'odd number years'), but I have to be realistic and prioritize accuracy on the most common requests. My plan is to store all of the string possibilities in dictionaries along with the output parameters that will be injected into the WHERE clauses. Fuzzy matching will add another layer of insurance for odd formats or misspelled words by finding dictionary strings with high similarity scores. With this solution, I should be able to handle any genuine query, just not those whose sole purpose is breaking the model. Additionally, I could build functions to parse strings for uncommon parameters that are not matched (like '987 minutes or longer'), but this is not worth my time at this point.
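A sketch of the lookup-plus-fuzzy-fallback idea using the rapidfuzz library; the dictionary contents are toy examples of the mappings I'm describing:

```python
from rapidfuzz import fuzz, process

# Toy mapping from extracted NER strings to canonical genre parameters.
GENRE_MAP = {
    "sci-fi": "Science Fiction",
    "science fiction": "Science Fiction",
    "comedy": "Comedy",
    "comedies": "Comedy",
    "murder mystery": "Mystery",
}

def resolve_genre(ent_text, cutoff=85):
    """Exact dictionary lookup first, fuzzy match as insurance for typos."""
    key = ent_text.lower().strip()
    if key in GENRE_MAP:
        return GENRE_MAP[key]
    match = process.extractOne(key, GENRE_MAP.keys(),
                               scorer=fuzz.ratio, score_cutoff=cutoff)
    return GENRE_MAP[match[0]] if match else None

print(resolve_genre("comedey"))  # misspelled -> "Comedy"
```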

My method of hardcoding entities to determine parameters may seem concerning on the surface, but these reasons justify my decision:
1) It is exceedingly difficult to find and train an ML model that can provide the structured output I need. LLMs are extremely powerful at picking up on the intent of a query, but they are unreliable in providing a consistent output format. I may consider a solution along these lines later on, which would likely have an LLM or something similar refine the query to make intent more clear. However, this would likely reduce my accuracy on the most important and common query types.
2) Dictionary lookups are extremely fast, and I can anticipate almost every query type I will need to convert into parameters (only a few hundred maximum for each entity type). The NER component takes care of identifying the pieces that become parameters, which greatly limits the scope of possibilities, as well as the computational effort required to refine the parameters.

Industry project update: This week we looked deeper into the classification variable. We are somewhat struggling due to data quality, but we worked hard to find any trends or information that may be valuable for modeling. We also compiled an aggregated dataset that we can use for any relevant modeling paradigm.

Week 8 (3/24-3/28)

Stage 1 Up & Running

My class demonstration was this week, so I needed to get a working model together to present. The last piece of functionality I needed was the dynamic query, which I successfully built. Currently, stage 1 of my recommendation system (NER -> structured filters) is up and running. So far, I am happy with its performance, but there is still quite a bit of logic to add and cases to account for. The most notable gaps are fuzzy matching, 'around' signifier logic for runtime, and AND/OR logic for genres and people. My classmates and professors provided some really great feedback, and there are several suggestions I want to implement. My next focus is building the functionality for the semantic matching stage. I am highly satisfied with my initial testing on a few embeddings and have high hopes for how this stage will turn out. I anticipate that this stage won't take long, so I should have plenty of time to clean up stage 1 before moving on to the ranking logic for the entire model.
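A simplified sketch of the dynamic query idea: filters derived from the extracted entities are assembled into parameterized WHERE clauses (the filter keys and schema are illustrative, and user text is never interpolated into the SQL string):

```python
def build_query(filters):
    """Assemble a parameterized query from entity-derived filters."""
    clauses, params = [], []
    if "genre" in filters:
        clauses.append("genre = %s")
        params.append(filters["genre"])
    if "year_min" in filters:
        clauses.append("release_year >= %s")
        params.append(filters["year_min"])
    if "year_max" in filters:
        clauses.append("release_year <= %s")
        params.append(filters["year_max"])

    sql = "SELECT title FROM movie"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY rating DESC LIMIT 20"
    return sql, params

# e.g., "80s sci-fi movies":
# cur.execute(*build_query({"genre": "Science Fiction",
#                           "year_min": 1980, "year_max": 1989}))
```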

Industry project update: This week was spent on creating classification models. Due to class imbalance and other issues, modeling is a challenge. However, we are pursuing several avenues to combat this. My focus was on standard decision trees and gradient boosted trees.

Week 9 (3/31-4/4)

Created Embeddings and Populated Collection Data

After starting on the semantic matching stage last week, my next step was to populate the database with the computed vector embeddings of the text data (overview and keywords). The only obstacle was formatting the vectors properly before insertion, but otherwise it went smoothly. Querying for similarity is really simple, and I continue to be impressed with the model's abilities. I also worked on adding collection data to the database, which links movies that are part of the same series (Star Wars, The Matrix, etc.). I plan to implement a feature that allows users to see rankings for a collection, and this data is essential for that. I will likely make this feature a button (similar to the buttons in ChatGPT) in order to limit the complexity of prompt inputs, which are already complicated enough.
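The formatting fix boils down to serializing each embedding as a bracketed literal that pgvector can parse; a sketch, with placeholder model, credentials, and column names:

```python
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint
conn = psycopg2.connect(dbname="movies", user="postgres")  # placeholder credentials
cur = conn.cursor()

cur.execute("SELECT movie_id, overview FROM movie WHERE overview_embedding IS NULL")
for movie_id, overview in cur.fetchall():
    vec = model.encode(overview)
    # pgvector parses a '[v1,v2,...]' string literal into a vector.
    literal = "[" + ",".join(f"{v:.6f}" for v in vec) + "]"
    cur.execute(
        "UPDATE movie SET overview_embedding = %s::vector WHERE movie_id = %s",
        (literal, movie_id),
    )
conn.commit()
```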

Industry project update: This week was focused on feature engineering for our final model. I worked on developing several features, mostly related to capturing the time elements of our data.

Week 10 (4/7-4/11)

Signifier and Stage-Triggering Logic

Today I went back to my structured filtering stage to handle signifier logic. Before, it simply assumed that signifiers preceded the relevant variable, but that was only a temporary measure for the class demo. My new solution is an added function that runs after I populate the DataFrame with entity information. I added columns to the df for position (start, end) and signifier, which defaults to null. This information is enough to assign signifiers in the new function, which assigns them based on the order of entities. If the order alone is not enough information (e.g., a signifier sits between TIME and DATE ents), it uses the start and end positions to make a decision. The output is a df with no signifier entities; instead, each signifier is assigned to another entity via the df column 'signifier'.
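A simplified sketch of the assignment pass (column names are illustrative, and the positional TIME/DATE tiebreak is omitted here):

```python
import pandas as pd

def assign_signifiers(ents_df):
    """Attach each SIGNIFIER row to a neighboring entity, then drop it.

    Default rule: a signifier modifies the entity that follows it
    ("under 90 minutes"); otherwise it falls back to the previous one.
    """
    df = ents_df.sort_values("start").reset_index(drop=True)
    for _, row in df.iterrows():
        if row["label"] != "SIGNIFIER":
            continue
        after = df[(df["start"] > row["end"]) & (df["label"] != "SIGNIFIER")]
        before = df[(df["end"] < row["start"]) & (df["label"] != "SIGNIFIER")]
        target = after.index[0] if len(after) else (before.index[-1] if len(before) else None)
        if target is not None:
            df.at[target, "signifier"] = row["text"]
    return df[df["label"] != "SIGNIFIER"].reset_index(drop=True)

demo = pd.DataFrame([
    {"text": "under", "label": "SIGNIFIER", "start": 9, "end": 14, "signifier": None},
    {"text": "90 minutes", "label": "TIME", "start": 15, "end": 25, "signifier": None},
])
print(assign_signifiers(demo))  # the TIME row now carries signifier "under"
```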

I now need to work through the task of determining whether stage 2 is necessary. Obviously, if no entities are collected, stage 2 is needed. Another obvious case is when the prompt is completely covered by entities (example: "70s comedies"), in which case stage 2 isn't needed. It becomes more challenging when the prompt intends to hit both stages (example: "movies with a lot of guns from the 2010s"). I have built a function that extracts data to aid this decision: promptLength (length of the prompt in words), numEnts (count of labeled ents), entityCoverage (ratio of the prompt covered by labeled ents), and unlabeledTokens (list of words not labeled as ents). I had thought about incorporating another ML component to classify intent, but I believe these features provide enough information for a logic-based triggering function. Additionally, it would require a lot of time and effort to properly train a model that could very well end up no more accurate than my current solution. The finished function uses a rule-based approach driven by entity coverage and the unlabeled tokens (filtered to remove stop words, i.e., words with no contextual value).
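A sketch of the feature extraction and trigger rule; the stop-word list is heavily truncated and the thresholds are placeholders:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for my merged pipeline

STOP_WORDS = {"movies", "films", "with", "a", "lot", "of", "from", "the"}  # truncated

def stage2_features(doc):
    """The features my rule-based trigger consumes (simplified)."""
    tokens = [t for t in doc if not t.is_punct]
    labeled = [t for t in tokens if t.ent_type_]
    unlabeled = [t.text for t in tokens
                 if not t.ent_type_ and t.text.lower() not in STOP_WORDS]
    return {
        "promptLength": len(tokens),
        "numEnts": len(doc.ents),
        "entityCoverage": len(labeled) / len(tokens) if tokens else 0.0,
        "unlabeledTokens": unlabeled,
    }

def needs_stage2(feats):
    """Fire stage 2 when no entities were found or meaningful words remain."""
    return feats["numEnts"] == 0 or len(feats["unlabeledTokens"]) > 0

feats = stage2_features(nlp("movies with a lot of guns from the 2010s"))
print(needs_stage2(feats))  # True: "guns" is left uncovered
```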

Week 11 (4/14-4/18)

Ranking System

This week was spent on establishing the final ranking logic for the model. The three model pipelines are structured filtering, semantic matching, and a hybrid approach (structured and semantic). Each of these pipelines behaves differently and requires unique considerations for how they rank results. The pipelines that handle semantic matching are complicated because their rankings factor in both similarity and rating/popularity. The structured filtering pipeline is easier to rank, but a few special cases must be accounted for.

Genre Concern: A number of movies span a wide range of genres, meaning they are classified under several categories. Because of this, I need a way to distinguish movies that squarely fit a genre from those that merely touch it. If someone is searching for a comedy, it doesn't make sense to list something like "Forrest Gump" or "Parasite" above "The Hangover", despite the fact that they are rated higher and labeled as comedies. After looking at the data, I think it is viable to create a "genre confidence" feature, which will be higher for movies that fall into a single category. This feature can serve as a metric in the ranking logic, boosting movies with clearly defined genres.
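One simple way to compute such a feature (a sketch, not necessarily my final formula) is to make confidence inversely proportional to the number of genres a movie carries:

```python
def genre_confidence(genres):
    """Score how squarely a movie sits in its genres.

    A single-genre movie scores 1.0; one spread across four genres
    scores 0.25. A weighted variant could favor the primary genre.
    """
    return 1.0 / len(genres) if genres else 0.0

print(genre_confidence(["Comedy"]))                      # 1.0, e.g., The Hangover
print(genre_confidence(["Comedy", "Drama", "Romance"]))  # ~0.33, e.g., Forrest Gump
```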

Hybrid Model Concern: I am noticing a type of query in the hybrid approach that I need to account for. The model already does a great job of removing contextually irrelevant words before semantic matching, but occasionally certain words are left behind that shouldn't be. In every case I have noted, only a single word is left behind. I considered a number of solutions, including simply adding these words to the ignore list, but that is unwise and certain to leave gaps. Instead, I have settled on a solution that itself uses semantic matching. The idea is to create a wordbank of contextually relevant words (like "apocalypse", "alien", or "europe") and compare the leftover word to these semantically. If a match is above the threshold, stage 2 is triggered; otherwise, the leftover word is ignored and stage 2 is not triggered. After testing this idea with the same semantic model, I am impressed by the results. Although this solution isn't perfect, it's better than maintaining a list of irrelevant words that will inevitably grow over time.
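A sketch of the wordbank check, reusing the same SBERT approach; the bank contents and threshold are placeholders:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

# Toy wordbank of contextually meaningful terms; the real one is larger.
WORDBANK = ["apocalypse", "alien", "europe", "heist", "war", "space"]
bank_vecs = model.encode(WORDBANK, convert_to_tensor=True)

def leftover_triggers_stage2(word, threshold=0.5):
    """Trigger stage 2 only if the leftover word is semantically close
    to something in the wordbank; otherwise ignore it."""
    word_vec = model.encode(word, convert_to_tensor=True)
    best = util.cos_sim(word_vec, bank_vecs).max().item()
    return best >= threshold
```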

Industry Project Update: We are now far along on the project and have decided on our final approach/deliverable. Our lack of data for the class we are predicting has brought several challenges, but also valuable experience with handling flawed data. Our main goal is to maximize the value of this data in both modeling and analysis.