Introduction
In the digital age, streaming services like Netflix continue to evolve by offering diverse content. To enhance user experience, machine learning-based recommendation systems play a crucial role in providing content tailored to individual preferences.
This project develops a content-based recommendation system that uses both supervised and unsupervised learning algorithms to identify patterns in the data and suggest relevant movies and TV shows to users.
Methodology
1. Data Preprocessing
Before building the model, data preprocessing was performed to ensure data quality:
- Handling missing values in columns such as director, cast, country, date_added, rating, and duration.
- Removing irrelevant attributes like duration and date_added.
- Encoding categorical variables using Label Encoding and Bag of Words to transform text into numerical representations.
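The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a made-up three-row table, not the project's actual code: the sample values, the "Unknown" fill strategy, and the `rating_code` column name are all assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for the Netflix dataset; every value here is illustrative.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", None, "John Roe"],
    "cast": [None, "Actor One, Actor Two", "Actor Three"],
    "country": ["United States", None, "India"],
    "date_added": ["2020-01-01", None, "2019-05-05"],
    "rating": ["PG-13", "TV-MA", None],
    "duration": ["90 min", "2 Seasons", None],
})

# Fill missing text fields with a placeholder (an assumed strategy).
text_cols = ["director", "cast", "country", "rating"]
df[text_cols] = df[text_cols].fillna("Unknown")

# Drop the attributes the report treats as irrelevant.
df = df.drop(columns=["duration", "date_added"])

# Label-encode a categorical column into integer codes.
# "rating_code" is a hypothetical column name used only in this sketch.
df["rating_code"] = LabelEncoder().fit_transform(df["rating"])
```

Label Encoding assigns integers in sorted order of the class labels, so the codes carry no meaning beyond identity. The Bag of Words step for free-text fields is a separate transformation.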
2. Exploratory Data Analysis (EDA)
Initial analysis of the Netflix dataset revealed several interesting patterns:
- By content type, 69.6% of Netflix titles are movies and 30.4% are TV shows.
- The United States produces the most content, with 2,819 titles.
- Highly correlated genres include Action with TV Action & Adventure and Romantic Movies with Romantic TV Shows.
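Figures like the ones above could be derived with a few pandas calls. The sketch below runs on a made-up four-row table; the column names `type`, `country`, and `listed_in` are assumptions about the dataset layout, and the numbers it produces are unrelated to the reported results.

```python
import pandas as pd

# Illustrative data only; column names are assumed, not confirmed by the report.
df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "country": ["United States", "India", "United States", "United States"],
    "listed_in": ["Action, Comedies", "Action", "TV Dramas", "Comedies"],
})

# Share of each content type, as percentages.
type_share = (df["type"].value_counts(normalize=True) * 100).round(1)

# Number of titles per country and the country with the most titles.
country_counts = df["country"].value_counts()
top_country = country_counts.idxmax()

# One-hot encode the comma-separated genres, then correlate the genre columns.
genres = df["listed_in"].str.get_dummies(sep=", ")
genre_corr = genres.corr()
```

Pairs of genres with a high value in `genre_corr` tend to be assigned to the same titles.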
3. Implementation of the Recommendation System
This project employs two primary approaches:
- Content-Based Filtering: Uses cosine similarity and bag-of-words to calculate movie similarity based on attributes such as directors, actors, genres, and country of origin.
- Clustering & Graph Representation:
  - K-Means Clustering groups movies with similar descriptions using CountVectorizer.
  - A NetworkX Graph is created where nodes represent movies, actors, directors, and genres, and relationships between entities are analyzed using cosine similarity.
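As a rough illustration of the content-based filtering approach, the sketch below builds a bag-of-words matrix from combined attributes and ranks titles by cosine similarity. The titles, people, and the `recommend` helper are hypothetical; the project's actual feature construction may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue; none of these titles or names come from the dataset.
titles = pd.DataFrame({
    "title": ["Heist One", "Heist Two", "Small Town Mystery", "Cooking Show"],
    "director": ["Ana Lima", "Ana Lima", "Fay Holt", "Hal Brun"],
    "cast": ["Cal Reyes, Dee Novak", "Cal Reyes, Eli Stone", "Gus Park", "Ivy Cole"],
    "genres": ["Action Crime", "Action Crime", "Mystery Adventure", "Reality"],
    "country": ["United States", "United States", "United States", "France"],
})

# Join the attributes into one text string per title.
soup = titles[["director", "cast", "genres", "country"]].agg(" ".join, axis=1)

# Bag-of-words counts, then pairwise cosine similarity between titles.
counts = CountVectorizer().fit_transform(soup)
similarity = cosine_similarity(counts)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index[titles["title"] == title][0]
    ranked = similarity[idx].argsort()[::-1]  # most similar first
    return [titles.loc[i, "title"] for i in ranked if i != idx][:top_n]
```

Calling `recommend("Heist One", top_n=1)` on this toy catalogue returns the other heist title, since the two share a director, an actor, genres, and country.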
For example, the system provides top recommendations for:
- Ocean’s Twelve → Movies with similar action and crime elements.
- Stranger Things → TV shows with mystery and adventure themes.
Network Analysis
[Figure: Top Recommendation for Ocean's Twelve]
[Figure: Top Recommendation for Stranger Things]
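The graph representation described in the methodology (nodes for titles, directors, actors, and genres) could be sketched as follows. The records, the `kind` and `relation` attributes, and the `neighbor_cosine` helper are assumptions for illustration. In particular, computing cosine similarity over shared neighbors is only one plausible reading of how the relationships are analyzed.

```python
import math
import networkx as nx

# Hypothetical records; names are illustrative, not taken from the dataset.
records = [
    {"title": "Heist One", "director": "Ana Lima",
     "cast": ["Cal Reyes"], "genres": ["Crime"]},
    {"title": "Heist Two", "director": "Ana Lima",
     "cast": ["Cal Reyes", "Eli Stone"], "genres": ["Crime"]},
    {"title": "Quiet Town", "director": "Fay Holt",
     "cast": ["Gus Park"], "genres": ["Mystery"]},
]

G = nx.Graph()
for rec in records:
    G.add_node(rec["title"], kind="title")
    G.add_node(rec["director"], kind="director")
    G.add_edge(rec["title"], rec["director"], relation="directed_by")
    for actor in rec["cast"]:
        G.add_node(actor, kind="actor")
        G.add_edge(rec["title"], actor, relation="acted_in")
    for genre in rec["genres"]:
        G.add_node(genre, kind="genre")
        G.add_edge(rec["title"], genre, relation="categorized_as")


def shared_neighbors(a, b):
    """Entities (director, actors, genres) connected to both titles."""
    return sorted(nx.common_neighbors(G, a, b))


def neighbor_cosine(a, b):
    """Cosine similarity between the neighbor sets of two nodes."""
    na, nb = set(G[a]), set(G[b])
    if not na or not nb:
        return 0.0
    return len(na & nb) / math.sqrt(len(na) * len(nb))
```

Two titles that share a director, an actor, and a genre end up with a high neighbor cosine, while titles with no entities in common score zero.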
Model Evaluation
Various machine learning models were tested to measure performance:
| Model | Metric | Value |
|---|---|---|
| KNN | Accuracy | 1.00 |
| Decision Tree | Accuracy | 0.8638 |
| Random Forest | Accuracy | 0.8383 |
| Logistic Regression | Accuracy | 0.3990 |
| Naive Bayes | Davies-Bouldin Index | 1.310 |
| K-Means Clustering | Davies-Bouldin Index | 1.451 |
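Accuracy figures of this kind are typically obtained by fitting each classifier on a training split and scoring it on held-out data. The sketch below shows that pattern on synthetic data from `make_classification`. The split ratio, hyperparameters, and data are assumptions, so its scores are unrelated to the values in the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded Netflix features and target labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Default-ish hyperparameters, chosen only for illustration.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit on the training split, score on the held-out test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```

Scoring on a held-out split matters here: accuracy measured on the training data itself can reach 1.00 for models such as KNN without reflecting how well they generalize.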
Evaluation Insights:
- KNN achieved the highest accuracy (100%), making it highly effective for recommendation purposes.
- Decision Tree and Random Forest performed well, with accuracy above 83%.
- Logistic Regression performed poorly (39.9% accuracy), likely because the relationships between features are non-linear.
- Naive Bayes and K-Means Clustering produced reasonably well-separated groupings, with Davies-Bouldin Index values of 1.310 and 1.451 respectively (lower values indicate better separation).
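For reference, a Davies-Bouldin Index for a K-Means clustering of text descriptions could be computed along these lines. The six descriptions, the choice of two clusters, and the stop-word setting are assumptions made for this sketch, and the resulting score is unrelated to the values reported above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import davies_bouldin_score

# Made-up descriptions standing in for the dataset's description column.
descriptions = [
    "a crew of thieves plans a casino heist",
    "thieves plan a daring bank heist",
    "a crew plans a museum heist",
    "kids investigate a mystery in a small town",
    "a small town hides a dark mystery",
    "friends investigate a mystery in their town",
]

# Bag-of-words features from the descriptions.
X = CountVectorizer(stop_words="english").fit_transform(descriptions)

# Cluster the descriptions; k=2 is an arbitrary choice for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Davies-Bouldin Index: average similarity of each cluster to its most
# similar cluster. Lower is better; 0 is the minimum.
dbi = davies_bouldin_score(X.toarray(), labels)
```

Because the index compares within-cluster scatter to between-cluster distance, it is most useful for comparing clusterings of the same data, for example across different values of k.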
Business Recommendations
Based on the findings, several business strategies can be implemented for the streaming industry:
- Personalized Content: Using KNN as the main model can improve user experience by providing more accurate movie recommendations.
- User Segmentation: Applying K-Means Clustering to group users by viewing preferences allows for targeted content recommendations.
- Marketing Optimization: Leveraging recommendations based on similarities in actors, directors, genres, or country of origin to engage users more effectively.
Conclusion
The recommendation system developed in this project utilizes multiple machine learning methods to improve the accuracy of movie suggestions for Netflix users. By employing content-based filtering and cluster analysis, the system offers a more personalized and relevant streaming experience.
