Case Study: Netflix Big Data Analytics and the Emergence of Data-Driven Recommendation

durafshan jawad
12 min read · Dec 13, 2022


Figure 1: Introduction to Netflix Case Study

1. Introduction

Netflix is one of the largest online streaming media providers in the market today. It started in 1997 by selling DVDs and renting them out by mail. With the passage of time and changes in the market and user needs, Netflix shifted its business model to video streaming. Today many other platforms, such as Hulu, ESPN+ and Disney+, provide good content too; in order to stay competitive and attract customers, Netflix uses big data analytics to power its recommendation system. This system recommends movies and shows to customers based on their interests and needs. Netflix uses the big data collected from subscribers, such as a user's location, the content they watch, the terms they search for and the time of day at which they watch, to provide the best service and products to its customers. Based on these data parameters, algorithms are trained to deliver the best user experience. Today Netflix has about 203.67 million subscribers worldwide [1], making it the largest streaming media provider in the market since the launch of its streaming service in 2007. This case study focuses mainly on how Netflix uses big data to enhance its recommendation system, along with some marketing and business strategies/suggestions based on SWOT and PESTLE analyses.

2. Challenges

2.1 Data Sources

The main data sources used for this case study are the Netflix Inc. website, the Blockbuster Inc. website, big data analytics blogs, recommendation system blogs and multiple conference and journal articles related to big data and recommendation systems. Google Scholar was the primary search engine used to retrieve the literature relevant to the case study. For its recommendation systems, Netflix draws on billions of ratings from its users. It also uses geographic data such as location, time and date, along with user interest data such as the content a user signed up for, the device used for streaming, the points at which a show was paused, resumed or fast-forwarded, and whether it was watched repeatedly. Netflix also collects data points such as how long it takes a user to finish a show and whether it was watched at normal or 2x speed. Whether the whole movie was watched, whether any scenes were replayed, whether the user rated the show or gave it a thumbs up, and the user's browsing history are all recorded and act as data sources for the algorithms. All of this data is derived from user behavior and interests, and it feeds a two-tiered row-based ranking system, where ranking happens both within each row and across rows.
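The two-tiered row-based ranking can be sketched roughly as follows; the scoring function, row names and scores below are invented placeholders for illustration, not Netflix's actual model:

```python
# Sketch of two-tiered row-based ranking: titles are ranked within each
# candidate row (tier 1), then the rows themselves are ordered by the
# score of their best title (tier 2). All scores here are made up.

def rank_homepage(rows):
    """rows: dict mapping row name -> list of (title, score) pairs."""
    # Tier 1: rank titles within each row by descending score.
    ranked_rows = {
        name: sorted(titles, key=lambda t: t[1], reverse=True)
        for name, titles in rows.items()
    }
    # Tier 2: rank rows by the score of their top title.
    return sorted(ranked_rows.items(), key=lambda kv: kv[1][0][1], reverse=True)

rows = {
    "Because you watched X": [("A", 0.7), ("B", 0.9)],
    "Trending": [("C", 0.95), ("D", 0.4)],
}
homepage = rank_homepage(rows)  # best row first, best title first in each row
```

A real system would derive the per-title scores from the behavioral signals described above rather than hard-coding them.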

2.2 Size Of Data

In 2006, Netflix organized its first competition to create an algorithm able to provide recommendations based on customer interests. For this purpose, Netflix released a dataset containing about 100 million ratings provided by 480 thousand users for 17 thousand movies. The broader data underlying the recommendation system is reported to cover approximately five billion titles [2].
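Ratings at that scale are usually stored sparsely, as (user, movie, rating) triples rather than a full 480,000 × 17,000 matrix, since only a small fraction of entries are observed. A minimal sketch with invented sample triples:

```python
# Sparse representation of a ratings matrix: only observed
# (user, movie, rating) triples are stored. Sample data is invented.
ratings = [
    (0, 10, 5.0),  # user 0 rated movie 10 with 5 stars
    (0, 42, 3.0),
    (1, 10, 4.0),
]

# Index by user for fast per-user lookup.
by_user = {}
for user, movie, rating in ratings:
    by_user.setdefault(user, {})[movie] = rating

# Fraction of the full 480,000 x 17,000 matrix these triples would fill.
density = len(ratings) / (480_000 * 17_000)
```

With the actual 100 million ratings the density is still only around 1%, which is why sparse storage and sparse-aware algorithms are essential.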

2.3 Data Storage

The data generated by Netflix increased by 1,000% from 2016 to 2019. Exact figures for how much storage this project uses are not given, but we can imagine how much storage is required to hold the data generated by 760,000 hours of Netflix viewing per minute globally, which feeds the recommendation systems. The figure below shows some statistics about data generated during the quarantine period.

Figure 2: Data generated by Netflix during Quarantine

2.4 Data Ownership

There is no definitive answer to this question, as on the internet it is complicated to say who owns data. But since the data is collected and provided by Netflix, Netflix as a company owns it. Netflix decides whether to make the data public or private; for example, some of it is available on Kaggle for research purposes.

2.5 Data Access Rights & Data Privacy Issues

To train the algorithm, Netflix used data from almost 480,000 users in 2006, which successfully improved the algorithm's predictions by 10%. The company released 100 million anonymized movie ratings, each including a unique ID, the movie title, the year of release and the date on which the subscriber rated the movie. Shortly afterwards, researchers at the University of Texas identified information about Netflix users they knew in that dataset. Through reverse engineering, they showed that one could recover a user's viewing history from the data, which violates privacy laws. When Netflix wanted to launch a second competition in 2010 using personal information such as age and gender, it was sued and had to cancel the contest: one woman filed a suit claiming she had been harassed because her information was exposed in the competition dataset [3].

2.6 Data Quality Issues

In my opinion, a training model needs clean data, and big data is hard to clean, which reduces data quality. Also, if one account is shared by two users and they rate the same movie differently (for example, one gives it a 5 and the other a much lower score), the dataset contains conflicting entries that are not simply redundant. Such conflicting views of the same movie decrease data quality, because the model cannot reliably infer the user's preferences.

2.7 Faced Organizational Challenges

Netflix started by renting DVDs to customers without having physical stores, which kept it at a loss for some time, since shipping DVDs to houses incurred a high cost per customer. It then moved to a subscription model in which customers rented DVDs in exchange for a subscription fee. Netflix also did not charge late-return fees, so in its early days it suffered heavy losses. The other challenge was retaining customers, as competitors like Hulu and ESPN+ offer good content too. For this purpose, Netflix introduced its recommendation system and invested heavily in big data analytics to recommend the best content to its users.

2.8 Technical challenges

The main technical challenges Netflix faced during this project were:

● How to increase the quality of the training algorithms.

● Collecting, compiling, cleaning and training on big data requires sophisticated tools that are fast and computationally efficient.

● Since no single algorithm is used, the outputs of multiple algorithms are cascaded to produce one final result, and this cascading requires a lot of resources.

● The goal of the contest was to achieve an RMSE of 0.8563 or lower on the quiz subset, which was hard to achieve.

● Avoiding training the model past the minimum error point on the probe set was a challenge.

● Training and optimizing the predictors.

● Replacing the single neural network blend with an ensemble of blends.

● The sheer range of the data was a problem: some users in the set had rated only 9 movies (the minimum for inclusion in the model), giving very little information about their preferences, while other users had rated a great many movies, making the data unbalanced.

● Training 100+ models in parallel while tuning all parameters simultaneously is not feasible.
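RMSE, the contest's evaluation metric, is straightforward to compute; here is a minimal sketch with invented predictions and ratings:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    assert len(predicted) == len(actual)
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Invented example: predictions off by exactly 1 star on every rating
# give an RMSE of 1.0.
score = rmse([3.0, 4.0, 2.0], [4.0, 3.0, 1.0])
```

Lowering this number from Cinematch's baseline down to 0.8563 on the quiz subset was the whole point of the contest; a hundredth of a point took teams months of work.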

3. Stakeholders

3.1 People/organizations interested in the conduct and outcome of the study

This study was conducted by Srivatsa Maddodi and Krishna Prasad K. of Srinivas University, who are its primary stakeholders. Anyone reading the case study could be a secondary stakeholder through their interest in the topic. Viewed more broadly, the stakeholders of any Netflix study are its viewers and subscribers, since the purpose of each study is to enhance the business model for them; this recommendation system exists to recommend shows to subscribers, which makes them primary stakeholders. Secondary stakeholders include the employees of the Netflix research team working on these algorithms. Competitors like Hulu, Disney+ and ESPN could also be interested in such studies and their outcomes, so they act as secondary stakeholders too. The competition itself was won by BellKor's Pragmatic Chaos, whose members are stakeholders as well.

4. Requirements

4.1 Hardware & Software Resources

Some of the hardware/software resources used were:

● Netflix stores all of its data in Amazon Web Services S3 storage, which enables it to spin up multiple Hadoop clusters for different workloads that access the same data.

● It uses Teradata as its relational data warehouse but plans to move to Amazon Redshift in the future.

● Tools like Hive for ad hoc queries and analytics, Pig for ETL and algorithms, and Java-based MapReduce for complex algorithms form the backbone of Netflix's data platform.

● Python is the language used for scripting various ETL processes and Pig user-defined functions.

● Netflix uses Amazon's Elastic MapReduce (EMR) as its Hadoop distribution.

● Apache Chukwa, which is built on top of HDFS and the MapReduce framework, is used by Netflix as a collection system for gathering logs and events from its distributed systems [4].

4.2 Commercially & internally developed resources

Following are the internally developed resources for these recommendation models:

● Netflix is implementing Hadoop Platform as a Service (Genie), which provides a higher level of abstraction and enables administrators to manage and abstract away the configuration of various back-end Hadoop resources in the cloud [5].

● Netflix also created Hermes, which aimed at providing subtitles on Netflix [6].

4.3 People & expertise resources

Netflix is a data-driven company, and almost 80% of content discovery is driven by recommendation algorithms trained by expert data engineers. According to statistics, Netflix has had almost 800 data engineers working in Silicon Valley with expertise in machine learning, deep learning, artificial intelligence and related fields. As for the case study itself, the Netflix competition launched in 2006 was won by 'BellKor's Pragmatic Chaos', a combined team of BellKor, Pragmatic Theory and BigChaos. BellKor consisted of Robert Bell, Yehuda Koren and Chris Volinsky; the members of Pragmatic Theory were Martin Piotte and Martin Chabbert; and Andreas Töscher and Michael Jahrer joined from the team BigChaos [2].

5. Time Duration

5.1 Approximate project schedule & duration

The data used in training was collected over a span of seven years, and the competition ran from 2006 until 2009, when BellKor's Pragmatic Chaos was declared the winner.

6. Results & Findings

6.1 Results & answers achieved

Netflix was one of the early adopters of big data analytics. This use of big data analytics has resulted in customer satisfaction that makes Netflix one of the largest online streaming platforms. Data analytics helped Netflix produce original content like House of Cards, which turned out to be the company's first super-hit attempt at using a data-driven approach to content creation [7]. The recommendation and personalization system helps Netflix save billions of dollars a year through lower cancellation rates. Today, 80% of the content users watch comes from the outputs of recommendation algorithms.

In terms of the case study's project, the winning algorithm was able to increase prediction accuracy by 10.6% and decrease RMSE to about 0.88. Individually, SVD reduced the RMSE to 0.8914, whereas RBM reduced it to 0.8990. The Netflix Prize also boosted collaborative filtering research.

6.2 Successfulness of Project

Yes, the project turned out to be successful: the winning algorithm improved prediction accuracy by 10.6% and reduced the RMSE to about 0.88 (individually, SVD reached an RMSE of 0.8914 and RBM 0.8990), and the Netflix Prize boosted collaborative filtering research. These recommendation algorithms help Netflix save billions of dollars every year, and they have attracted more customers, generating huge revenue.

6.3 Discovered surprises

Some new findings are listed below:

● Expanding the training set by including the probe set in the second attempt let the predictors achieve a quiz RMSE 0.0030 to 0.0090 better than their probe RMSE.

● The predictors that achieve the best blending results are those that strike the right balance between being uncorrelated with the rest of the ensemble and achieving a low RMSE individually.

● Each predictor achieved its best results when it was blended with all preceding ones.

● Movies to be rated are selected deliberately by the user and are not a random sample.

● Models based on matrix factorization were found to be the most accurate.

● Moreover, SWOT and PESTLE analyses were carried out in the case study, giving good insights.

6.4 Learnt Lessons from Project

Training the models individually or in parallel does not give the best results; rather, using the output of one algorithm as the input of another is the most effective method, so ensemble techniques are the best approach. Not all data features are useful: in this project a vast variety of features was extracted, but only some captured the user's perspective, and the rest did not increase the accuracy of the tuned collaborative filtering model.
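The blending lesson above can be sketched as a simple weighted combination of base predictors; the model outputs and weights below are invented (the actual Netflix Prize teams learned blend weights on a held-out probe set):

```python
# Linear blending sketch: combine the predictions of several base models
# with fixed weights. Predictions and weights are invented examples.

def blend(predictions, weights):
    """predictions: list of per-model prediction lists; weights sum to 1."""
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions))
        for i in range(len(predictions[0]))
    ]

svd_preds = [3.8, 2.1, 4.6]  # invented outputs of an SVD-style model
rbm_preds = [4.0, 2.5, 4.2]  # invented outputs of an RBM-style model
blended = blend([svd_preds, rbm_preds], weights=[0.6, 0.4])
```

The key insight from the contest was that the blend improves most when the base models are individually accurate yet mutually uncorrelated, so their errors partially cancel.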

6.5 Actions taken as a result of the project

As a result of the project, these algorithms, in combination with others, were used to create personalized homepages for 100 million users. The code that won the competition is still used by Netflix to develop advanced recommendation models. Thanks to the project's success, almost 80% of the movies recommended to users today come from these recommendation models.

6.6 Value to the organization & stakeholders

As a result of the project, the winning team, BellKor's Pragmatic Chaos, one of the secondary stakeholders, won a $1 million prize after years of hard work. Viewers, who are also stakeholders of this project, benefited because browsing their favorite shows on Netflix became easier, making Netflix a more user-friendly platform than its competitors. As an organization, these recommendation models help Netflix attract many subscribers, which makes it the largest streaming company in the world. The models save Netflix about a billion dollars every year and increase its revenue [8].

7. Critique

7.1 Ideas for project improvement

Netflix uses A/B tests to personalize the user experience; whatever is shown on the platform (e.g., artwork images) is driven by data collected through A/B testing. The procedure and steps of A/B testing could be improved by evaluating through real-world circumstances rather than purely algorithmically. Netflix could also use reinforcement learning algorithms to provide recommendations, as opposed to the traditional recommendation system methodology: the reward can be user satisfaction, the state can be the current content, and the action can be the next best content recommendation [8].
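The reinforcement-learning suggestion can be sketched as a simple epsilon-greedy bandit, where each action is a candidate title and the reward is a simulated user-satisfaction signal; everything below (titles, satisfaction probabilities, hyperparameters) is invented for illustration:

```python
import random

# Epsilon-greedy bandit sketch: pick titles, observe a simulated
# satisfaction reward, and learn the average reward per title.
random.seed(1)
titles = ["show_a", "show_b", "show_c"]
true_satisfaction = {"show_a": 0.2, "show_b": 0.8, "show_c": 0.5}  # hidden

counts = {t: 0 for t in titles}
values = {t: 0.0 for t in titles}  # running mean reward per title

def choose(epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(titles)        # explore a random title
    return max(titles, key=values.get)      # exploit the best so far

for _ in range(5000):
    t = choose()
    reward = 1.0 if random.random() < true_satisfaction[t] else 0.0
    counts[t] += 1
    values[t] += (reward - values[t]) / counts[t]  # incremental mean

best = max(titles, key=values.get)
```

A full recommender would condition the choice on state (what the user is currently watching), turning this into a contextual bandit or Markov decision process, but the reward/action framing is the same.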

8. Technical Terms

8.1 Collaborative Filtering

This kind of recommendation system is based on similar user profiles. To build a subscriber profile, the recommendation system focuses mainly on two pieces of information: the subscriber's preferences and the subscriber's history. For instance, if subscriber A watches crime, action and horror movies and subscriber B watches crime, action and comedy movies, then subscriber A is likely to enjoy comedy movies and subscriber B is likely to enjoy horror movies.
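This idea can be sketched with a tiny user-based collaborative filtering example; the Jaccard similarity on genre sets and the profiles below are illustrative choices, not Netflix's actual method:

```python
# User-based collaborative filtering sketch: find the most similar other
# user (Jaccard similarity on watched-genre sets) and suggest what they
# watched that the target has not. Profiles are invented examples.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def recommend(target, profiles):
    """Suggest genres watched by the most similar other user."""
    others = {u: g for u, g in profiles.items() if u != target}
    nearest = max(others, key=lambda u: jaccard(profiles[target], others[u]))
    return others[nearest] - profiles[target]

profiles = {
    "A": {"crime", "action", "horror"},
    "B": {"crime", "action", "comedy"},
    "C": {"romance"},
}
suggestion = recommend("A", profiles)  # B is A's nearest neighbor
```

This mirrors the example in the text: A's nearest neighbor is B, so A is recommended comedy, and symmetrically B would be recommended horror.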

8.2 Restricted Boltzmann Machines

A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that learns a probability distribution over its set of inputs. RBMs can model tabular data such as user ratings of movies, so they are frequently used in recommendation systems. Most existing approaches to collaborative filtering cannot handle very large datasets, but when they are combined with RBMs, the overall error rate can be lower.

8.3 Matrix Factorization

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. It works by decomposing the user-item interaction matrix of ratings into the product of two lower-dimensional rectangular matrices.
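A minimal sketch of matrix factorization trained by stochastic gradient descent on invented ratings follows; a real system would add biases, regularization and far more latent factors:

```python
import random

# Matrix factorization sketch: learn rank-2 user and item factor vectors
# whose dot products approximate observed ratings. Ratings, rank and
# learning rate are invented for illustration.
random.seed(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
n_users, n_items, k, lr = 2, 3, 2, 0.05

U = [[random.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
V = [[random.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    return sum(U[u][f] * V[i][f] for f in range(k))

# Stochastic gradient descent on squared error of random observed ratings.
for _ in range(4000):
    u, i, r = random.choice(ratings)
    err = r - predict(u, i)
    for f in range(k):
        U[u][f], V[i][f] = (U[u][f] + lr * err * V[i][f],
                            V[i][f] + lr * err * U[u][f])

max_err = max(abs(r - predict(u, i)) for u, i, r in ratings)
```

The learned factors also produce predictions for the unobserved user-item pairs, which is exactly what a recommender needs.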

9. References

[1] Data Science at Netflix: How Advanced Data & Analytics Helps Netflix Generate Billions. Retrieved on 03/30/2021 from: https://www.aidataanalytics.network/data-science-ai/articles/data-science-at-netflix-how-advanced-data-analytics-helped-netflix-generate-billions

[2] The BigChaos Solution to the Netflix Grand Prize. Retrieved on September 5, 2009 from: https://www.asc.ohio-state.edu/statistics/statgen/joul_aut2009/BigChaos.pdf

[3] 5 data breaches: From embarrassing to deadly. Retrieved on December 14, 2010 from: https://money.cnn.com/galleries/2010/technology/1012/gallery.5_data_breaches/index.html

[4] Netflix System Architecture. Retrieved on Nov 4, 2021 from: https://medium.com/interviewnoodle/netflix-system-architecture-bedfc1d4bce5

[5] Hadoop Platform as a Service in the Cloud. Retrieved on Jan 10, 2013 from: https://netflixtechblog.com/hadoop-platform-as-a-service-in-the-cloud-c23f35f965e7

[6] Netflix HERMES translation test completed by thousands globally. Retrieved on March 31, 2017 from: https://indianexpress.com/article/technology/social/netflix-hermes-translation-test-completed-by-thousands-globally-4594100/

[7] How Big Data Helped Netflix Series House of Cards Become a Blockbuster? Retrieved in 2019 from: https://sofy.tv/blog/big-data-helped-netflix-series-house-cards-become-blockbuster/

[8] Netflix Recommender System: A Big Data Case Study. Retrieved on Jun 28, 2020 from: https://towardsdatascience.com/netflix-recommender-system-a-big-data-case-study-19cfa6d56ff5
