The Application of Linear Algebra to Recommender Systems:

Linear Regression and Singular Value Decomposition Approaches

SAMPLE FINAL REPORT

Professor E. G. Puckett.

MAT 167

Spring Quarter 2014

Table of Contents

Introduction and Motivationâ€¦â€¦â€¦â€¦â€¦. 1

Data Setâ€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦. 2

Methodsâ€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦. 2

Resultsâ€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦ 6

Summary and Conclusionsâ€¦â€¦â€¦â€¦â€¦â€¦. 9

Figuresâ€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦… 10

Tablesâ€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦ 15

Appendix: Codeâ€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦. 17

1

Introduction and Motivation

The ubiquity of Internet access and its ever-growing role in activities ranging from

commerce to personal leisure has created an explosion in data over the last decade. Much of this

data is gathered from user-generated behavior within a web browser whether or not the user is

aware of its collection. The quantity and richness of this data has helped foster demand for a userdriven model of online experiences, particularly with regard to hedonic pursuits.

As the realm of human experience has increasingly shifted from the physical space to the

digital, businesses have attempted to exploit data on website users in order to increase revenue or

provide a better service in order to retain loyal customers. With consumersâ€™ brief attention spans

mercilessly pulled from one corner of the Internet to the next, a businessâ€™ survival may depend upon

delivering the right material to the right people at the right time. A recommender system is one

vehicle by which this goal is accomplished. Such systems are algorithms that analyze data for

revealed user preferences and translate those preferences into targeted content. User data may come

in the form of clicks on links, time spent reading a given page, or ratings provided for a product.

Recommendation systems come in forms both subtle and explicit. Services such as Google

News analyze consumer interests by browsing habits and recommend news stories tailored to usersâ€™

tastes. Advertisers collect data on website visits in order to classify a user into pre-specified market

segments. Online shopping goliath Amazon.com records purchases and monitors previously viewed

items in order to suggest products that will catch a customerâ€™s eye. Streaming music providers like

Pandora use song or artist seeds to create a station suited to an individual listener. And, perhaps

most famously of all, Netflix makes use of movie-watching history and user ratings in order to

recommend movies the user has not seen but is likely to enjoy.

As the first online movie rental service to excel in its niche, Netflix developed a physical and

digital delivery system and unique business model that rapidly put pressure on its store-based

competitors that retained an antiquated and unpopular regime of due dates and late fees. Netflix has

remained competitive by, among other things, maintaining a state-of-the-art recommender system

that predicts preferences with reasonable accuracy in order to retain the interest of consumers and to

keep the service fresh and relevant. Netflix famously sought to enhance this recommender system

through a public competition. Though the Netflix Prize Contest ended in 2009, the idea of movie

recommender systems has remained a teaching point and a case study in the field of machine

learning.

Recommender systems are categorized by the kind of data on which they rely. So-called

content-based filtering systems build a profile of users and products; this kind of system relies on

the relationship between a single user and the available products. The user profile indicates the

kinds of product features a user prefers (e.g. a user watches mostly science fiction movies) and the

product profile indicates to what degree the product can be placed in a certain category (e.g. a movie

may be science fiction or a romantic comedy, but rarely both). In contrast, recommendations made

by a collaborative filtering system are formed by an analysis of similarities among users and

products; this system relies principally on the relationship of users to users or products to products,

rather than users to products. It is common in movie recommendation to use a hybrid contentbased and collaborative filtering system that relies on all the movies to receive at least one rating by

some user, and recommends movies to users based on their own preferences. This approach is

particularly vital for the regression and singular value decomposition approaches employed below.

2

Data Set

Although the Netflix Prize data, including nearly a half-million users and 18,000 movies,

were released for the duration of the competition, they are no longer publicly available.

Nevertheless the University of Minnesotaâ€™s GroupLens Research team has released a similar data set1

with ratings of 1682 movies by 943 users from the MovieLens database. These ratings, like those from

Netflix, come in integer values between 1 and 5, sometimes called the number of â€œstarsâ€ a user gives a

movie. Thus these data can be used in an identical way to the Netflix Prize data. The MovieLens data

come with a variety of additional information, including the genre of each movie in the database. While

there are ways to make use of information about movie features, this project is focused on specific

applications of linear algebra and will make predictions based only on known user ratings. Movie genres

will be discovered ex post, rather than entered ex ante.

Due to computational limitations it was impracticable to work with the entire MovieLens

data set. Even with its relatively small size of 1.5 million entries (an order of magnitude smaller than

that of the Netflix data), the rating matrix is simply too large to analyze in a reasonable amount of

time on a humble laptop computer. Instead I elected a much smaller subset of the data, choosing 20

movies and 200 users at random. The purpose of choosing such a small data set was to make the

results easy to interpret and present, and to keep the computations brief enough that they could be

repeated many times under different conditions to explore the effects of changing model

parameters. With enough resources, however, the methods described in this report could be

employed with every movie and user provided in the MovieLens data.

Methods

Why Matrix Factorization Works

The primary interest of the movie recommender system is to predict a userâ€™s ratings for

movies he has not seen, and recommend those that he is likely to enjoy. At the heart of this

method is the assumption that a userâ€™s rating for any given movie is a linear combination of a vector

describing the degree to which the movie contains certain characteristics and themes (the genre vector)

and the degree to which the user enjoys these characteristics (the preference vector). Each movie,

containing a unique combination of elements from different genres, has its own genre vector whose

elements correspond to the level of each characteristic; for example, a romantic comedy would

contain high values for humor, romance, and feel-good categories and low values for action or

thriller categories. Similarly, because of the uniqueness of personal preferences, each user has his or

her own preference vector; a hopeless romantic would have high values for the romantic comedy

characteristics just described and perhaps low values for the action and thriller categories.

In the context of movie ratings, matrix factorization methods essentially break down the

ratings matrix into component matrices representing movie genres and user preferences. Each

movie and each user is assigned a vector in feature-space (genres for the movies, preferences over

those genres for the users). When the interests of a user align with the genres of a movie, that user

is likely to enjoy the movie; geometrically, this is like a close alignment of the genre and preference

vectors in space. When user preferences do not agree with a movieâ€™s features, the genre and

preference vectors are orthogonal. Therefore, the inner product of the movieâ€™s genre vector with

the userâ€™s preference vector would provide a prediction about the userâ€™s enjoyment of the movie.

However, we do not have the right data in the first place: we lack both the genre vectors of

movies and preference vectors of users. All we have is a sparse rating matrix! Moreover, while we

1

Available at http://www.grouplens.org/node/73.

3

require genre and preference vectors, these are not our primary ends in movie recommendation;

rather, we want the predicted user ratings themselves. Because the vectors of interest are unknown

to us, these genre and preference parameters are called â€œlatentâ€ factors and we must discover them

by means of collaborative and content-based filtering. If there is at least one rating for every movie

by some user, then we can discover the genre vector that makes the best rating predictions. The

converse is also true: if there is at least one rating for every movie then we have enough data to

discover the preference vector that makes the best prediction.

Linear Regression Approach

Given the nature of MovieLens rating data, in which the diversity of user rating strategies

makes rating distributions highly non-normal and varied across users, the notion of multiple linear

regression for user rating predictions would seem something of a folly. Instead, however, the

method is quite powerful with the aid of a carefully controlled gradient descend algorithm.

Formally, we would represent the user-specific linear model as

where is the rating given by the

user for the

movie. The parameters stand for the

values of movie jâ€™s genre vector for genre g, where the (arbitrary) number of genres is G. The userâ€™s

preference vector is represented by the parameters with similar notation. (The intercept is

omitted for reasons that will become clear.) More generally, this linear model defines a (J*I) x 1

rating vector whose entries may be arranged in a J x I rating matrix Y with movies as rows and users

as columns, whose entries are defined by

where is the J x G matrix formed by stacking movie genre vectors and P is the I x G matrix

formed by stacking user preference vectors. In reality, this rating matrix is sparse, meaning that

most of its entries are empty because not every user has rated every movie. Thus the challenge is to

use a form of collaborative filtering in order to impute these missing ratings as accurately as possible.

The engine of linear regression is the minimization of the sum of squared errors (SSE), and

in order to get errors we need predictions. This problem is one of concurrently predicting the genre

and preference matrices and . In order to accomplish this, we define the sum of squared errors

as

âˆ‘ (

)

( )

where indicates the set of all pairs (i, j) for which user i has provided a rating for movie j, where a

bold character indicates a vector, and where is the actual rating provided by the user. Our

objective is to predict genres and preferences in a way that minimizes this error. Furthermore, we

are interested in obtaining small and realistic values of the genre and preference vectors.

One danger of unchecked simultaneous optimization is that the parameters may take any

values necessary to achieve the objective. This problem can lead to excessively large values of some

parameters and exceedingly small values of others, posing the problem of overfitting the model to a

specific data set where such parameters may produce good estimates but fail to generalize to other

samples. The simple solution is to penalize (or â€œregularizeâ€) the elements of the genre and

preference vectors so that the individual elements do not become too large. Therefore we let

denote the regularization parameter and define the objective function as

4

âˆ‘ (

)

( )

âˆ‘ âˆ‘

âˆ‘ âˆ‘

where we optimize with respect to genre and preference vectors simultaneously, controlling the size

of and by introducing squared terms into the cost function so that their expansion in the

positive or negative direction is checked by rapidly increasing penalties. The value of need not be

large, and in fact by the nature of the cost function the root-mean-squared error of the predictions

will increase as increases. Nevertheless, its presence in the objective is critical to a plausible

outcome.

With an infinite number of combinations of the parameters to test, the optimization is

computationally unwieldy without a well-defined rule for sniffing out the minimum. Rather than

map out every value of SSE in parameter space, we compute the error at a given value of our

parameters, we determine the slope of the SSE in that neighborhood, and we ride the slope

downward toward a lower error and repeat. In calculus the gradient of a function is defined as the

direction of steepest ascent for a given value of inputs and is computed as the vector of partial

derivatives of all variables evaluated at those inputs. Thus we employ a method known as gradient

descent to determine, at some point in parameter space, the appropriate change in parameters that

reduces the SSE most efficiently. This procedure employs a parameter known as the learning rate,

defined as the size of the â€œstepâ€ down the gradient before the next computation. A larger learning

rate will seek out the optimum more quickly, but too large a value will make too large a step and will

risk overshooting the minimum and increasing the parameters uncontrollably.

Because we predict both genres and preferences, the gradient descent is applied to both the

genre and preference vectors in each iteration of the algorithm, a process called the gradient descent

update. Using the partial derivatives of the objective function and the learning rate, the updates are

(âˆ‘(

)

)

(âˆ‘(

)

)

which define new values of the parameters

and

. These are then tested in the objective

function again and another round of gradient descent is conducted.

Thanks to the gradient descent update, we can in theory start from any point in parameter

space and, given enough iterations of the algorithm and assuming we do not fall into a local

minimum, we can work our way toward the global minimum of the objective function. In this way

the machine can â€œlearnâ€ or discover the optimal parameters, once we have specified an arbitrary

quantity of genres. This is also the reason that we do not need an intercept in our original linear

model: if an intercept term belongs in the model, the algorithm will adapt and learn this feature.

Therefore, we assemble the final algorithm in the following way: first, we initialize the values of the

genre and preference vectors to small random values; second, we run the gradient descent update

until the convergence of the objective function to its minimum; and finally we predict ratings with

the inner product of the genre and preference vectors for each user and movie. This process is

considered a low-rank matrix factorization procedure because a minimization is used to approximate

the actual user ratings while the predicted rating matrix is decomposed into preference and genre

matrices, so that the final prediction is given by

Ì‚ Ì‚ Ì‚

.

5

Without much surprise we note that this result is simply the estimated factorization proposed by our

linear model above. With some exploration of the modelâ€™s behavior and the right combination of

learning rate and regularization parameter, this method performs astoundingly well in terms of

minimizing the difference between prediction and actual rating. The predicted star rating will, of

course, not occur in integer values like the ratings supplied by users. This does not pose an obstacle:

in fact, Netflixâ€™s â€œbest guessâ€ for ratings has always come in the form of fractions of stars, and there

would be tremendous loss of information if predictions were rounded to the nearest star.

Singular Value Decomposition Approach

Many of the principles that make prediction effective in the case of linear regression also

apply to another commonly implemented recommendation technique, the singular value

decomposition (SVD). Like regression, the SVD also relies on discovering component matrices

whose product is the ratings matrix, and on a gradient descent update to refine those factors.

Furthermore the component matrices are the genre and preference matrices whose product is the

prediction of use ratings for a movie.

The singular value decomposition factors any real matrix A into three components: a matrix

of (orthogonal) eigenvectors of AAT

, a matrix of (orthogonal) eigenvectors of ATA, and a diagonal

matrix of the singular values (the square roots of the eigenvalues of ATA). This SVD can serve the

purpose of dimensionality reduction because it orders the eigenvectors and singular values by the

size of the singular values; that is, by the relative importance (i.e. amount of information stored) of

each vector. Then the most important vectors can be chosen and recombined to produce an

approximation of the original matrix. This is relevant to the movie rating matrix because the act of

discovering the latent genre and preference vectors is exactly the problem of dimensionality

reduction: insofar as the assumption holds that movies are categorized into genres and users may

have similar tastes, we may use the SVD to discover the hidden features that characterize these

similar movies and users into fewer groups.

Specifically, if we let our sparse rating matrix be the matrix A just described (filling in

missing values with the mean rating by movie), we can decompose it with the SVD. What emerges

is a complete picture of all possible genres (as many as there are movies) and all user preferences (as

many as there are users). In other words, by the SVD we obtain

where, as for linear regression, represents the matrix of movie genres and represents the matrix

of user preferences. Here, represents the diagonal matrix of singular values. Some of these genres

and preferences do not actually contain much information on the variety of the movies themselves,

but SVD has already done the job of sorting out the important genres by ordering the vectors

appropriately. Now we can choose the k most important genres by taking appropriate subsets of

the three component matrices.

With these subset matrices, we can derive a prediction for all the entries of the original

matrix A by multiplying the factors back together. This prediction will be reasonable for a first

approximation, but a gradient descent is still needed in order to refine the prediction. Specifically,

after subsetting the component matrices, we merge the genre matrix with the matrix of singular

values :

Ìƒ

where Ìƒ . We then proceed with an iterative gradient descent update process. In the SVD

setting we do not explicitly pursue the objective of minimizing the sum of squared errors as we do in

linear regression, but our updates are still guided by seeking the lowest possible difference between

predicted and actual ratings.

6

We define the predictive error (for all movies with a rating), , as

where, as above, we have the genre vector

,the preference vector

, and the actual rating . To

find the descent trajectory, we use the partial derivative of the squared sum of errors with respect to

each parameter. As before, we subtract this derivative (times a learning rate) from the original

parameter in order to update the genres and preferences. The updates are given by

âˆ‘(

)

âˆ‘(

)

where is again the learning rate. After the completion of the gradient descent update, we have a

predicted genre matrix

Ì‚Ìƒ and the predicted preference matrix Ì‚ which we multiply for the final

prediction,

Ì‚ Ì‚Ìƒ Ì‚

Results

Computational Challenges

In order to accomplish the gradient descent updates required for accurate prediction, the

algorithms implemented in this study compute anywhere from tens of thousands to hundreds of

thousands of matrix multiplications with each run of the custom functions. Thus the machine time

required for a single run can quickly escalate when very accurate predictions are desired. The

computation time is highly dependent both on the size of the matrix being computed and the

number of gradient descent iterations used. Moreover, both the matrix size and number of

iterations have a bearing on the accuracy of the final prediction; a larger matrix contains more

information about users and movies, and more iterations allows for a further descent into an error

minimum.

Figures 1 and 2 report the effect of matrix size and number of iterations on computation

time for a single run of both predictive approaches. Happily, Figure 1 shows that computational

time increased linearly in the number of matrix entries for both the regression and SVD approach,

indicating a stable function that operates as efficiently as it can as the matrix grows. The SVD

approach has a slight time advantage over the regression approach, an advantage that appears to

grow with the number of entries. This result indicates that, for my chosen default matrix of 20

movies and 200 users, a regression approach with 10,000 iterations requires about 5.5 seconds, while

an equivalent SVD approach requires about 4.5 second.

Not surprisingly, the two functions also exhibit linear time consumption trends in the

number of iterations in the gradient descent, as demonstrated in Figure 2. Here again the SVD

method had a slight advantage that widened as the number of iterations grew. However, the

regression algorithm could achieve more accurate results with fewer iterations, giving it a

comparative advantage in the end. In fact, regression could perform better with 2,000 iterations

than SVD could with 200,000 iterations. Thus regression was a more efficient algorithm.

7

Number of Iterations

With the root-mean-squared error (RMSE) as the judge of performance, we can quantify

how much better the regression approach performed. Both functions were run using successively

larger numbers of iterations ranging from 10,000 to 200,000. The RMSE was computed for each

and the results were plotted in Figure 3. Although regression has a slightly longer run time for any

given number of iterations, it has a vastly superior predictive capability. In fact, over the range

tested, SVD never outperformed regression at a given number of iterations. Moreover, the

regression approach appears to have settled into a local minimum at around 50,000 iterations, with

the very low RMSE of 0.07. In contrast, even at 200,000 iterations, the RMSE of the SVD approach

was still descending, presumably toward some minimum that it could achieve if given enough time.

It should be noted that, although these two functions employ similar gradient descent methods, they

start with very different matrices. Thus their descent path toward a minimum is also quite different,

a fact that may help explain their different behavior.

Learning Rate

There is another factor that causes the disparity in Figure 3, however. Recall that the

learning rate is a control parameter that determines how quickly the gradient descent moves in the

direction of minimum error. If this parameter is too small, the steps toward the best predictions are

also too small and finding the minimum is too slow. However, if the parameter is too large, the

steps will also be too large and will overshoot the minimum, possibly causing a gradient ascent away

from the minimum instead of a gradient descent toward the minimum.

Figure 4 plots the RMSE by value of the learning rate. Both methods were highly sensitive

to the value of the learning rate and pinning down a good range for the learning rate was difficult.

Finding the best rate was especially challenging for the SVD approach. Although Figure 4 does not

reflect it, the RMSE could not be resolved with a learning rate larger than about 8×10-5 in the SVD

case. The learning rates thus had to be quite small in order for SVD to return real results. This is

another reason why the SVD method required many more iterations than the regression method.

With a relatively smaller learning rate, the gradient descent required more time to locate the

minimum.

Number of Genres

One of the unique features of these recommender algorithms is the ability to decide how

many genres should fully characterize the movies in the data set. This is a fairly arbitrary decision

but it has implications for the performance of the models, as shown in Figure 5. Most notably, the

RMSE decreases in the number of genres, reaching its minimum when the number of genres equals

the number of movies. So what is the advantage of choosing fewer genres than movies? First, we

want to reduce computation time. In a matrix with thousands of movies, reducing the movies to a

few dozen genres can result in enormous time savings in practical implementations of these

algorithms. Second, we want to label movies. Rather than pay employees to research movies in

order to determine their genre designations, a business can instead allow the computer to discover

the genres after someone selects how many genres are likely to be present in the data set. Then the

movies can be labeled appropriately with genres for which they score highly. Third, and most

important for recommendation systems, we want to avoid the problem of overfitting. If we let

every movie represent its own â€œgenreâ€ then we are creating a system that makes good predictions

for that particular matrix, but will not generalize well to other rating matrices, or may perform a

good deal more poorly when new movies and users are added. And of course, in the case of the

8

SVD approach, the method has a built-in mechanism for discovering the most important genres.

After a point, there will be rapidly diminishing marginal benefits to additional genres.

Predictive Performance

When we use the right model parameters, we see a decrease in root-mean-squared error. But

the RMSE is just a number and can feel abstract. Instead, it is helpful to visualize how the right

choice of parameters leads to a better rating prediction. Figure 6 plots predicted ratings versus

actual user ratings for both â€œgoodâ€ parameters and â€œbadâ€ parameters. In both the regression and

SVD approaches, we see that bad parameters, leading to a high RMSE, create inaccurate predictions

with a wide variance. In contrast, good parameters allow predicted and actual ratings to align so that

they are nearly indistinguishable. In this case we can be confident that our predictions are accurate,

at least for movies that the user has already rated: the predictions conform exceedingly well with the

true values of known ratings. For movies without known ratings, the predictions can take on any

value in the rating scale, including non-integer numbers, and so do not indicate the exact star rating

the user would give (because the user is limited to an integer selection) but rather indicate the degree

to which a user is expected to enjoy a film. If a user were to watch one of these films and rate it, a

new implementation of the algorithms would use this information to make all the predictions more

accurate. Thus, in these algorithms, every user unknowingly collaborates to produce better

recommendations for everyone.

Genre Space

An underlying principle in recommender dimensionality reduction is that movies will tend

to â€œclusterâ€ together in genre-space. That is, movies that share similar features will be described by

similar genre vectors. If we could visualize genre-space, we would see similar films grouping

together. Fortunately we can get a glimpse of genre space by running the algorithms, setting the

number of genres to two, and plotting the results on a two-way scatterplot. Then each dimension

would act as a kind of aggregate genre that describes the films. Figures 7 and 8 plot the films by

their values of aggregate genre. While it is difficult to interpret exactly what these two genres really

represent, we are more interested in the clustering effects.

Indeed, we see this clustering to some extent in Figure 7 (note that the axes merely represent

relative levels, so they can take on any value). Both The Godfather and The Godfather: Part II fall on the

far left, while both Star Trek III and Star Trek VI fall on the far right. Moreover, the comedies Life of

Brian and Raising Arizona both appear in the middle. With a few small differences, we see similar

movie groupings in Figure 8, which comes from the SVD approach. In fact, Figure 8 looks like

nothing more than a 90-degree clockwise rotation of Figure 7, with a few tweaks, suggesting that the

two algorithms are discovering the same genres.

We must be careful how we interpret these plots when we use only two genres. Did The

Lion King and Jaws group together because they are both about ferocious animals? It would be

humorous, but probably untrue. Did Twelve Monkeys and Back to the Future group together because

they are both about time travel? Possibly. We would need to compare them in higher dimensions

to know for sure. And perhaps with more genres we would see The Lion King and Babe pull away

from the more serious and adult-oriented movies. Nevertheless, the clustering effects we see here

indicate that users with similar tastes rate movies similarly, allowing the algorithm to separate movies

into different categories based on this implicit collaboration effect. This is the true secret of the

matrix factorization approach!

9

Recommendations

We cannot venture into recommender systems without finally making recommendations. In

order to demonstrate how the algorithms provide predicted ratings, I generated a small rating matrix

of six movies and eight users from the MovieLens data (Table 1). Some users have rated only one

movie, some have rated a few, and one has rated all of them. The results of the predictions are

presented in Tables 2 and 3. The most striking feature is how closely the predictions match the

actual ratings for both methods, though linear regression appears to have done slightly better.

Whether the user has rated only one movie or all of them, the predictions are impressively accurate.

There is, however, a difference in how the two methods predict ratings for movies the users have not

seen. For example, regression predicts user H will give Top Gun a rating of 1.000, while SVD

predicts a 3.439 for the same. In effect, one method would tell user H that Top Gun will be his least

favorite, and the other method would tell him it will be his favorite among those not yet rated. This

is a curious result, especially given the similar genre plots in Figures 7 and 8. This result implies that

the two algorithms have found different matrices that provide similarly good predictions for alreadyrated movies but different predictions for unrated movies. It seems that, because the two methods

begin from dramatically different component matrices, the gradient descent has identified two

different local minima.

Returning to the original 20×200 rating matrix, we can again see the difference in the two

methods in Table 4. For the same user, the two methods have mixed opinions about what the user

will enjoy. In some cases the predictions are similar, in order if not in absolute rating, while in other

cases the algorithms make rather different predictions. Again, this result may occur because the

methods have settled into different minima. In fact, neither method may be fully correct. It is

common in these kinds of machine learning problems to combine or average the results of many

different models. It may be that the best predictions come from a blend of both the regression and

SVD approaches, but that is an avenue for future work.

Summary and Conclusion

The goal of this project was to explore two different but related methods for building a

recommender system around user ratings for movies. The objective was to make accurate

predictions of already-rated movies under the assumption that the accompanying predictions for

unrated movies would also be accurate. One approach was linear regression with simultaneous

parameter estimation. It began with genre and preference matrices initialized to small random

values and then used gradient descent in order to minimize the sum of squared errors. The other

approach was a singular value decomposition of the original rating matrix and the selection of the

most important genres discovered by the decomposition. The predictions that resulted from the

SVD were then updated with another form of gradient descent.

The two methods performed well but differed in some of their recommendations. This

result indicates that the algorithms settled at different local equilibria. It is possible that one is

â€œmore correctâ€ than the other, or that they may both be two pieces of the same puzzle whose

solution is more complete with a combination of the methods. Indeed, the winners of the Netflix

Prize Contest ultimately used an ensemble of 107 different methods to make the winning

predictions. Machine learning tools such as my algorithms are rarely complete on their own.

Moreover, a larger computer with more processing power would be able to deal with a larger rating

matrix and take advantage of the additional information afforded by including more users and

movies. Nevertheless, this experiment has laid the foundations for a reasonably reliable

recommender system that employs the collaboration of individual users for the benefit of all.

10

Figure 1. Effect of matrix size on computation time.

Figure 2. Effect of number of iterations on computation.

11

Figure 3. RMSE for different number of iterations of gradient descent.

Figure 4. RMSE for different learning rates.

12

Figure 5. RMSE for different number of genres.

Figure 6. Predicted and actual ratings for bad and good models.

13

Figure 7. Movies in genre-space, regression approach.

14

Figure 8. Movies in genre-space, SVD approach.

15

Table 1. Sample data set with actual user ratings.

Movie \ User A B C D E F G H

Phenomenon 3 3 4

Sneakers 3 2

Titanic 5 4 4 4 5

Top Gun 2 5

Cool Hand Luke 4 3 4

Return of the Jedi 4 5 5 5 2 3

Table 2. Prediction results from regression approach (already-rated movies in bold).

Movie \ User A B C D E F G H

Phenomenon 2.675 3.571 2.995 3.305 2.999 3.999 2.063 2.841

Sneakers 2.502 3.320 2.624 3.118 2.996 3.426 1.998 2.719

Titanic 3.127 4.996 3.838 3.944 4.000 3.997 3.982 5.000

Top Gun 2.545 2.150 2.421 3.008 2.001 4.995 1.000 1.000

Cool Hand Luke 3.184 4.465 3.401 3.997 4.000 4.092 3.024 3.974

Return of the Jedi 3.997 4.481 3.265 4.996 4.996 5.000 1.993 3.008

Table 3. Prediction results from SVD approach (already-rated movies in bold).

Movie \ User A B C D E F G H

Phenomenon 3.364 3.672 3.000 3.503 2.885 3.943 2.937 3.417

Sneakers 2.532 2.829 2.293 2.729 2.909 2.371 1.887 2.606

Titanic 4.380 4.999 3.932 4.287 4.112 4.022 4.050 5.000

Top Gun 3.506 3.714 3.078 3.570 2.023 5.000 3.441 3.439

Cool Hand Luke 3.699 4.155 3.338 3.867 3.955 3.486 2.986 3.943

Return of the Jedi 3.999 4.140 3.626 4.999 5.000 5.000 2.028 3.013

16

Table 4. User preferences and ordered predictions with rating in parentheses.

User Preference

(Actual Rating)

Regression Prediction

(Predicted Rating)

SVD Prediction

(Predicted Rating)

User 1

The Godfather (5) Leaving Las Vegas (5.000) Star Trek VI (4.730)

The Godfather II (5) Lone Star (4.016) Primal Fear (4.618)

True Lies (5) Babe (3.934) Babe (4.486)

Goodfellas (4) Lion King (3.857) Die Hard (4.076)

Back to the Future (4) Primal Fear (2.915) Raising Arizona (3.757)

Silence of the Lambs (4) Star Trek VI (2.293) Lone Star (3.559)

Jaws (4) Die Hard II (2.245) Star Trek III (3.038)

First Wives Club (3) Raising Arizona (2.164) Leaving Las Vegas (1.787)

Twelve Monkeys (3) Life of Brian (1.932) Lion King (1.000)

Dr. Strangelove (2) Star Trek III (1.609) Life of Brian (1.000)

User 2

The Godfather (5) Babe (5.000) Leaving Las Vegas (5.000)

The Godfather II (5) Lion King (4.504) Raising Arizona (5.000)

Jaws (5) Lone Star (4.273) Back to the Future (5.000)

Goodfellas (4) Dr. Strangelove (4.067) Lone Star (4.482)

True Lies (4) Back to the Future (3.689) Primal Fear (3.890)

Silence of the Lambs (4) Leaving Las Vegas (3.577) Babe (3.083)

Die Hard II (3) Life of Brian (3.462) Lion King (3.071)

Twelve Monkeys (3) First Wives Club (2.303) Life of Brian (2.647)

Star Trek III (2) Raising Arizona (1.969) First Wives Club (2.640)

Star Trek VI (2) Primal Fear (1.737) Dr. Strangelove (2.574)

User 3

Dr. Strangelove (5) Life of Brian (5.000) Lion King (5.000)

Twelve Monkeys (5) Goodfellas (4.939) Life of Brian (5.000)

The Godfather (4) The Godfather II (3.755) The Godfather II (3.403)

Back to the Future (4) True Lies (3.379) Goodfellas (3.382)

Silence of the Lambs (4) Die Hard II (3.219) Lone Star (2.699)

Jaws (4) Lion King (3.190) Star Trek III (2.660)

Raising Arizona (3) Lone Star (2.947) True Lies (2.496)

Leaving Las Vegas (3) First Wives Club (2.057) Primal Fear (2.481)

Babe (2) Primal Fear (1.422) First Wives Club (2.031)

Star Trek VI (1) Star Trek III (1.000) Die Hard II (1.340)

17

APPENDIX: CODE

QUICK GUIDE TO R

Symbol/Term Meaning

# Comment character

<- Stores value to name on left

$ Calls sub-object from object

( ) Denotes quantity or the use of a function

[ ] Subsets vector or matrix

[[ ]] Subsets a list

{ } Denotes boundary of the body of a function or loop

= Sets levels of arguments of a function

== Logical, equal to

!= Logical, not equal to

: Creates sequence of integers, e.g. 1:20 is 1, 2, â€¦, 20

* Element-wise multiplication

%*% Matrix multiplication

… Passes additional arguments to functions downstream

c( ) Creates vector with supplied values

lapply/sapply Loops over first argument using the following function

##############################################

#### FUNCTION TO RETRIEVE RATING MATRIX ####

##############################################

# film.list = the list of all films (mov.list)

# min.n.ratings = the minimum number of ratings a movie can have to be included

# n.films = number of films in final data set

# n.users = number of users in final data set

# rows = which variable, movies or users, should be the “observations” i.e. rows

# This function will never generate the same data set twice.

get.pops <- function(film.list, min.n.ratings, n.films, n.users, rows = “movie”){

# which movies have >= min.n.ratings

popular.i <- which(sapply(film.list, nrow) >= min.n.ratings)

# sample from them to get random subset of most popular movies

sample.pop <- sample(popular.i, n.films, replace = FALSE)

# get smaller list of movies we want

new.list <- film.list[sample.pop]

# bind them into a single data set (no longer a list)

long.frame <- do.call(rbind, new.list)

# get wide data frame by reshaping

wide.frame <- reshape(long.frame, timevar = “movie_id”, idvar = “user”, direction = “wide”,

drop = “time”)[, -1]

# subset to desired number of users

user.index <- sample(1:nrow(wide.frame), n.users, replace = FALSE)

wide.frame <- wide.frame[user.index, ]

# return the results

if(rows == “movie”) return(t(wide.frame)) else

if(rows == “user”) return(wide.frame)

}

#############################################

#### FUNCTION TO RETRIEVE MOVIE TITLES ####

#############################################

18

# This function takes the matrix of ratings and the title data set

# and matches them so I know which movies were selected.

get.titles <- function(rating.matrix, title.frame){

# get character vector of row names from rating matrix

row.names <- rownames(rating.matrix)

# pull off the movie ID number from the row names

ids <- gsub(“rating.”, “”, row.names)

# match this ID number with the movie title from the title data set

index <- sapply(ids, function(i) which(title.frame$movie_id == i))

# return the results

title.frame$title[index]

}

############################################

#### FUNCTION TO MAKE RECOMMENDATIONS ####

############################################

# This function takes the matrix of predictions and recommends movies

# sorted by level of predicted enjoyment. It determines which movies

# the user has seen, and which the user has not. It returns, for each user,

# the movies that were seen, and those that were not in order of prediction.

recommend <- function(result, setup){

# Which movies the user has seen

seen <- lapply(1:ncol(setup$Ind), function(i) which(setup$Ind[,i]!=0))

# What the user rated those movies

rated <- lapply(1:length(seen), function(i) setup$Y[,i][seen[[i]]])

# The order of those movies by rating (i.e. by preference)

pref.order <- lapply(rated, sort, decreasing = TRUE)

# Which movies the user has not seen

unseen <- lapply(1:ncol(setup$Ind), function(i) which(setup$Ind[,i]==0))

# What the function predicted for movies not seen

predicted <- lapply(1:length(unseen), function(i) result$pred[,i][unseen[[i]]])

# The order of those movies by prediction

pred.order <- lapply(predicted, sort, decreasing = TRUE)

# Return the results in a list, one element per user

lapply(1:length(seen), function(i) list(pref.order[[i]], pred.order[[i]]))

}

###################################################

#### FUNCTIONS TO PLOT MOVIE TITLES BY GENRE ####

###################################################

# These functions perform the algorithms with 2 genres.

# Movies are then plotted in genre-space in order to

# see how the algorithms discover latent movie characteristics.

gplot.reg <- function(reg.setup, …){

results <- regression(reg.setup, 2, …)

dev.new()

plot(results$G[,1], results$G[,2], xlab = “Genre 1”, ylab = “Genre 2”,

main = “Movies by Value of Latent Genres, Regression”)

text(results$G[,1], results$G[,2], rownames(results$G), cex = 0.7, pos = 1)

}

gplot.svd <- function(svd.setup, …){

results <- svd.descent(svd.setup, 2, …)

19

dev.new()

plot(results$G[,1], results$G[,2], xlab = “Genre 1”, ylab = “Genre 2”,

main = “Movies by Value of Latent Genres, SVD”)

text(results$G[,1], results$G[,2], rownames(results$G), cex = 0.7, pos = 1)

}

############################################

#### FUNCTIONS FOR COMPUTATIONAL COST ####

############################################

# Regression computation time for matrices of different sizes

time.dim.reg <- function(film.list, max.dim = 100, …){

# create vector of desired matrix dimensions

dims <- seq(20, max.dim, by = 10)

# make largest rating matrix

maxdim <- max(dims)

data.big <- get.pops(film.list, min.n.ratings = 216, 20, maxdim)

# make smaller rating matrices as subsets of largest, looping over dims

ratings <- lapply(dims, function(dim) data.big[, 1:dim])

# make list of setups

setups <- lapply(ratings, function(i) reg.setup(i, rownames(i)))

# run regression for each size of rating matrix, timing each

times <- sapply(setups, function(i)

system.time(regression(i, n.genres = 10, …))[3])

# return data frame with matrix dimension and time elapsed

data.frame(dim = dims, seconds = times)

}

# Regression computation time for different numbers of iterations

time.its.reg <- function(setup, max.its = 200000, …){

# Create vector of number of iterations

its <- seq(10000, max.its, by = 10000)

# Run regression for each number of iterations, timing each

times <- sapply(its, function(i)

system.time(regression(setup, n.genres = 10, n.iters=i))[3])

# Return results

data.frame(its = its, seconds = times)

}

# SVD computation time for matrices of different sizes

time.dim.svd <- function(film.list, max.dim = 100, …){

# create vector of desired matrix dimensions

dims <- seq(20, max.dim, by = 10)

# make largest rating matrix

maxdim <- max(dims)

data.big <- get.pops(film.list, min.n.ratings = 216, 20, maxdim)

# make smaller rating matrices as subsets of largest, looping over dims

ratings <- lapply(dims, function(dim) data.big[, 1:dim])

# make list of setups

setups <- lapply(ratings, function(i) svd.setup(i, rownames(i)))

# run regression for each size of rating matrix, timing each

times <- sapply(setups, function(i)

system.time(svd.descent(i, n.genres = 10, …))[3])

20

# return data frame with matrix dimension and time elapsed

data.frame(dim = dims, seconds = times)

}

# Regression computation time for different numbers of iterations

time.its.svd <- function(setup, max.its = 200000, …){

# Create vector of number of iterations

its <- seq(10000, max.its, by = 10000)

# Run regression for each number of iterations, timing each

times <- sapply(its, function(i)

system.time(svd.descent(setup, n.genres = 10, n.iters=i))[3])

# Return results

data.frame(its = its, seconds = times)

}

####################################################

#### FUNCTIONS FOR LINEAR REGRESSION APPROACH ####

####################################################

# This function establishes indicator matrix (1 if movie has been rated, else 0)

# and special rating matrix (0 if missing value, else actual rating)

# to be used for later calculations.

reg.setup <- function(data, data.titles){

# Ensure data is the right kind of object (matrix)

if(class(data) != “matrix”) Y <- as.matrix(data)

# Get number of users and movies from matrix

n.movies <- nrow(data)

n.users <- ncol(data)

# Create indicator matrix, 1 if rated, 0 if not rated

Ind <- ifelse(is.na(data), 0, 1)

# Replace missing (non-rated) values in rating matrix with 0

Y <- ifelse(is.na(data), 0, data)

# Name the rows with the titles so they can be used later

rownames(Ind) <- data.titles

rownames(Y) <- data.titles

# Return a list with results

list(nm = n.movies, nu = n.users, Ind = Ind, Y = Y)

}

# This function performs the linear regression with gradient descent.

regression <- function(setup, n.genres, alpha = 0.001, lambda = 0.005, n.iters = 10000){

# Restrict number of genres to be no greater than number of movies

if(n.genres > nrow(setup$Y)) stop(“Cannot have more genres than movies.”)

# Initialize parameters to small random values

G <- matrix(runif(setup$nm * n.genres, -0.1, 0.1), nrow = setup$nm)

P <- matrix(runif(setup$nu * n.genres, -0.1, 0.1), nrow = setup$nu)

# Shorten some names used later

Y <- setup$Y

Ind <- setup$Ind

# Perform iterative gradient descent

for(i in 1:n.iters){

GPY <- G %*% t(P) – Y

G <- G – alpha * ((GPY * Ind) %*% P + lambda * G)

P <- P – alpha * ((t(GPY) * t(Ind)) %*% G + lambda * P)

21

}

# Compute final prediction, bounding predictions between 1 and 5

pred <- G %*% t(P)

pred <- ifelse(pred < 1, 1, pred)

pred <- ifelse(pred > 5, 5, pred)

# Compute RMSE

rmse <- sqrt(1/(sum(Ind))*sum(((pred – setup$Y)^2) * Ind))

# Return list with resutls

list(G = G, P = P, pred = pred, rmse = rmse)

}

######################################

#### FUNCTIONS FOR SVD APPROACH ####

######################################

# Unless otherwise annotated, these work similarly to

# the functions for linear regression.

svd.setup <- function(data, data.titles){

if(class(data) != “matrix”) Y <- as.matrix(data)

n.movies <- nrow(data)

n.users <- ncol(data)

Ind <- ifelse(is.na(data), 0, 1)

# compute mean rating for each movie

means <- rowMeans(data, na.rm = TRUE)

# for each movie replace missing values with mean rating for that movie

Y <- t(sapply(1:nrow(data), function(i)

ifelse(is.na(data[i,]), means[i], data[i,])))

rownames(Ind) <- data.titles

rownames(Y) <- data.titles

list(nm = n.movies, nu = n.users, Ind = Ind, Y = Y)

}

svd.descent <- function(setup, n.genres, alpha = 0.00001, n.iters=50000) {

if(n.genres > nrow(setup$Y)) stop(“Cannot have more genres than movies.”)

# Start with a singular value decomposition of the rating matrix

svd.mat <- svd(setup$Y)

# Choose the most important movie genres and sigmas as determined by the SVD

G <- svd.mat$u[, 1:n.genres]

P <- svd.mat$v[, 1:n.genres]

sigma <- diag(svd.mat$d[1:n.genres])

# Multiply the movie genre and sigma matrices together.

# We have now established our initial genre and preference

# matrices, G.s and P, for gradient descent.

G.s <- G %*% sigma

Ind <- setup$Ind

for(i in 1:n.iters){

error <- (G.s %*% t(P) – setup$Y) * Ind

G.s <- G.s – alpha * (error %*% P)

P <- P – alpha * (t(error) %*% G.s)

}

pred <- G.s %*% t(P)

pred <- ifelse(pred < 1, 1, pred)

pred <- ifelse(pred > 5, 5, pred)

rmse <- sqrt(1/(sum(Ind))*sum(((pred – setup$Y)^2) * Ind))

list(G = G.s, P = P, pred = pred, rmse = rmse)

}

22

############################

#### SET UP WORKSPACE ####

############################

# Set working directory where files will be stored

setwd(“C:/Users/Adam/Desktop/biostat/mat167/project/”)

# Read in movie data (creates data set from Movie Lens data)

mov <- read.table(“movie_lens_cols.txt”)

# Change names of variables

names(mov) <- c(“user”, “movie_id”, “rating”, “time”)

# Read in movie titles (creates data set of titles from Movie Lens data)

all.titles <- read.table(“titles.txt”, sep=”|”, quote=””)

all.titles <- titles[, 1:2] # grab only the relevant info: move ID and title

names(all.titles) <- c(“movie_id”, “title”)

# Make list of ratings by movie

# (in other words separate user ratings into groups by movie)

mov.list <- lapply(unique(mov$movie_id), function(id) subset(mov, movie_id==id))

# Now I use a custom function to get a rating matrix from among

# the most-rated movies. I use most-rated movies because I take

# a random subset of movies and users due to computational limitations, and

# I require at least one rating for each movie and user. A more dense

# matrix is easier to work with the purpose of this project.

ratings <- get.pops(mov.list, min.n.ratings=150, n.films=20, n.users=200)

# Now I use a custom function to get the titles of the selected movies.

titles <- get.titles(ratings, all.titles)

# Use custom functions to set up data that will be passed to

# my algorithms that perform the regression and SVD approaches.

for.reg <- reg.setup(ratings, titles)

for.svd <- svd.setup(ratings, titles)

# Perform small test example with a few users and movies

# for expository purposes

ratings.small <- get.pops(mov.list, 150, 6, 8)

titles.small <- get.titles(ratings.small, all.titles)

for.reg.small <- reg.setup(ratings.small, titles.small)

for.svd.small <- svd.setup(ratings.small, titles.small)

regression(for.reg.small, 3, n.iters = 20000)$pred

svd.descent(for.svd.small, 3, n.iters = 100000)$pred

############################

#### COMPUTATION TIME ####

############################

# Time regression and SVD approach on rating matrices with different number

# of users to determine matrix size on computational cost.

reg.times.dim <- time.dim.reg(mov.list, max.dim = 500)

svd.times.dim <- time.dim.svd(mov.list, max.dim = 500, n.iters=10000)

reg.times.its <- time.its.reg(for.reg)

svd.times.its <- time.its.svd(for.svd)

# Plot the results

plot(reg.times.dim, type = “l”, xlab = “number of users”,

ylab = “seconds to completion”, lwd = 3,

main = “Time for 10,000 Iterations with 20 Movies”, ylim = c(0, 14))

lines(svd.times.dim, lty = 2, lwd = 3)

legend(“topleft”, legend=c(“Regression”, “SVD”), lty=c(1, 2), lwd=2)

plot(reg.times.its, type = “l”, xlab = “number of iterations”,

23

ylab = “seconds to completion”, lwd = 3,

main = “Time for 200 Users and 20 Movies”, ylim = c(0, 120))

lines(svd.times.its, lty = 2, lwd = 3)

legend(“topleft”, legend=c(“Regression”, “SVD”), lty=c(1, 2), lwd=2)

#############################################

#### PLOT MOVIE TITLES BY LATENT GENRE ####

#############################################

# Use custom function to plot movies in genre space

gplot.reg(for.reg)

gplot.svd(for.svd)

#######################################################

#### BEHAVIOR OF RMSE BY DIFFERENT NUMBER GENRES ####

#######################################################

# Create vector of genres from 2 to 20

genres <- 2:20

# Fix function defaults, but loop over different number of genres

rmse.genre.reg <- sapply(genres, function(i) regression(for.reg, i)$rmse)

rmse.genre.svd <- sapply(genres, function(i) svd.descent(for.svd, i)$rmse)

plot(genres, rmse.genre.reg, xlab = “number of genres”, ylab = “RMSE”,

main = “RMSE by Number of Genres”, type = “l”, lwd = 3)

lines(genres, rmse.genre.svd, lty = 2, lwd = 3)

legend(“topright”, legend=c(“Regression”, “SVD”), lty=c(1, 2), lwd=2)

#############################################

#### BEHAVIOR OF RMSE BY LEARNING RATE ####

#############################################

# Create vector of learning rates appropriate to each method

alphas.reg <- seq(0.0005, 0.0035, by = 0.00025)

alphas.svd <- seq(2e-5, 9e-5, by = 1e-5)

# Fix function defaults, but loop over different learning rates

rmse.alpha.reg <- sapply(alphas.reg, function(i) regression(for.reg, 10, alpha = i)$rmse)

rmse.alpha.svd <- sapply(alphas.svd, function(i) svd.descent(for.svd, 10, alpha = i)$rmse)

plot(alphas.reg, rmse.alpha.reg, xlab = “Learning rate”, ylab = “RMSE”,

main = “RMSE by Learning Rate, Regression”, type = “l”, lwd = 3)

plot(alphas.svd[-8], rmse.alpha.svd[-8], xlab = “Learning rate”, ylab = “RMSE”,

main = “RMSE by Learning Rate, SVD”, type = “l”, lwd = 3, lty=2)

########################################################

#### BEHAVIOR OF RMSE BY REGULARIZATION PARAMETER ####

########################################################

# Create vector of regulariazation parameters

lambdas <- seq(0.1, 3, by = 0.1)

# Fix alpha = 0.001, genres = 10, iterations = 10000

rmse.lambda.reg <- sapply(lambdas, function(i) regression(for.reg, 10, lambda = i)$rmse)

plot(lambdas, rmse.lambda.reg, xlab = “Value of regularization parameter”, ylab = “RMSE”,

main = “RMSE by Regularization Parameter, Regression”, type = “l”, lwd = 3)

####################################################

#### BEHAVIOR OF RMSE BY NUMBER OF ITERATIONS ####

####################################################

# Create vector of number of iterations with the goal of plotting on same graph

its.reg <- c(seq(1000, 20000, by = 2000), seq(40000, 200000, by = 20000))

its.svd <- seq(10000, 200000, by = 10000)

24

# Fix function defaults, but loop over different number of iterations

rmse.its.reg <- sapply(its.reg, function(i) regression(for.reg, 10, n.iters = i)$rmse)

rmse.its.svd <- sapply(its.svd, function(i) svd.descent(for.svd, 10, n.iters = i)$rmse)

plot(its.reg, rmse.its.reg, xlab = “Number of iterations”, ylab = “RMSE”,

main = “RMSE by Number of Iterations”, type = “l”, lwd = 3,

ylim = c(0.05, max(rmse.its.svd)))

lines(its.svd, rmse.its.svd, lty=2, lwd=3)

legend(“topright”, legend = c(“Regression”, “SVD”), lty = c(1, 2), lwd = 2)

##################################################

#### DISTRIBUTION OF RMSE FOR A GIVEN MODEL ####

##################################################

# Compute RMSE multiple times to get distribution

rmse.dist.reg <- sapply(1:200, function(i) regression(for.reg, 10)$rmse)

plot(density(rmse.dist.reg), xlab = “Value of RMSE”,

main = “Sample Distibution of RMSE for a Given Model, Regression”, lwd = 3)

###############################################

#### PREDICTED VALUES VERSUS TRUE VALUES ####

###############################################

# REGRESSION

# Choose a bad model and a good model and plot prediction versus actual

model.bad.reg <- regression(for.reg, 2, alpha = 0.001, lambda = 0.01, n.iters = 100)

model.good.reg <- regression(for.reg, 15, alpha = 0.0025, lambda = 0.005, n.iters = 10000)

plot(ratings[2,], model.bad.reg$pred[2,], xlab = “Actual rating”, ylab = “Predicted rating”,

main = “Poor Model Choice, Regression (Lion King)”)

plot(ratings[2,], model.good.reg$pred[2,], xlab = “Actual rating”, ylab = “Predicted rating”,

main = “Good Model Choice, Regression (Lion King)”)

# SVD

# Choose a bad model and a good model and plot prediction versus actual

model.bad.svd <- svd.descent(for.svd, 2, alpha = 1e-6, n.iters = 10000)

model.good.svd <- svd.descent(for.svd, 15, alpha = 6e-5, n.iters = 200000)

plot(ratings[2,], model.bad.svd$pred[2,], xlab = “Actual rating”, ylab = “Predicted rating”,

main = “Poor Model Choice, SVD (Lion King)”)

plot(ratings[2,], model.good.svd$pred[2,], xlab = “Actual rating”, ylab = “Predicted rating”,

main = “Good Model Choice, SVD (Lion King)”)

################################

#### MAKE RECOMMENDATIONS ####

################################

# Store prediction results for regression and SVD approaches

result.reg <- regression(for.reg, 10)

result.svd <- svd.descent(for.svd, 10, n.iters = 100000)

# Store recommendations for all users

rec.reg <- recommend(result.reg, for.reg)

rec.svd <- recommend(result.svd, for.svd)

# Pick a few users and look at recommendations by each approach

rec.reg[c(29, 33, 38)]

rec.svd[c(29, 33, 38)]