The NMF Procedure

Example 15.2 Making Recommendations Using Matrix Completion

This example illustrates how you can use the NMF procedure to build a simple recommender system that predicts the preference (or rating) of a user for an item and makes recommendations based on that prediction. The data in this example are derived from the MovieLens data set, which was developed by the GroupLens project at the University of Minnesota and is available at http://grouplens.org/datasets/movielens. This example uses the MovieLens 100K version. You can download the compressed archive file from http://files.grouplens.org/datasets/movielens/ml-100k.zip and use any third-party unzip tool to extract all the files in the archive to the destination directory of your choice.[1]

The file that contains the movie ratings is u.data, which has four tab-separated columns: user ID, item ID (each item is a movie), rating, and timestamp. The data set is very sparse: it holds 100,000 ratings from 943 users on 1,682 movies, so only about 6% of the user-movie combinations are rated. Assuming that the file is located in the directory /path/to/your/directory, the following statements invoke the PYTHON procedure to load the data from the directory, store it in dense matrix format, and transfer it to a data table named mycas.ratings in your CAS session. The SAS.df2sd callback method in PROC PYTHON transfers data from a pandas DataFrame to a SAS data set. The CAS engine libref is named mycas, but you can substitute any appropriately defined CAS engine libref. The mycas.ratings data table contains the column UserID followed by one column per movie (M1, M2, M3, and so on). Each row contains the ratings that one user gave to the movies; if the user did not rate a movie, the corresponding value is missing.

proc python;
   submit;

import pandas as pd
import numpy as np

# load data from the file (user ID, movie ID, and rating; the timestamp is skipped)
colname = ['userID', 'movieID', 'rating']
colpick = [0, 1, 2]
df = pd.read_csv('/path/to/your/directory/u.data', delimiter='\t',
                 usecols=colpick, names=colname)

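# store data in dense matrix format:
# one row per user; column 0 holds the user ID, columns 1..ncol-1 hold the ratings
nrow = df['userID'].max()
ncol = df['movieID'].max() + 1
mat = np.full((nrow, ncol), np.nan)

# user IDs 1..nrow go in column 0
mat[:, 0] = np.arange(1, nrow + 1)

# place each rating at (userID-1, movieID); unrated cells stay NaN
mat[df['userID'].to_numpy() - 1, df['movieID'].to_numpy()] = df['rating'].to_numpy()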

# transfer data to a SAS data table in CAS
cols = ['UserID'] + ['M%d' %i for i in range(1, ncol)]
matdf = pd.DataFrame(mat, columns=cols)

SAS.df2sd(matdf, 'mycas.ratings')

   endsubmit;
run;
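
The long-to-wide reshaping that the preceding step performs can also be tried outside SAS on a few rows. The following is a minimal sketch that uses pandas pivot on made-up ratings; the values and the variable names long_df, wide, and sparsity are illustrative assumptions and are not part of the example above.

```python
import numpy as np
import pandas as pd

# Toy ratings in the same long layout as u.data (made-up values).
long_df = pd.DataFrame({
    'userID':  [1, 1, 2, 3],
    'movieID': [1, 3, 2, 3],
    'rating':  [5, 3, 4, 1],
})

n_users = int(long_df['userID'].max())
n_movies = int(long_df['movieID'].max())

# One row per user, one column per movie; unrated cells become NaN.
wide = (long_df
        .pivot(index='userID', columns='movieID', values='rating')
        .reindex(index=range(1, n_users + 1),
                 columns=range(1, n_movies + 1)))
wide.columns = ['M%d' % j for j in wide.columns]
wide = wide.rename_axis('UserID').reset_index()

# Fraction of user-movie cells that have no rating.
sparsity = wide.drop(columns='UserID').isna().to_numpy().mean()

print(wide)
print('missing fraction: %.2f' % sparsity)
```

The reindex call keeps a row for every user ID and a column for every movie ID up to the maximum, even when an ID never appears in the ratings, which matches the UserID, M1, M2, M3 layout described for mycas.ratings.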

You specify the IMPUTE statement in the NMF procedure to enable low-rank matrix completion, which recovers the missing entries in the mycas.ratings data table. The following statements invoke the NMF procedure for this data table and write the imputation results to the output data table mycas.outX. Because the file that contains the movie genres, u.genre, lists 19 genres, the PROC NMF statement specifies RANK=19 to compute 19 feature vectors during the factorization. The mycas.ratings data table is very sparse. To ensure convergence for low-rank matrix completion and to mitigate the impact of the initial factor matrices on the imputation results, the PROC NMF statement specifies 600 as the maximum number of iterations for the APG method by using the MAXITER= suboption, and it requests the L2-norm regularization method by using the REG= option. The IMPUTEDROWSONLY and PREDONLY options in the IMPUTE statement keep only the rows that contain imputed values in the output data table mycas.outX and set the observed values in those rows to missing, so that only predicted ratings remain.

proc nmf data=mycas.ratings rank=19 seed=6789
         method=apg(maxiter=600) reg=L2(alpha=5 beta=5);
   var m:;
   impute out=mycas.outX imputedRowsOnly predOnly copyvar=UserID;
run;
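
To make the idea of low-rank matrix completion concrete, the following NumPy sketch factors a small ratings matrix by using only its observed cells and then reads predictions for the missing cells from the product of the two factors. It is an illustration under stated assumptions, not the algorithm that PROC NMF runs: it uses simple multiplicative updates instead of the APG method, the function masked_nmf is hypothetical, and the penalty weights lam_w and lam_h are not claimed to correspond to the ALPHA= and BETA= suboptions.

```python
import numpy as np

def masked_nmf(X, rank, lam_w=0.1, lam_h=0.1, n_iter=200, seed=0):
    """Factor X ~ W @ H with W, H >= 0, fitting only the observed (non-NaN) cells.

    Hypothetical helper for illustration; lam_w and lam_h are L2 penalty weights.
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)                  # True where a rating is observed
    Xz = np.where(mask, X, 0.0)          # zero-fill so missing cells contribute nothing
    W = rng.random((X.shape[0], rank)) + 0.1
    H = rng.random((rank, X.shape[1])) + 0.1
    eps = 1e-12                          # guards against division by zero
    history = []
    for _ in range(n_iter):
        # Multiplicative updates for the masked, L2-penalized squared error.
        WH = np.where(mask, W @ H, 0.0)
        W *= (Xz @ H.T) / (WH @ H.T + lam_w * W + eps)
        WH = np.where(mask, W @ H, 0.0)
        H *= (W.T @ Xz) / (W.T @ WH + lam_h * H + eps)
        resid = np.where(mask, X - W @ H, 0.0)
        history.append(0.5 * (resid ** 2).sum()
                       + 0.5 * lam_w * (W ** 2).sum()
                       + 0.5 * lam_h * (H ** 2).sum())
    return W, H, mask, history

# Made-up 4 x 4 ratings matrix; NaN marks an unrated movie.
X = np.array([[5, 4, np.nan, 1],
              [4, np.nan, 1, 1],
              [1, 1, np.nan, 5],
              [np.nan, 1, 5, 4]], dtype=float)

W, H, mask, history = masked_nmf(X, rank=2)
filled = W @ H                               # predictions for every cell
pred_only = np.where(mask, np.nan, filled)   # keep predictions, blank observed cells

print(np.round(pred_only, 2))
```

The last step mirrors the effect described for the PREDONLY option: cells that were observed are set to missing, so only the predicted ratings remain for ranking.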

The following statements invoke PROC PYTHON to do the following: fetch the first 10 observations from the mycas.outX data table, which contain the predicted ratings for 10 users (the SAS.sd2df callback method transfers data from a SAS data set to a pandas DataFrame); load information about the movies from the file u.item, which is located in the directory /path/to/your/directory; produce the top 10 recommended movies for the user in the 9th of those observations; and generate a table that contains the top 5 recommended movies for each of the 10 users:

proc python;
   submit;

import pandas as pd
import numpy as np
import csv

# fetch the first 10 observations
df = SAS.sd2df('mycas.outX(obs=10)')

# load information about the movies (field 0 is the movie ID, field 1 is the title)
movieDict = {}
with open('/path/to/your/directory/u.item', encoding='latin-1') as f:
   for row in csv.reader(f, delimiter='|'):
      key = 'M' + row[0]
      movieDict[key] = row[1]

# top 10 recommended movies for a single user
# (row index 8 is the 9th fetched observation)
row = 8
uid = df.iloc[row, 0]
rating = df.iloc[row, 1:].sort_values(ascending=False, inplace=False)
colUid = [uid]*10
colRank = np.arange(1, 11).tolist()
colMid = rating.index.tolist()[0:10]
colRate = rating.values.tolist()[0:10]
colTitle = []
for i in range(0, 10):
   colTitle.append(movieDict[rating.index[i]])

cols = ['UserID', 'Rank', 'MovieID', 'Title', 'PredictedRating']
topRating = pd.DataFrame(list(zip(colUid, colRank, colMid, colTitle, colRate)),
                         columns=cols)
SAS.df2sd(topRating, 'topRating')

# top 5 recommended movies for each of the 10 users
movies = []
for idx, rowSeries in df.iterrows():
   uid = rowSeries['UserID']
   # sort predicted ratings in descending order; missing values sort last
   ranked = rowSeries.drop('UserID').sort_values(ascending=False)
   movies.append([uid] + [movieDict[mid] for mid in ranked.index[:5]])

cols = ['UserID', '_1_', '_2_', '_3_', '_4_', '_5_']
topRecom = pd.DataFrame(movies, columns=cols)
SAS.df2sd(topRecom, 'topRecom')

   endsubmit;
run;
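
The ranking step can be checked on a small table before you run it on the real predictions. The following sketch assumes made-up predicted ratings and placeholder titles; the function top_n and all values are hypothetical. It drops missing predictions explicitly, which is one way to make sure that movies a user already rated (missing under PREDONLY) never appear in the recommendations.

```python
import numpy as np
import pandas as pd

# Made-up predicted ratings; NaN marks a movie the user already rated.
pred = pd.DataFrame({
    'UserID': [1, 2],
    'M1': [4.2, np.nan],
    'M2': [np.nan, 3.9],
    'M3': [3.1, 4.8],
    'M4': [4.7, 2.5],
})
titles = {'M1': 'Movie A', 'M2': 'Movie B', 'M3': 'Movie C', 'M4': 'Movie D'}

def top_n(pred, titles, n):
    """Return the n highest predicted ratings per user in long format."""
    rows = []
    for _, r in pred.iterrows():
        # Drop missing predictions, then keep the n largest values.
        scores = r.drop('UserID').astype(float).dropna().nlargest(n)
        for rank, (mid, score) in enumerate(scores.items(), start=1):
            rows.append((int(r['UserID']), rank, mid, titles[mid], score))
    cols = ['UserID', 'Rank', 'MovieID', 'Title', 'PredictedRating']
    return pd.DataFrame(rows, columns=cols)

top2 = top_n(pred, titles, 2)
print(top2)
```

The result uses the same columns as the topRating table, with one row per user and rank, so it can be filtered to a single user or reshaped into one row per user.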

The following statements print the top 10 recommended movies along with the predicted ratings for the user in the 9th fetched observation (UserID 25), as shown in Output 15.2.1:

proc print data=topRating;
run;

Output 15.2.1: Top 10 Recommended Movies with Predicted Ratings for Single User

Obs | UserID | Rank | MovieID | Title                            | PredictedRating
  1 |     25 |    1 | M313    | Titanic (1997)                   | 4.67129
  2 |     25 |    2 | M272    | Good Will Hunting (1997)         | 4.59829
  3 |     25 |    3 | M302    | L.A. Confidential (1997)         | 4.50687
  4 |     25 |    4 | M318    | Schindler's List (1993)          | 4.49296
  5 |     25 |    5 | M64     | Shawshank Redemption, The (1994) | 4.46565
  6 |     25 |    6 | M100    | Fargo (1996)                     | 4.33774
  7 |     25 |    7 | M515    | Boot, Das (1981)                 | 4.33103
  8 |     25 |    8 | M172    | Empire Strikes Back, The (1980)  | 4.31865
  9 |     25 |    9 | M483    | Casablanca (1942)                | 4.30335
 10 |     25 |   10 | M12     | Usual Suspects, The (1995)       | 4.30271


The following statements print the top 5 recommended movies (sorted by descending predicted ratings) for each of the 10 users, as shown in Output 15.2.2:

proc print data=topRecom;
run;

Output 15.2.2: Top 5 Recommended Movies for 10 Users

Obs | UserID | _1_ | _2_ | _3_ | _4_ | _5_
1 | 1 | Close Shave, A (1995) | Casablanca (1942) | Secrets & Lies (1996) | L.A. Confidential (1997) | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)
2 | 4 | Titanic (1997) | Good Will Hunting (1997) | L.A. Confidential (1997) | Full Monty, The (1997) | Apt Pupil (1998)
3 | 7 | Wrong Trousers, The (1993) | Close Shave, A (1995) | Wallace & Gromit: The Best of Aardman Animation (1996) | Philadelphia Story, The (1940) | Some Folks Call It a Sling Blade (1993)
4 | 10 | Schindler's List (1993) | Close Shave, A (1995) | Good Will Hunting (1997) | Boot, Das (1981) | To Kill a Mockingbird (1962)
5 | 13 | Vertigo (1958) | Close Shave, A (1995) | Wrong Trousers, The (1993) | Citizen Kane (1941) | Killing Fields, The (1984)
6 | 16 | Star Wars (1977) | Titanic (1997) | Good Will Hunting (1997) | Casablanca (1942) | Return of the Jedi (1983)
7 | 19 | Star Wars (1977) | Good Will Hunting (1997) | L.A. Confidential (1997) | Schindler's List (1993) | Godfather, The (1972)
8 | 22 | Usual Suspects, The (1995) | Shawshank Redemption, The (1994) | Schindler's List (1993) | Close Shave, A (1995) | Fargo (1996)
9 | 25 | Titanic (1997) | Good Will Hunting (1997) | L.A. Confidential (1997) | Schindler's List (1993) | Shawshank Redemption, The (1994)
10 | 28 | Titanic (1997) | Good Will Hunting (1997) | Godfather, The (1972) | Shawshank Redemption, The (1994) | L.A. Confidential (1997)




[1] Disclaimer: SAS may reference other websites or content or resources for use at Customer’s sole discretion. SAS has no control over any websites or resources that are provided by companies or persons other than SAS. Customer acknowledges and agrees that SAS is not responsible for the availability or use of any such external sites or resources, and does not endorse any advertising, products, or other materials on or available from such websites or resources. Customer acknowledges and agrees that SAS is not liable for any loss or damage that may be incurred by Customer or its end users as a result of the availability or use of those external sites or resources, or as a result of any reliance placed by Customer or its end users on the completeness, accuracy, or existence of any advertising, products, or other materials on, or available from, such websites or resources.

Last updated: February 14, 2022