파이썬 surprise 라이브러리의 SVD 모델

7. 파이썬 surprise 라이브러리의 SVD 모델

  • The prediction $\hat{r}_{ui}$ is set as:

$\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u$

If user \(u\) is unknown, then the bias \(b_u\) and the factors \(p_u\) are assumed to be zero. The same applies for item \(i\) with \(b_i\) and \(q_i\).

  • 최소화는 SGD 사용 $$\begin{split}b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)\\ b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)\\ p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u)\\ q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i)\end{split}$$
  • 주요 파라미터
    • n_factors – The number of factors. Default is 100.
    • n_epochs – The number of iteration of the SGD procedure. Default is 20.
    • lr_all – The learning rate for all parameters. Default is 0.005

7.1. MovieLens 데이터를 기반으로 하는 실제 예제

MovieLens 데이터 로드

import os
import pandas as pd
from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')
df = pd.DataFrame(data.raw_ratings, columns=["user", "item", "rate", "id"])
df.head()

트레이닝 + 모델 저장

import os

from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = SVD()
algo.train(trainset)

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Dump algorithm and reload it.
file_name = os.path.expanduser('dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)

# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')

정확도 계산

from surprise import Dataset
from surprise import SVD
from surprise import accuracy


data = Dataset.load_builtin('ml-100k')

algo = SVD()

trainset = data.build_full_trainset()
algo.train(trainset)

testset = trainset.build_testset()
predictions = algo.test(testset)

accuracy.rmse(predictions)

모델 최적화 (파라미터 튜닝)

  • surprise의 GridSearch class
    • 주요 파라미터
      • algo_class (AlgoBase) – A class object of of the algorithm to evaluate.
      • param_grid (dict) – The dictionary has algo_class parameters as keys (string) and list of parameters as the desired values to try. All combinations will be evaluated with desired algorithm.
      • measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
      • verbose (int) – Level of verbosity. If 0, nothing is printed. If 1, accuracy measures for each parameters combination are printed, with combination values. If 2, folds accuracy values are also printed. Default is 1
import random

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import GridSearch


# Load the full dataset.
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)

# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = GridSearch(SVD, param_grid, measures=['RMSE'], verbose=1)
print(grid_search.best_params)
grid_search.evaluate(data)

사용자별 영화 추천 예 - offline 방식으로 별도 테이블을 만들어서 저장하고, 해당 사용자 로그인시 웹에서 추천 시나리오

from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.train(trainset)

# Than predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

추천성능 평가

  1. RMSE (Root Mean Squared Error) : 평균 제곱근 오차 $$ \text{RMSE} = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}(r_{ui} - \hat{r}_{ui})^2} $$
  2. MAE (Mean Absolute Error) : 평균 절대 오차 $$ \text{MAE} = \frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}|r_{ui} - \hat{r}_{ui}| $$

알고리즘 평가

import surprise
from surprise import Dataset
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
sim_options = {'name': 'msd'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
sim_options = {'name': 'cosine'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
sim_options = {'name': 'pearson'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
sim_options = {'name': 'pearson_baseline'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
algo = surprise.SVD()
surprise.evaluate(algo, data)