The SVD Model in Python's surprise Library

7. The SVD Model in Python's surprise Library

  • The prediction $\hat{r}_{ui}$ is set as:

$\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u$

If user $u$ is unknown, then the bias $b_u$ and the factors $p_u$ are assumed to be zero. The same applies for item $i$ with $b_i$ and $q_i$.

  • Minimization is done with SGD, where $e_{ui} = r_{ui} - \hat{r}_{ui}$ (a minimal sketch of one update step follows the parameter list below): $$\begin{split}b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)\\ b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)\\ p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u)\\ q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i)\end{split}$$
  • Key parameters
    • n_factors – The number of factors. Default is 100.
    • n_epochs – The number of iterations of the SGD procedure. Default is 20.
    • lr_all – The learning rate for all parameters. Default is 0.005.
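
As a rough illustration of the update rules above, here is a minimal NumPy sketch of a single SGD step for one observed rating. It is not the surprise implementation: the state variables (mu, b_u, b_i, p_u, q_i) and the regularization value lam are illustrative, and gamma matches the documented lr_all default of 0.005.

import numpy as np

# Toy state for a single (user, item) pair: global mean, biases, latent factors.
mu = 3.5
b_u, b_i = 0.1, -0.05
p_u = np.random.normal(0, 0.1, 100)   # user factors (n_factors=100)
q_i = np.random.normal(0, 0.1, 100)   # item factors

gamma, lam = 0.005, 0.02              # learning rate (lr_all) and an illustrative regularization
r_ui = 4.0                            # observed rating

# Prediction and error for this pair.
r_hat = mu + b_u + b_i + q_i.dot(p_u)
e_ui = r_ui - r_hat

# One SGD update, following the rules above (factor updates use the old values).
p_u_old = p_u.copy()
b_u += gamma * (e_ui - lam * b_u)
b_i += gamma * (e_ui - lam * b_i)
p_u += gamma * (e_ui * q_i - lam * p_u)
q_i += gamma * (e_ui * p_u_old - lam * q_i)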

7.1. A Practical Example Based on the MovieLens Data

Loading the MovieLens data

In [2]:
import os
import pandas as pd
from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')
In [3]:
df = pd.DataFrame(data.raw_ratings, columns=["user", "item", "rate", "id"])
In [4]:
df.head()
Out[4]:
  user item  rate         id
0  196  242   3.0  881250949
1  186  302   3.0  891717742
2   22  377   1.0  878887116
3  244   51   2.0  880606923
4  166  346   1.0  886397596
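
As a quick sanity check on the load, the raw ratings can be summarized directly from this DataFrame (ml-100k contains 100,000 ratings from 943 users on 1,682 items, on a 1-5 scale):

print(len(df))               # 100000 ratings
print(df['user'].nunique())  # 943 distinct users
print(df['item'].nunique())  # 1682 distinct items
print(df['rate'].min(), df['rate'].max())  # ratings range from 1.0 to 5.0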

Training + saving the model

In [5]:
import os

from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = SVD()
algo.train(trainset)

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Dump algorithm and reload it.
file_name = os.path.expanduser('dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)

# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')
Predictions are the same
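
Once reloaded, the model can also score a single (user, item) pair directly via predict(); the raw ids below are taken from the first row of df.head() above (user 196 actually rated item 242 with 3.0):

# Raw ids in ml-100k are strings.
pred = loaded_algo.predict('196', '242')
print(pred.est)  # estimated rating of user 196 for item 242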

Computing accuracy

In [6]:
from surprise import Dataset
from surprise import SVD
from surprise import accuracy


data = Dataset.load_builtin('ml-100k')

algo = SVD()

trainset = data.build_full_trainset()
algo.train(trainset)

testset = trainset.build_testset()
predictions = algo.test(testset)

accuracy.rmse(predictions)
RMSE: 0.6763
Out[6]:
0.6763421071136434
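
Note that this RMSE is measured on the same data the model was trained on, so it is optimistic; the 3-fold cross-validation further below lands around 0.95. The accuracy module also provides MAE, computed from the same predictions:

accuracy.mae(predictions)  # mean absolute error on the (training) predictions above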

Model optimization (parameter tuning)

  • The GridSearch class in surprise
    • Key parameters
      • algo_class (AlgoBase) – A class object of the algorithm to evaluate.
      • param_grid (dict) – The dictionary has algo_class parameters as keys (strings) and lists of parameter values to try as values. All combinations will be evaluated with the desired algorithm.
      • measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
      • verbose (int) – Level of verbosity. If 0, nothing is printed. If 1, accuracy measures for each parameter combination are printed, with combination values. If 2, fold accuracy values are also printed. Default is 1.
In [7]:
import random

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import GridSearch


# Load the full dataset.
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)

# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = GridSearch(SVD, param_grid, measures=['RMSE'], verbose=1)
print(grid_search.best_params)  # still empty at this point; populated only by evaluate()
grid_search.evaluate(data)
Grid Search...
defaultdict(<class 'list'>, {})
Running grid search for the following parameter combinations:
{'n_epochs': 5, 'lr_all': 0.002}
{'n_epochs': 5, 'lr_all': 0.005}
{'n_epochs': 10, 'lr_all': 0.002}
{'n_epochs': 10, 'lr_all': 0.005}
Results:
{'n_epochs': 5, 'lr_all': 0.002}
{'RMSE': 0.98969008899623423}
----------
{'n_epochs': 5, 'lr_all': 0.005}
{'RMSE': 0.96388170867799128}
----------
{'n_epochs': 10, 'lr_all': 0.002}
{'RMSE': 0.96890656349975923}
----------
{'n_epochs': 10, 'lr_all': 0.005}
{'RMSE': 0.95219313886797485}
----------
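
The empty defaultdict printed above is expected: best_params is only filled in once evaluate() has run. Afterwards the winning configuration can be read back (a sketch using the same old-style GridSearch attributes, keyed by measure name):

# Best score and the parameter combination that produced it.
print(grid_search.best_score['RMSE'])   # about 0.952 in the run above
print(grid_search.best_params['RMSE'])  # {'n_epochs': 10, 'lr_all': 0.005}
# grid_search.best_estimator['RMSE'] holds an SVD instance configured with these parameters.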

Example of per-user movie recommendations - an offline scenario: build and store a separate recommendation table, then serve it on the web when the user logs in

In [ ]:
from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.train(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
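
For the web-facing table, the raw item ids are usually joined back to movie titles. Below is a sketch using the ml-100k u.item file; the cache path is an assumption about where load_builtin() stores the data, so adjust it to your setup:

import os
import pandas as pd

# Assumed cache location of the built-in ml-100k files; change if yours differs.
item_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.item')

# u.item is pipe-separated: column 0 is the movie id, column 1 is the title.
items = pd.read_csv(item_path, sep='|', encoding='ISO-8859-1', header=None)
id_to_title = dict(zip(items[0].astype(str), items[1]))

# Recommended titles for one example user (raw ids are strings).
uid = '196'
print([id_to_title[iid] for (iid, _) in top_n[uid]])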

Evaluating recommendation performance

  1. RMSE (Root Mean Squared Error): $$ \text{RMSE} = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}(r_{ui} - \hat{r}_{ui})^2} $$
  2. MAE (Mean Absolute Error): $$ \text{MAE} = \frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}|r_{ui} - \hat{r}_{ui}| $$ (a hand computation is sketched below)
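
Both measures can be computed by hand from a list of Prediction objects returned by algo.test() on a test set with known ratings (for instance the one used in the accuracy section above); r_ui is the true rating and est the estimate:

import math

errors = [pred.r_ui - pred.est for pred in predictions]
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)
print(rmse, mae)  # should match accuracy.rmse() and accuracy.mae()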

Algorithm evaluation

In [17]:
import surprise
from surprise import Dataset
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
In [23]:
sim_options = {'name': 'msd'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9911
MAE:  0.7833
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9833
MAE:  0.7771
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9894
MAE:  0.7819
------------
------------
Mean RMSE: 0.9879
Mean MAE : 0.7808
------------
------------
Out[23]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.78332085007914287,
                             0.7771041490260826,
                             0.78185859406389302],
                            'rmse': [0.99109548719657858,
                             0.98332811659672703,
                             0.9893776110540401]})
In [24]:
sim_options = {'name': 'cosine'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0244
MAE:  0.8108
------------
Fold 2
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0187
MAE:  0.8061
------------
Fold 3
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0226
MAE:  0.8094
------------
------------
Mean RMSE: 1.0219
Mean MAE : 0.8088
------------
------------
Out[24]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.81079510431791924,
                             0.80605631128339117,
                             0.80941768920884594],
                            'rmse': [1.0243634073960175,
                             1.0187482414191331,
                             1.0225720777877443]})
In [25]:
sim_options = {'name': 'pearson'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0239
MAE:  0.8120
------------
Fold 2
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0167
MAE:  0.8083
------------
Fold 3
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0214
MAE:  0.8099
------------
------------
Mean RMSE: 1.0207
Mean MAE : 0.8101
------------
------------
Out[25]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.81200632767328063,
                             0.80826931547507119,
                             0.80989597847789219],
                            'rmse': [1.0239246403571562,
                             1.0167128227144424,
                             1.0213664721488831]})
In [58]:
sim_options = {'name': 'pearson_baseline'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0049
MAE:  0.7968
------------
Fold 2
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0103
MAE:  0.7985
------------
Fold 3
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0110
MAE:  0.8002
------------
------------
Mean RMSE: 1.0087
Mean MAE : 0.7985
------------
------------
Out[58]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.79675695245350209,
                             0.7984790061210334,
                             0.8002162404915526],
                            'rmse': [1.0048624637086425,
                             1.0102925000197331,
                             1.0110090350939811]})
In [26]:
algo = surprise.SVD()
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9498
MAE:  0.7498
------------
Fold 2
RMSE: 0.9458
MAE:  0.7463
------------
Fold 3
RMSE: 0.9446
MAE:  0.7453
------------
------------
Mean RMSE: 0.9467
Mean MAE : 0.7471
------------
------------
Out[26]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.749794336106132,
                             0.7462662006765739,
                             0.74533450362095799],
                            'rmse': [0.94976543487583887,
                             0.94580670524240762,
                             0.94459061787140985]})
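
The runs above can also be collected programmatically. Here is a sketch that loops over the same configurations on the 3-fold split defined earlier, assuming the deprecated evaluate() call used above accepts measures and verbose arguments and returns per-fold score lists as shown:

import numpy as np
import surprise

algorithms = {
    'KNNBasic (msd)': surprise.KNNBasic(sim_options={'name': 'msd'}),
    'KNNBasic (cosine)': surprise.KNNBasic(sim_options={'name': 'cosine'}),
    'KNNBasic (pearson)': surprise.KNNBasic(sim_options={'name': 'pearson'}),
    'KNNBasic (pearson_baseline)': surprise.KNNBasic(sim_options={'name': 'pearson_baseline'}),
    'SVD': surprise.SVD(),
}

for name, algo in algorithms.items():
    perf = surprise.evaluate(algo, data, measures=['RMSE', 'MAE'], verbose=0)
    print(name, np.mean(perf['rmse']), np.mean(perf['mae']))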