The SVD Model in the Python surprise Library

7. The SVD Model in the Python surprise Library

The prediction $\hat{r}_{ui}$ is set as:

$\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u$

If user $u$ is unknown, then the bias $b_u$ and the factors $p_u$ are assumed to be zero. The same applies for item $i$ with $b_i$ and $q_i$.
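
As a minimal sketch of the prediction rule above, assuming the learned parameters are held in plain Python dictionaries (the function `predict`, the variable names, and the toy numbers are illustrative and not part of surprise):

```python
def predict(u, i, mu, bu, bi, pu, qi):
    """Estimate r_ui = mu + b_u + b_i + q_i . p_u.

    Unknown users or items contribute zero bias and zero factors,
    mirroring the rule stated above.
    """
    est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
    if u in pu and i in qi:
        est += sum(q * p for q, p in zip(qi[i], pu[u]))
    return est


# Toy parameters, made up for illustration only.
mu = 3.5
bu = {"alice": 0.5}
bi = {"matrix": -0.25}
pu = {"alice": [1.0, 0.5]}
qi = {"matrix": [0.5, 1.0]}

known = predict("alice", "matrix", mu, bu, bi, pu, qi)
unknown_user = predict("bob", "matrix", mu, bu, bi, pu, qi)
print(known, unknown_user)
```

For the unknown user, only the global mean and the item bias remain in the estimate.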

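The regularized squared error is minimized with stochastic gradient descent (SGD). With $e_{ui} = r_{ui} - \hat{r}_{ui}$, learning rate $\gamma$ and regularization term $\lambda$, the updates are: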

$$\begin{split}b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)\\ b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)\\ p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u)\\ q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i)\end{split}$$
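
The update rules can be sketched for a single observed rating as follows. This is an illustration under stated assumptions, not the library's implementation: `sgd_step` is a hypothetical helper, `lr` stands for $\gamma$, `reg` stands for $\lambda$, and the default values are chosen only for the example.

```python
def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """Apply one SGD update for a single observed rating r_ui."""
    dot = sum(q * p for q, p in zip(q_i, p_u))
    err = r_ui - (mu + b_u + b_i + dot)  # e_ui = true rating - estimate

    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates read the values from before this step.
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i, err


# Toy starting point: global mean 3.5, zero biases, small factors.
b_u, b_i, p_u, q_i, err = sgd_step(4.0, 3.5, 0.0, 0.0, [0.1, 0.1], [0.1, 0.1])
print(err, b_u, p_u)
```

Repeating this step over every rating in the training set, for `n_epochs` passes, is the overall shape of the training loop.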

Main parameters
- n_factors – The number of factors. Default is 100.
- n_epochs – The number of iterations of the SGD procedure. Default is 20.
- lr_all – The learning rate for all parameters. Default is 0.005.

7.1. A practical example based on the MovieLens data

Loading the MovieLens data

In [2]:

import os
import pandas as pd
from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')

In [3]:

# Each raw rating is a tuple of (user id, item id, rating, timestamp).
df = pd.DataFrame(data.raw_ratings, columns=["user", "item", "rate", "timestamp"])

In [4]:

df.head()

Out[4]:

	user	item	rate	timestamp
0	196	242	3.0	881250949
1	186	302	3.0	891717742
2	22	377	1.0	878887116
3	244	51	2.0	880606923
4	166	346	1.0	886397596

Training and saving the model

In [5]:

import os

from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = SVD()
algo.fit(trainset)  # fit() replaces the older train() method

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Dump algorithm and reload it.
file_name = os.path.expanduser('dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)

# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')

Predictions are the same

Computing accuracy

In [6]:

from surprise import Dataset
from surprise import SVD
from surprise import accuracy


data = Dataset.load_builtin('ml-100k')

algo = SVD()

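trainset = data.build_full_trainset()
algo.fit(trainset)  # fit() replaces the older train() method

# The test set is rebuilt from the training ratings, so the RMSE below is a
# training error and is lower than a cross-validated score would be.
testset = trainset.build_testset()
predictions = algo.test(testset)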

accuracy.rmse(predictions)

RMSE: 0.6763

Out[6]:

0.6763421071136434
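
What the RMSE measures can be shown with a small hand-rolled version. This is a sketch with a hypothetical helper `rmse` and made-up rating pairs; it is not the code behind `accuracy.rmse`.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    squared = [(true_r - est) ** 2 for true_r, est in pairs]
    return math.sqrt(sum(squared) / len(squared))


# Toy values, made up for illustration.
pairs = [(4.0, 3.0), (2.0, 3.0), (5.0, 5.0), (1.0, 1.0)]
score = rmse(pairs)
print(round(score, 4))
```

Two of the four estimates are off by one star and two are exact, so the mean squared error is 0.5 and the RMSE is its square root.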

Model optimization (parameter tuning)

The GridSearch class in surprise
- Main parameters
  - algo_class (AlgoBase) – A class object of the algorithm to evaluate.
  - param_grid (dict) – The dictionary has algo_class parameters as keys (string) and lists of parameter values to try as values. All combinations will be evaluated with the desired algorithm.
  - measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
  - verbose (int) – Level of verbosity. If 0, nothing is printed. If 1, accuracy measures for each parameter combination are printed, with combination values. If 2, fold accuracy values are also printed. Default is 1.
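
To make "all combinations" concrete, the expansion of a `param_grid` can be sketched with the standard library. `expand_grid` is a hypothetical helper written for this illustration, not a surprise function.

```python
from itertools import product


def expand_grid(param_grid):
    """List every parameter combination in param_grid as a dict."""
    keys = list(param_grid)
    value_lists = [param_grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
combos = expand_grid(param_grid)
for combo in combos:
    print(combo)
```

Two values for each of two parameters give four combinations, and each one is trained and scored on every fold.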

In [7]:

import random

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import GridSearch


# Load the full dataset.
data = Dataset.load_builtin('ml-100k')
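# data.split() and GridSearch belong to the older Surprise API; newer releases
# provide surprise.model_selection.GridSearchCV for the same purpose.
data.split(n_folds=3)

# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = GridSearch(SVD, param_grid, measures=['RMSE'], verbose=1)
# best_params is still empty here because evaluate() has not run yet, which
# is why an empty defaultdict is printed; read it again after evaluate().
print(grid_search.best_params)
grid_search.evaluate(data)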

Grid Search...
defaultdict(<class 'list'>, {})
Running grid search for the following parameter combinations:
{'n_epochs': 5, 'lr_all': 0.002}
{'n_epochs': 5, 'lr_all': 0.005}
{'n_epochs': 10, 'lr_all': 0.002}
{'n_epochs': 10, 'lr_all': 0.005}
Resulsts:
{'n_epochs': 5, 'lr_all': 0.002}
{'RMSE': 0.98969008899623423}
----------
{'n_epochs': 5, 'lr_all': 0.005}
{'RMSE': 0.96388170867799128}
----------
{'n_epochs': 10, 'lr_all': 0.002}
{'RMSE': 0.96890656349975923}
----------
{'n_epochs': 10, 'lr_all': 0.005}
{'RMSE': 0.95219313886797485}
----------

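Per-user movie recommendation example: an offline scenario in which recommendations are precomputed and stored in a separate table, then served on the web when the user logs in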

In [ ]:

from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)  # fit() replaces the older train() method

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
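
The offline scenario in the heading, storing the precomputed lists in a table and reading them back at login, could be sketched with the standard `sqlite3` module. The table name `recommendations`, its columns, and the helpers `save_top_n` and `load_recommendations` are assumptions made for this example; the toy dictionary only mimics the shape returned by `get_top_n()`.

```python
import sqlite3


def save_top_n(conn, top_n):
    """Store {user: [(item, estimate), ...]} in a recommendations table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recommendations ("
        "user_id TEXT, rec_rank INTEGER, item_id TEXT, estimate REAL, "
        "PRIMARY KEY (user_id, rec_rank))"
    )
    rows = [
        (uid, rec_rank, iid, est)
        for uid, user_ratings in top_n.items()
        for rec_rank, (iid, est) in enumerate(user_ratings, start=1)
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO recommendations VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def load_recommendations(conn, uid):
    """Return the stored item ids for one user, best first."""
    cur = conn.execute(
        "SELECT item_id FROM recommendations "
        "WHERE user_id = ? ORDER BY rec_rank",
        (uid,),
    )
    return [row[0] for row in cur]


# Toy data shaped like the output of get_top_n(); the values are made up.
top_n = {"196": [("242", 4.8), ("302", 4.5)], "186": [("377", 4.1)]}
conn = sqlite3.connect(":memory:")
save_top_n(conn, top_n)
recs = load_recommendations(conn, "196")
print(recs)
```

The batch job would call `save_top_n` after each retraining, while the web application only runs the cheap lookup at login time.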

Evaluating algorithms

In [17]:

import surprise
from surprise import Dataset
data = Dataset.load_builtin('ml-100k')
# Split the ratings into 3 folds for cross-validation (older Surprise API).
data.split(n_folds=3)
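
What a 3-fold split means can be sketched independently of the library: every rating lands in the test set of exactly one fold and in the training set of the others. `k_fold_indices` is a hypothetical helper, and how it shuffles and slices is an assumption of this sketch rather than a description of surprise internals.

```python
import random


def k_fold_indices(n_items, n_folds=3, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]  # every n_folds-th shuffled index
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx


folds = list(k_fold_indices(10, n_folds=3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```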

In [23]:

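# surprise.evaluate() is part of the older Surprise API; newer releases provide
# surprise.model_selection.cross_validate for the same kind of evaluation.
sim_options = {'name': 'msd'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)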

Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9911
MAE:  0.7833
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9833
MAE:  0.7771
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9894
MAE:  0.7819
------------
------------
Mean RMSE: 0.9879
Mean MAE : 0.7808
------------
------------

Out[23]:

CaseInsensitiveDefaultDict(list,
                           {'mae': [0.78332085007914287,
                             0.7771041490260826,
                             0.78185859406389302],
                            'rmse': [0.99109548719657858,
                             0.98332811659672703,
                             0.9893776110540401]})

In [24]:

sim_options = {'name': 'cosine'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)

Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0244
MAE:  0.8108
------------
Fold 2
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0187
MAE:  0.8061
------------
Fold 3
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0226
MAE:  0.8094
------------
------------
Mean RMSE: 1.0219
Mean MAE : 0.8088
------------
------------

Out[24]:

CaseInsensitiveDefaultDict(list,
                           {'mae': [0.81079510431791924,
                             0.80605631128339117,
                             0.80941768920884594],
                            'rmse': [1.0243634073960175,
                             1.0187482414191331,
                             1.0225720777877443]})

In [25]:

sim_options = {'name': 'pearson'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)

Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0239
MAE:  0.8120
------------
Fold 2
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0167
MAE:  0.8083
------------
Fold 3
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0214
MAE:  0.8099
------------
------------
Mean RMSE: 1.0207
Mean MAE : 0.8101
------------
------------

Out[25]:

CaseInsensitiveDefaultDict(list,
                           {'mae': [0.81200632767328063,
                             0.80826931547507119,
                             0.80989597847789219],
                            'rmse': [1.0239246403571562,
                             1.0167128227144424,
                             1.0213664721488831]})

In [58]:

sim_options = {'name': 'pearson_baseline'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)

Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0049
MAE:  0.7968
------------
Fold 2
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0103
MAE:  0.7985
------------
Fold 3
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0110
MAE:  0.8002
------------
------------
Mean RMSE: 1.0087
Mean MAE : 0.7985
------------
------------

Out[58]:

CaseInsensitiveDefaultDict(list,
                           {'mae': [0.79675695245350209,
                             0.7984790061210334,
                             0.8002162404915526],
                            'rmse': [1.0048624637086425,
                             1.0102925000197331,
                             1.0110090350939811]})
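
The KNNBasic runs above differ only in the `name` entry of `sim_options`. As a rough sketch of what three of those measures compute between two users, here are plain-Python versions restricted to co-rated items. The helper names, the toy ratings, and the choice to take Pearson means over the co-rated items are assumptions of this example; `pearson_baseline` is not covered, and at least one co-rated item is assumed.

```python
import math


def common_ratings(ratings_u, ratings_v):
    """Pair up the ratings two users gave to the items both have rated."""
    shared = sorted(set(ratings_u) & set(ratings_v))
    return [(ratings_u[i], ratings_v[i]) for i in shared]


def msd_sim(pairs):
    """Mean squared difference turned into a similarity: 1 / (msd + 1)."""
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)


def cosine_sim(pairs):
    """Cosine of the angle between the two co-rated rating vectors."""
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson_sim(pairs):
    """Cosine similarity after mean-centering each user's co-rated ratings."""
    mean_u = sum(a for a, _ in pairs) / len(pairs)
    mean_v = sum(b for _, b in pairs) / len(pairs)
    return cosine_sim([(a - mean_u, b - mean_v) for a, b in pairs])


# Toy ratings keyed by item id; the values are made up.
u = {"242": 5.0, "302": 3.0, "377": 1.0}
v = {"242": 4.0, "302": 2.0, "377": 2.0, "51": 5.0}
pairs = common_ratings(u, v)
print(msd_sim(pairs), cosine_sim(pairs), pearson_sim(pairs))
```

Cosine similarity works on raw ratings, so it stays high whenever both users rate on the same part of the scale, whereas Pearson first removes each user's mean.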

In [26]:

algo = surprise.SVD()
surprise.evaluate(algo, data)

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9498
MAE:  0.7498
------------
Fold 2
RMSE: 0.9458
MAE:  0.7463
------------
Fold 3
RMSE: 0.9446
MAE:  0.7453
------------
------------
Mean RMSE: 0.9467
Mean MAE : 0.7471
------------
------------

Out[26]:

CaseInsensitiveDefaultDict(list,
                           {'mae': [0.749794336106132,
                             0.7462662006765739,
                             0.74533450362095799],
                            'rmse': [0.94976543487583887,
                             0.94580670524240762,
                             0.94459061787140985]})
