The SVD model in the Python surprise library
7. The SVD model in the Python surprise library
- The prediction $\hat{r}_{ui}$ is set as:
$\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u$
If user $u$ is unknown, then the bias $b_u$ and the factors $p_u$ are assumed to be zero. The same applies for item $i$ with $b_i$ and $q_i$.
- The model is trained by minimizing the regularized squared error $\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2\right)$ with SGD (stochastic gradient descent).
- Main parameters (a short sketch follows this list)
- n_factors – The number of factors. Default is 100.
- n_epochs – The number of iterations of the SGD procedure. Default is 20.
- lr_all – The learning rate for all parameters. Default is 0.005.
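To make these concrete, the sketch below simply instantiates SVD with the main hyperparameters written out; the values are the documented defaults rather than tuned settings, and reg_all (the regularization strength, default 0.02) is included only for completeness:
from surprise import SVD

# A minimal sketch: SVD with its main hyperparameters spelled out.
# The values are the documented defaults, not tuned settings.
algo = SVD(
    n_factors=100,  # number of latent factors
    n_epochs=20,    # number of SGD iterations
    lr_all=0.005,   # learning rate applied to all parameters
    reg_all=0.02,   # regularization strength applied to all parameters
)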
7.1. A hands-on example with the MovieLens data
Loading the MovieLens data
In [2]:
import os
import pandas as pd
from surprise import SVD
from surprise import Dataset
from surprise import dump
# Load the built-in MovieLens 100k dataset (downloaded on first use).
data = Dataset.load_builtin('ml-100k')
In [3]:
# Each entry of data.raw_ratings is a (raw user id, raw item id, rating, timestamp) tuple.
df = pd.DataFrame(data.raw_ratings, columns=["user", "item", "rating", "timestamp"])
In [4]:
df.head()
Out[4]:
Training + saving the model
In [5]:
import os
from surprise import SVD
from surprise import Dataset
from surprise import dump
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
# train() is the fitting method in older Surprise releases; newer versions use algo.fit(trainset).
algo.train(trainset)
# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())
# Dump algorithm and reload it.
file_name = os.path.expanduser('dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)
# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')
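The reloaded model can also be queried for a single user/item pair. A minimal sketch, where the raw ids '196' and '302' are simply example values taken from ml-100k:
# Predict the rating user '196' would give to item '302' (raw ids are strings in ml-100k).
pred = loaded_algo.predict(uid='196', iid='302')
print(pred.est)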
Calculating accuracy
In [6]:
from surprise import Dataset
from surprise import SVD
from surprise import accuracy
data = Dataset.load_builtin('ml-100k')
algo = SVD()
trainset = data.build_full_trainset()
algo.train(trainset)
# build_testset() re-uses the known (training) ratings, so the accuracy measured below
# is a training error rather than performance on unseen data.
testset = trainset.build_testset()
predictions = algo.test(testset)
accuracy.rmse(predictions)
Out[6]:
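MAE can be computed from the same predictions in exactly the same way; a one-line sketch:
# Mean Absolute Error on the same (training-set) predictions.
accuracy.mae(predictions)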
Model optimization (parameter tuning)
- surprise's GridSearch class
- Main parameters
- algo_class (AlgoBase) – The class object of the algorithm to evaluate.
- param_grid (dict) – A dictionary with algo_class parameter names as keys (strings) and lists of values to try as values. Every combination is evaluated with the chosen algorithm.
- measures (list of string) – The performance measures to compute. Allowed names are the function names defined in the accuracy module. Default is ['rmse', 'mae'].
- verbose (int) – Level of verbosity. If 0, nothing is printed. If 1, the accuracy measures for each parameter combination are printed along with the combination values. If 2, per-fold accuracy values are also printed. Default is 1.
In [7]:
import random
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import GridSearch
# Load the full dataset.
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = GridSearch(SVD, param_grid, measures=['RMSE'], verbose=1)
grid_search.evaluate(data)
# best_score / best_params are only populated once evaluate() has run.
print(grid_search.best_score['RMSE'])
print(grid_search.best_params['RMSE'])
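For reference, later Surprise releases replaced GridSearch with GridSearchCV in surprise.model_selection. A rough equivalent of the cell above under that newer API (a sketch, not part of the original example):
from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])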
Per-user movie recommendation example - build and store a separate table offline, then serve recommendations on the web when that user logs in
In [ ]:
from collections import defaultdict
from surprise import SVD
from surprise import Dataset
def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.train(trainset)
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
top_n = get_top_n(predictions, n=10)
# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
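To match the offline scenario in the section title, the top-N lists can be flattened into a table and stored, so that the web application only needs a lookup by user id at login. A minimal sketch using pandas; the file name recommendations.csv and the column names are just example choices:
import pandas as pd

# Flatten the per-user top-N lists into one table: (user, item, estimated rating, rank).
rows = []
for uid, user_ratings in top_n.items():
    for rank, (iid, est) in enumerate(user_ratings, start=1):
        rows.append((uid, iid, est, rank))

rec_df = pd.DataFrame(rows, columns=['user', 'item', 'est_rating', 'rank'])
rec_df.to_csv('recommendations.csv', index=False)  # look up rows by user id at serving time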
Evaluating recommendation performance
- RMSE (Root Mean Squared Error): root of the mean squared prediction error (defined below)
- MAE (Mean Absolute Error): mean of the absolute prediction errors (defined below)
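Writing $r_{ui}$ for a true rating and $\hat{r}_{ui}$ for its prediction (same notation as the SVD section above), and $\hat{R}$ for the set of predictions being scored, these are the standard definitions used by surprise's accuracy module:
$\text{RMSE} = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}} (r_{ui} - \hat{r}_{ui})^2}$
$\text{MAE} = \frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}} |r_{ui} - \hat{r}_{ui}|$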
Evaluating algorithms
In [17]:
import surprise
from surprise import Dataset
data = Dataset.load_builtin('ml-100k')
# Split the ratings into 3 folds; surprise.evaluate() below reports accuracy per fold.
data.split(n_folds=3)
In [23]:
sim_options = {'name': 'msd'}  # Mean Squared Difference similarity (the KNNBasic default)
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Out[23]:
In [24]:
sim_options = {'name': 'cosine'}  # cosine similarity
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Out[24]:
In [25]:
sim_options = {'name': 'pearson'}  # Pearson correlation coefficient
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Out[25]:
In [58]:
sim_options = {'name': 'pearson_baseline'}  # Pearson correlation computed against baseline estimates rather than means
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Out[58]:
In [26]:
algo = surprise.SVD()
surprise.evaluate(algo, data)
Out[26]:
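The cells above can be condensed into a single loop that evaluates each similarity measure and then SVD on the same folds. A minimal sketch using the same old-style surprise.evaluate API as the rest of this notebook:
import surprise
from surprise import Dataset

data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)

# Evaluate KNNBasic with each similarity measure, then SVD, on the same 3 folds.
for name in ['msd', 'cosine', 'pearson', 'pearson_baseline']:
    algo = surprise.KNNBasic(sim_options={'name': name})
    surprise.evaluate(algo, data, measures=['RMSE', 'MAE'])

surprise.evaluate(surprise.SVD(), data, measures=['RMSE', 'MAE'])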