Notes on "Ten lessons learned from building machine learning systems", a blog post written by someone inside Netflix

TechnoCalifornia: Ten Lessons Learned from Building (real-life impactful) Machine Learning Systems

The post is a little old, but here are rough notes on the key points.

Ten lessons

More data and better models
- Focusing on only one of the two does not work
  - Your problem may need additional data, or it may need a better model
You might not need all your big data
- What matters is which subset of the full data you use
- More data does not necessarily mean higher accuracy
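
One possible way to pick a subset, sketched here as an assumption rather than anything the original post prescribes, is stratified subsampling: cap the number of rows per group so that a dominant group does not swamp the rest. The function name `stratified_subsample`, the `(segment, item_id)` row shape, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(rows, key, per_group, seed=0):
    """Keep at most `per_group` randomly chosen rows for each value of key(row)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sample = []
    for label in sorted(groups):  # sorted for a deterministic group order
        members = groups[label]
        k = min(per_group, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical, heavily skewed toy data: (segment, item_id) pairs.
rows = [("heavy", i) for i in range(1000)] + [("light", i) for i in range(20)]
subset = stratified_subsample(rows, key=lambda r: r[0], per_group=20)
```

Whether a subset like this beats the full data has to be checked empirically on the actual problem.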
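The fact that a more complex model does not improve things does not mean you don't need one
- Seeing no gain from a more complex model does not mean that a complex model is unnecessary
- Complex, high-dimensional features need a complex model, and vice versa
- It is important to improve features and models in parallel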
Be thoughtful about how you define your training/testing data sets
- Data sets have to be handled with care
  - Assigning positive and negative labels is not easy
Learn to deal with (the curse of) presentation bias
- Design algorithms with presentation bias taken into account
The UI is the only communication channel between the Algorithm and what matters most: the Users
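- The importance of presentation (as the output of the machine learning algorithm) and of the user interface
  - The machine learning algorithm and the UI are closely linked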
Data and Models are great. You know what is even better? The right evaluation approach
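- Possibly the most important lesson
- Without the right evaluation approach, data, models, and infrastructure are meaningless
- Offline experimentation and online experimentation
  - Offline experimentation is the familiar evaluation step for machine learning algorithms
  - The most useful form of online experimentation is A/B testing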
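
As a rough illustration of the A/B testing step, here is a minimal sketch of a two-proportion z-test on conversion counts. This is a standard textbook test chosen for illustration, not necessarily what the original post or Netflix uses, and the counts are made-up numbers.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rate B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: control A vs. treatment B, 1000 users each.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
```

A small p-value only says the difference is unlikely to be noise; which metric to test is a separate decision.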
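Distributing algorithms? Yes, but at what level?
- Sooner or later you will want to distribute the processing, but the question is at what level
- Three levels
  - Lv.1. Per independent subset of the full data
  - Lv.2. Per hyperparameter combination
  - Lv.3. Per unit of model training over the training data samples
- An example of distributed processing on AWS is published on the Netflix Techblog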
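
To make Lv.2 concrete, here is a minimal sketch that evaluates each hyperparameter combination as an independent task. The `evaluate` function is a hypothetical stand-in for training and scoring one model, and a local thread pool stands in for whatever cluster you would really use.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for training and scoring one model."""
    learning_rate, depth = params
    # Toy score that peaks at learning_rate=0.1, depth=4.
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 4) ** 2


def parallel_grid_search(grid, max_workers=4):
    """Score every combination in `grid` concurrently and return the best one."""
    combos = list(product(*grid.values()))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each combination is independent, so they can all run in parallel.
        scores = list(pool.map(evaluate, combos))
    best_score, best_combo = max(zip(scores, combos))
    return dict(zip(grid.keys(), best_combo)), best_score


grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_params, best_score = parallel_grid_search(grid)
```

For CPU-bound training in pure Python, a process pool or separate machines would be the more realistic choice; threads are used here only to keep the sketch short.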
It pays off to be smart about your hyperparameters
- A mechanism for running hyperparameter tuning repeatedly is important
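
One simple form such a mechanism could take is a seeded random search that records every trial, so a tuning run can be repeated and compared later. This is only an assumed example: the notes do not say which tuning method the post recommends, and `toy_objective` and the `regularization` range are hypothetical.

```python
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample `n_trials` parameter sets uniformly from `space` and keep the best."""
    rng = random.Random(seed)  # fixed seed makes the whole run repeatable
    history = []
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        history.append((objective(params), params))
    best_score, best_params = max(history, key=lambda trial: trial[0])
    return best_params, best_score, history


def toy_objective(params):
    """Hypothetical validation score, best at regularization=0.3."""
    return -((params["regularization"] - 0.3) ** 2)


space = {"regularization": (0.0, 1.0)}
best_params, best_score, history = random_search(toy_objective, space, n_trials=50)
```

Keeping `history` around is what makes it cheap to rerun with more trials or a narrower range.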
There are things you can do offline and there are things you can’t…and there is Nearline for everything in between
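- This one is about system architecture
  - Three layers: Offline, Online, and Nearline
  - Nearline: sits between Online and Offline (a coined term?)
- Thinking about latency helps break down which layer each ML algorithm should run in
- Published on the Netflix Techblog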
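
As a toy illustration of that latency-driven breakdown, the sketch below picks a layer from a latency budget. The rule, the function name `choose_layer`, and the millisecond figures are all assumptions made for illustration; the real architecture is described on the Netflix Techblog.

```python
def choose_layer(compute_ms, latency_budget_ms, needs_recent_events):
    """Pick a layer for a computation. Purely illustrative decision rule."""
    if compute_ms <= latency_budget_ms:
        # Fast enough to run inside the request itself.
        return "online"
    if needs_recent_events:
        # Too slow for the request, but should react to recent user events,
        # so run it asynchronously shortly after the event.
        return "nearline"
    # Neither latency-bound nor event-driven: run it as a batch job.
    return "offline"


# Hypothetical workloads with made-up timings.
ranking = choose_layer(compute_ms=20, latency_budget_ms=100, needs_recent_events=True)
refresh = choose_layer(compute_ms=5_000, latency_budget_ms=100, needs_recent_events=True)
retrain = choose_layer(compute_ms=3_600_000, latency_budget_ms=100, needs_recent_events=False)
```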