Try word2vec in 5 minutes

Task

We'll try a text classification problem using word2vec: given a piece of text, is it a question or an answer?

Steps

Get the dataset
Get the pre-trained model
Load the dataset and the pre-trained model
Train
Evaluate

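1. Get the dataset

First we need a dataset. We'll use the Amazon question/answer data that a research group at UCSD (University of California, San Diego) released at WWW2016.

Besides the question and answer text, each JSON record also contains fields such as the question type (yes/no or open-ended), but here we only use the question text (question) and the answer text (answer).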

{'questionType': 'yes/no', 'asin': 'B00004U9JP', 'answerTime': 'Jun 27, 2014', 'unixTime': 1403852400, 'question': 'I have a 9 year old Badger 1 that needs replacing, will this Badger 1 install just like the original one?', 'answerType': '?', 'answer': 'I replaced my old one with this without a hitch.'}

2. Get the pre-trained model

We'll use a model trained on the Google News dataset. It contains 300-dimensional vectors for 3 million words and is about 3.4 GB, so the download takes a while. The page below has a link to GoogleNews-vectors-negative300.bin.gz.

https://code.google.com/archive/p/word2vec/
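
As a rough sanity check on that size, the figure is close to what the raw vectors alone would occupy. The sketch below assumes each vector component is stored as a 4-byte float32; the vocabulary strings add a little on top.

```python
# Back-of-the-envelope size of the vector table.
# Assumption: each vector component is a 4-byte float32.
vocab_size = 3_000_000   # words in the model
dim = 300                # dimensions per word vector
bytes_per_float = 4

total_bytes = vocab_size * dim * bytes_per_float
size_gib = total_bytes / 2**30
print(f"{total_bytes:,} bytes = {size_gib:.2f} GiB")
```

So expect to need roughly that much free memory when the model is loaded.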

3. Load the dataset and the pre-trained model

Loading the pre-trained model is easy with gensim.

# load pre-trained word2vec model (the .bin file decompressed from the .gz download)
import gensim

googlenews_w2v = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

For the dataset, json.load would be the natural choice, but the file uses single quotes, so it fails with a JSONDecodeError. As a workaround, we use ast.literal_eval instead.

# load datasets
# json.load cannot be used: the file uses single quotes, so it is not valid JSON
import ast

questions = []
answers = []
with open('qa_Appliances.json', 'r') as f:
    for line in f:
        js = ast.literal_eval(line)  # each line is one Python dict literal
        questions.append(js['question'])
        answers.append(js['answer'])
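
To see the difference on a single record, here is a small sketch that uses a shortened version of the sample record shown earlier (only some of its fields). It only illustrates why one parser fails and the other succeeds.

```python
import ast
import json

# One record in the style of the dataset: a Python dict literal, not JSON.
line = ("{'questionType': 'yes/no', 'asin': 'B00004U9JP', "
        "'question': 'I have a 9 year old Badger 1 that needs replacing, "
        "will this Badger 1 install just like the original one?', "
        "'answer': 'I replaced my old one with this without a hitch.'}")

try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    # JSON requires double-quoted strings, so the single quotes are rejected.
    parsed_as_json = False

# ast.literal_eval only evaluates Python literals (no function calls),
# which makes it a safer choice than eval() for this kind of file.
record = ast.literal_eval(line)

print(parsed_as_json)
print(record['answer'])
```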

Reference: stackoverflow.com

4. Train

Next comes training. First, we split the dataset into a training set and a test set.

# Split dataset into train and test set
import numpy as np
from sklearn.model_selection import train_test_split

qa_texts = np.array(questions + answers)
qa_labels = np.zeros(len(qa_texts), dtype=int)
qa_labels[len(questions):] = 1  # question: 0, answer: 1

# shuffle texts and labels with the same permutation
qa_idx = np.random.permutation(len(qa_texts))
qa_texts = qa_texts[qa_idx]
qa_labels = qa_labels[qa_idx]

X_train, X_test, y_train, y_test = train_test_split(qa_texts, qa_labels)
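
The labelling and splitting logic can also be illustrated without NumPy or scikit-learn. The sketch below uses made-up texts and a hypothetical helper, shuffle_and_split; it is not how train_test_split is implemented, only the same idea on a small scale. The 25% test ratio mirrors train_test_split's default when no size is given.

```python
import random

def shuffle_and_split(texts, labels, test_ratio=0.25, seed=0):
    """Hypothetical helper: shuffle (text, label) pairs together, then split."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)   # each text stays with its label
    n_test = int(len(pairs) * test_ratio)
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test

# Made-up examples: label 0 = question, label 1 = answer
questions = ["Does it fit?", "Is it loud?", "What size is it?", "Is it heavy?"]
answers = ["Yes, it fits.", "Not at all.", "About 10 inches.", "No, it is light."]
texts = questions + answers
labels = [0] * len(questions) + [1] * len(answers)

train, test = shuffle_and_split(texts, labels)
print(len(train), len(test))
```

Shuffling pairs rather than two separate lists is what keeps every text aligned with its label, which is the same purpose the shared permutation index serves above.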

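For feature extraction we keep things simple: the feature vector of a text is the average of the vectors of all the words it contains. This idea comes from the blog post below. For the classifier we use a RandomForest.

nadbordrozd.github.io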

# Simple word embedding feature by averaging word vectors for all words in a text
# ref: http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/#comment-3233012354
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, dim):
        self.word2vec = word2vec
        self.dim = dim

    def fit(self, X, y=None):
        # nothing to learn: the word vectors are already trained
        return self

    def transform(self, X):
        # Each element of X is iterated as a sequence of tokens.
        # Out-of-vocabulary tokens are skipped; a text with no known token
        # falls back to a zero vector.
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec] or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

# Train a model
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

rf_amazon_qa = Pipeline([
    ('word2vec', MeanEmbeddingVectorizer(googlenews_w2v, googlenews_w2v.vector_size)),
    ('randomforest', RandomForestClassifier(n_estimators=200))])
rf_amazon_qa.fit(X_train, y_train)
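
To make the averaging step concrete, here is a small sketch with a made-up 2-dimensional embedding table in place of the Google News model. The helper name mean_embedding is hypothetical, and splitting on whitespace with str.split() is an assumption, one simple way to get tokens. Note that the vectorizer above iterates over each element of X directly, so it expects tokenized input; a raw string would be iterated one character at a time.

```python
# Toy embedding table (made-up values) standing in for the word2vec model.
toy_w2v = {
    "good": [1.0, 0.0],
    "product": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}
DIM = 2

def mean_embedding(text, w2v, dim):
    """Average the vectors of known words; return zeros if none are known."""
    tokens = text.split()                      # assumption: whitespace tokens
    vectors = [w2v[t] for t in tokens if t in w2v]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(component) / len(vectors) for component in zip(*vectors)]

print(mean_embedding("good product", toy_w2v, DIM))        # [0.5, 0.5]
print(mean_embedding("unknown words only", toy_w2v, DIM))  # [0.0, 0.0]
```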

5. Evaluate

Finally, we evaluate on the test data. The result was an accuracy of just under 80%: precision, recall, and F1 all average 0.78.

# Evaluation
from sklearn.metrics import classification_report

y_pred = rf_amazon_qa.predict(X_test)
print(classification_report(y_test, y_pred))

            precision    recall  f1-score   support

          0       0.77      0.81      0.79      2263
          1       0.80      0.76      0.78      2243

avg / total       0.78      0.78      0.78      4506
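
As a quick check on how the columns relate, the f1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values reported in the table; the function name f1 is just a local helper, not a scikit-learn API.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, as reported in the table above.
report = {0: (0.77, 0.81), 1: (0.80, 0.76)}

for label, (p, r) in report.items():
    print(label, round(f1(p, r), 2))
```

Both values round to the f1-scores shown in the report (0.79 for questions, 0.78 for answers).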