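I looked into how to download Kaggle data files in parallel with wget. It is convenient because everything happens on the command line, with no need to write a script.
1. Get the link addresses with pup
Inspect the page with Chrome's developer tools or similar, then use a CSS selector to pull out the addresses. The base URL is prepended with awk.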
$ curl -s https://www.kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/data | pup 'tbody tr td a attr{href}' | awk '{print "https://kaggle.com" $1}'
The result looks like this.
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/dev_train_basic.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/cookie_all_basic.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/ipagg_all.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/dev_test_basic.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/property_category.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/sampleSubmission.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/id_all_ip.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/id_all_property.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/database.sqlite.zip
Redirect the output and save it to urls.txt.
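The awk and redirect steps can be tried offline with a rough sketch like the one below. The file hrefs.txt and its two sample lines are assumptions standing in for the real pup output. On the live page, the same result would come from appending `> urls.txt` to the pipeline shown earlier.

```shell
# Sample relative hrefs, standing in for the output of the pup step
# (hrefs.txt is a hypothetical file used only for this sketch).
printf '%s\n' \
  '/c/icdm-2015-drawbridge-cross-device-connections/download/dev_train_basic.csv.zip' \
  '/c/icdm-2015-drawbridge-cross-device-connections/download/sampleSubmission.csv.zip' \
  > hrefs.txt

# Prepend the base URL with awk and redirect the result into urls.txt.
awk '{print "https://kaggle.com" $1}' hrefs.txt > urls.txt

# Show the saved list: one absolute URL per line, ready for xargs -n 1.
cat urls.txt
```

Keeping one URL per line matters because the later xargs step hands each line to its own wget process.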
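2. Manual steps in the browser
This part is done by hand.
Review the competition rules and save the cookies that wget will use. I use a Chrome extension to export the cookies.
See this for reference as well.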
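wget's --load-cookies option reads the Netscape cookies.txt format, in which each cookie line has seven tab-separated fields. A quick sanity check on the exported file might look like the sketch below. The file cookies_sample.txt and the cookie name and value in it are made up for illustration; in practice the awk line would be run against the real cookies.txt.

```shell
# Hypothetical sample in Netscape cookies.txt format:
# domain, subdomain flag, path, secure flag, expiry, name, value.
printf '# Netscape HTTP Cookie File\n' > cookies_sample.txt
printf '.kaggle.com\tTRUE\t/\tTRUE\t1767225600\texample_name\texample_value\n' >> cookies_sample.txt

# Count non-comment, non-empty lines that do not have exactly 7 tab-separated fields.
bad=$(awk -F '\t' '!/^#/ && NF > 0 && NF != 7 {n++} END {print n+0}' cookies_sample.txt)
echo "malformed cookie lines: $bad"
```

A non-zero count suggests the export is in another format (JSON, for example) and wget will probably not pick up the session.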
3. Parallel download with xargs and wget
Combining xargs with wget makes parallel downloads possible!
Use wget's --load-cookies option to point it at the cookies.txt saved in the manual step.
$ xargs -P 20 -n 1 wget --load-cookies cookies.txt < urls.txt
After that, just wait for the downloads to finish.
sampleSubmission.csv.zip 100%[======================================================>] 126.38K 336KB/s in 0.4s
2015-06-28 22:11:23 (336 KB/s) - 'sampleSubmission.csv.zip' saved [129409/129409]

dev_test_basic.csv.zip 100%[======================================================>] 713.96K 404KB/s in 1.8s
2015-06-28 22:11:25 (404 KB/s) - 'dev_test_basic.csv.zip' saved [731093/731093]

dev_train_basic.csv.zip 100%[======================================================>] 2.14M 473KB/s in 4.6s
2015-06-28 22:11:28 (473 KB/s) - 'dev_train_basic.csv.zip' saved [2248158/2248158]

property_category.csv.zip 100%[======================================================>] 2.96M 374KB/s in 8.1s
2015-06-28 22:11:31 (376 KB/s) - 'property_category.csv.zip' saved [3109199/3109199]

cookie_all_basic.csv.zip 100%[======================================================>] 34.08M 848KB/s in 40s
2015-06-28 22:12:03 (863 KB/s) - 'cookie_all_basic.csv.zip' saved [35739382/35739382]

ipagg_all.csv.zip 100%[======================================================>] 112.23M 1.05MB/s in 2m 34s
2015-06-28 22:13:57 (747 KB/s) - 'ipagg_all.csv.zip' saved [117684953/117684953]

id_all_ip.csv.zip 100%[======================================================>] 225.42M 876KB/s in 3m 55s
database.sqlite.zip 4%[=> ] 162.76M 738KB/s eta 78m 35s
2015-06-28 22:15:18 (982 KB/s) - 'id_all_ip.csv.zip' saved [236366458/236366458]

id_all_property.csv.zip 100%[======================================================>] 356.89M 1.53MB/s in 4m 53s
2015-06-28 22:16:16 (1.22 MB/s) - 'id_all_property.csv.zip' saved [374222833/374222833]

database.sqlite.zip 100%[======================================================>] 3.35G 1.59MB/s in 39m 2s
2015-06-28 22:50:25 (1.46 MB/s) - 'database.sqlite.zip' saved [3594725459/3594725459]
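To see what the xargs options do without touching the network, a dry run with echo in place of wget can help. This is only a sketch: the example.com URLs and the files urls_demo.txt and planned.txt are hypothetical. -n 1 passes one URL to each command, and -P sets the maximum number of commands running at once (an extension available in both GNU and BSD xargs).

```shell
# Hypothetical URL list used only for the dry run.
printf '%s\n' \
  'https://example.com/a.zip' \
  'https://example.com/b.zip' \
  'https://example.com/c.zip' \
  > urls_demo.txt

# echo prints each command instead of running it.
# -n 1: one URL per wget invocation; -P 2: at most two invocations at a time.
# Parallel output order is not guaranteed, so sort it for a stable listing.
xargs -P 2 -n 1 echo wget --load-cookies cookies.txt < urls_demo.txt | sort > planned.txt
cat planned.txt
```

Each line of planned.txt is one wget command that the real run would start. Removing echo and raising -P gives the command used in this post.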