stMind

You'll never blog alone

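Downloading Kaggle data files in parallel with wget

I looked into how to download Kaggle data files in parallel with wget. There is no need to write a script; everything is done on the command line, so it is quick and easy.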

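1. Get the link addresses with pup

Inspect the page with something like Chrome Developer Tools, then pull out the addresses with a CSS selector. The base URL is prepended with awk.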

$ curl -s https://www.kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/data | pup 'tbody tr td a attr{href}' | awk '{print "https://kaggle.com" $1}'

The result looks like this.

https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/dev_train_basic.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/cookie_all_basic.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/ipagg_all.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/dev_test_basic.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/property_category.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/sampleSubmission.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/id_all_ip.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/id_all_property.csv.zip
https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/database.sqlite.zip

Redirect the output to save it as urls.txt.
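As a minimal sketch of the prefix-and-redirect step, the awk stage can be wrapped in a small function. The name save_urls and the file urls_sample.txt are made up for illustration, and the two hrefs below stand in for what pup would print.

```shell
# save_urls FILE: read relative hrefs on stdin, prepend the site root,
# and write the absolute URLs to FILE (hypothetical helper, not part of pup or wget).
save_urls() {
  awk '{print "https://kaggle.com" $1}' > "$1"
}

# Stand-in for the pup output: two of the relative hrefs from the data page.
printf '%s\n' \
  '/c/icdm-2015-drawbridge-cross-device-connections/download/dev_train_basic.csv.zip' \
  '/c/icdm-2015-drawbridge-cross-device-connections/download/sampleSubmission.csv.zip' |
  save_urls urls_sample.txt

cat urls_sample.txt
```

In the real pipeline the same effect comes from appending `> urls.txt` to the curl, pup, and awk command shown earlier.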

2. Manual work in the browser

This part is done by hand.
Review the competition rules and save the cookies that wget will use. A Chrome extension is used to export the cookies.

chrome.google.com

See also: Web Scraping: III | Yaser Martinez | Personal blog
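wget's --load-cookies reads the Netscape cookie file format: one cookie per line, seven tab-separated fields, with the domain in the first field. A rough sanity check, assuming the export follows that layout, is to count the rows for the site before starting the downloads. The function name count_site_cookies, the file sample_cookies.txt, and the dummy cookie values are all invented for this sketch.

```shell
# count_site_cookies FILE DOMAIN: print how many well-formed cookie rows
# in FILE belong to DOMAIN or one of its subdomains (hypothetical helper).
count_site_cookies() {
  awk -F '\t' -v dom="$2" '
    /^#/ || NF != 7 { next }            # skip comment lines and malformed rows
    { d = $1; sub(/^\./, "", d) }        # drop a leading dot from the domain field
    d == dom || substr(d, length(d) - length(dom)) == ("." dom) { n++ }
    END { print n + 0 }
  ' "$1"
}

# Dummy cookie file with made-up values, only to exercise the function.
printf '# Netscape HTTP Cookie File\n' > sample_cookies.txt
printf '.kaggle.com\tTRUE\t/\tFALSE\t0\tdummy_name\tdummy_value\n' >> sample_cookies.txt
printf '.example.org\tTRUE\t/\tFALSE\t0\tother_name\tother_value\n' >> sample_cookies.txt

count_site_cookies sample_cookies.txt kaggle.com
```

A count of zero on the real cookies.txt would suggest the export did not capture the logged-in session. Note that this sketch skips every line starting with `#`, so any cookie an exporter writes with a `#` prefix is not counted.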

3. Download in parallel with xargs and wget

Combining xargs with wget makes parallel downloads possible!

blog.layer8.sh

Use wget's --load-cookies option to point it at the cookies.txt saved in the manual step.

$ xargs -P 20 -n 1 wget --load-cookies cookies.txt < urls.txt
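To see what xargs will actually launch, one option is a dry run that puts echo in front of wget, so each command line is printed instead of executed. Here -n 1 hands one URL to each invocation and -P sets how many run at once. The file urls_sample_dry.txt and the worker count of 2 are choices made for this sketch, not part of the original setup.

```shell
# Small sample list so the dry run does not touch the real urls.txt.
printf '%s\n' \
  'https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/dev_train_basic.csv.zip' \
  'https://kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/download/dev_test_basic.csv.zip' \
  > urls_sample_dry.txt

# echo prints each wget command line instead of running it:
# one URL per command (-n 1), up to two commands at a time (-P 2).
xargs -P 2 -n 1 echo wget --load-cookies cookies.txt < urls_sample_dry.txt
```

Because the commands run in parallel, the printed lines may appear in either order. Removing echo turns the dry run back into the real download.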

After that, just wait for the downloads to finish.

sampleSubmission.csv.zip        100%[======================================================>] 126.38K   336KB/s   in 0.4s   

2015-06-28 22:11:23 (336 KB/s) - 'sampleSubmission.csv.zip' saved [129409/129409]

dev_test_basic.csv.zip          100%[======================================================>] 713.96K   404KB/s   in 1.8s   

2015-06-28 22:11:25 (404 KB/s) - 'dev_test_basic.csv.zip' saved [731093/731093]

dev_train_basic.csv.zip         100%[======================================================>]   2.14M   473KB/s   in 4.6s   

2015-06-28 22:11:28 (473 KB/s) - 'dev_train_basic.csv.zip' saved [2248158/2248158]

property_category.csv.zip       100%[======================================================>]   2.96M   374KB/s   in 8.1s

2015-06-28 22:11:31 (376 KB/s) - 'property_category.csv.zip' saved [3109199/3109199]

cookie_all_basic.csv.zip        100%[======================================================>]  34.08M   848KB/s   in 40s

2015-06-28 22:12:03 (863 KB/s) - 'cookie_all_basic.csv.zip' saved [35739382/35739382]

ipagg_all.csv.zip               100%[======================================================>] 112.23M  1.05MB/s   in 2m 34s

2015-06-28 22:13:57 (747 KB/s) - 'ipagg_all.csv.zip' saved [117684953/117684953]

id_all_ip.csv.zip               100%[======================================================>] 225.42M   876KB/s   in 3m 55s

2015-06-28 22:15:18 (982 KB/s) - 'id_all_ip.csv.zip' saved [236366458/236366458]

database.sqlite.zip               4%[=>                                                     ] 162.76M   738KB/s   eta 78m 35s

id_all_property.csv.zip         100%[======================================================>] 356.89M  1.53MB/s   in 4m 53s

2015-06-28 22:16:16 (1.22 MB/s) - 'id_all_property.csv.zip' saved [374222833/374222833]

database.sqlite.zip             100%[======================================================>]   3.35G  1.59MB/s   in 39m 2s
2015-06-28 22:50:25 (1.46 MB/s) - 'database.sqlite.zip' saved [3594725459/3594725459]