The official ImageNet website is a pain to navigate. If you want to download the most common version of ImageNet (about 150 gigabytes of images), you can get it from Kaggle: https://www.kaggle.com/c/imagenet-object-localization-challenge/data?select=imagenet_object_localization_patched2019.tar.gz.

But if you prefer to download from the command line without logging in through Kaggle, or if you are looking for the full version of ImageNet (about 1.3 terabytes of images), you can use AcademicTorrents. AcademicTorrents is a great way to get datasets and can be faster than a direct download! The full list of ImageNet datasets is here: https://academictorrents.com/collection/imagenet. You will need a torrent client, such as aria2c or the AcademicTorrents Python client.

The standard ILSVRC 2012 dataset is available on AcademicTorrents as two torrents: the training set (147.9GB) and the validation set (6.7GB). You can download them with:

aria2c https://academictorrents.com/download/a306397ccf9c2ead27155983c254227c0fd938e2.torrent
aria2c https://academictorrents.com/download/5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5.torrent

The full version of ImageNet (1.31TB) is available as a single torrent. The one-liner is:

aria2c https://academictorrents.com/download/564a77c1e1119da199ff32622a1609431b9f1c47.torrent
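The three commands above all follow the same URL pattern, so if you want to queue several downloads you can wrap them in a small helper. This is only a sketch: torrent_url and download_all are hypothetical names I made up, and the URL pattern is simply the one used in the commands above.

```shell
# Build the AcademicTorrents download URL for a given infohash.
torrent_url() {
  printf 'https://academictorrents.com/download/%s.torrent\n' "$1"
}

# Download every infohash passed as an argument, one after another.
# (download_all is a hypothetical helper name, not part of aria2c.)
download_all() {
  local hash
  for hash in "$@"; do
    # Stop at the first failed download instead of silently continuing.
    aria2c "$(torrent_url "$hash")" || return 1
  done
}
```

For example, download_all a306397ccf9c2ead27155983c254227c0fd938e2 5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5 would fetch the training and validation sets in sequence.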

Note that aria2c will keep seeding the torrent for a bit after the download is complete. You can stop it if you prefer. Otherwise, it will stop on its own after 1 minute, helping your fellow researchers by sharing the dataset in the meantime.
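If you would rather control the seeding behaviour explicitly than rely on the defaults, aria2c has the --seed-time and --seed-ratio options. Here is a minimal sketch; the two wrapper names (fetch_no_seed and fetch_and_seed) are hypothetical, only the aria2c options are real.

```shell
# Download a torrent and exit as soon as it completes, without seeding.
# --seed-time=0 is the aria2c option that disables seeding after the download.
fetch_no_seed() {
  aria2c --seed-time=0 "$@"
}

# Download a torrent, then keep seeding for a given number of minutes.
# --seed-ratio=0.0 makes aria2c ignore the share ratio, so only the time limit applies.
fetch_and_seed() {
  local minutes="$1"
  shift
  aria2c --seed-time="$minutes" --seed-ratio=0.0 "$@"
}
```

For example, fetch_and_seed 30 followed by a torrent URL would share the dataset for half an hour after the download finishes.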

Torrent clients

To get a torrent client, you can do one of the following:

  • If you have sudo access on Debian or Ubuntu, you can simply do: sudo apt-get install -y aria2.
  • To install aria2c through Conda without sudo access, you can do: conda install -c bioconda aria2 -y
  • Alternatively, if you just have a Python environment, you can do pip install academictorrents and then use their command line tool: at-get 564a77c1e1119da199ff32622a1609431b9f1c47 (you don’t need to put the rest of the URL)
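If you run the same setup script on several machines, you may not know in advance which of these clients is installed. The sketch below picks whichever one is on the PATH; pick_client and fetch_infohash are hypothetical helper names, and the two invocations are the ones shown in this post.

```shell
# Print the name of the first available torrent client, or fail if none is installed.
pick_client() {
  if command -v aria2c >/dev/null 2>&1; then
    echo aria2c
  elif command -v at-get >/dev/null 2>&1; then
    echo at-get
  else
    echo "no torrent client found; install aria2 or academictorrents" >&2
    return 1
  fi
}

# Download one AcademicTorrents infohash with whichever client is available.
fetch_infohash() {
  local hash="$1"
  case "$(pick_client)" in
    aria2c) aria2c "https://academictorrents.com/download/${hash}.torrent" ;;
    at-get) at-get "$hash" ;;
    *) return 1 ;;
  esac
}
```

For example, fetch_infohash 5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5 would download the validation set with either client.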

Extracting the images in parallel

Finally, if you need to extract all the archives from the training set or the full version, I recommend using GNU Parallel. Here’s an example with 4 jobs that also deletes each archive after it is extracted (you might want to add the --dry-run option to parallel to check what it does before you run it):

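# Unpack the outer archive into train/ (one .tar per class), then delete it only if extraction succeeded
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train && tar -xvf ILSVRC2012_img_train.tar && rm ILSVRC2012_img_train.tar
# Extract each class archive into its own directory, 4 at a time, and delete it only if extraction succeeded
ls *.tar | sed -e 's/\.tar$//' | parallel -I@ -j4 'mkdir @ && tar -xf @.tar --directory @ && rm @.tar'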
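Since the archives get deleted along the way, it is worth checking the result once the extraction finishes. Here is a small sketch with two hypothetical helpers (count_class_dirs and count_leftover_tars); it assumes a find that supports -mindepth and -maxdepth, as GNU and BSD find do.

```shell
# Count the immediate subdirectories of a directory (one per class after extraction).
count_class_dirs() {
  find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Count .tar files left directly inside a directory (should be 0 once extraction finished).
count_leftover_tars() {
  find "$1" -mindepth 1 -maxdepth 1 -type f -name '*.tar' | wc -l | tr -d ' '
}
```

For the ILSVRC 2012 training set, which has 1000 classes, count_class_dirs train should print 1000 and count_leftover_tars train should print 0 (run both from the directory that contains train/).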

For the validation set, there are no nested archives, so you can extract the images directly and then sort them into per-class folders with this script referenced in the PyTorch documentation:

mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
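As a last sanity check, you can count the extracted images and compare against the expected totals. This is a sketch with hypothetical helper names (count_images and check_image_count); the expected numbers in the usage note are the commonly cited ILSVRC 2012 figures, not something the scripts above report.

```shell
# Count JPEG files anywhere under a directory (ILSVRC images use the .JPEG extension).
count_images() {
  find "$1" -type f -iname '*.jpeg' | wc -l | tr -d ' '
}

# Compare the image count against an expected number and report the result.
check_image_count() {
  local dir="$1" expected="$2" actual
  actual="$(count_images "$dir")"
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $dir has $actual images"
  else
    echo "MISMATCH: $dir has $actual images, expected $expected" >&2
    return 1
  fi
}
```

For ILSVRC 2012, check_image_count val 50000 and check_image_count train 1281167 should both report OK when run from the directory that contains val/ and train/.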