The official ImageNet website is a pain to navigate. If you want to download the most common version of ImageNet (with about 150 gigabytes of images), you can get it from Kaggle: https://www.kaggle.com/c/imagenet-object-localization-challenge/data?select=imagenet_object_localization_patched2019.tar.gz.
But if you prefer to download from the command line without logging in through Kaggle, or if you are looking for the full version of ImageNet (with 1.5 terabytes of images), you can use AcademicTorrents. AcademicTorrents is a great way to get datasets and can be faster than a direct download! The full list of ImageNet datasets is here: https://academictorrents.com/collection/imagenet. You will need a torrent client, such as aria2c or the AcademicTorrents Python client.
Links for the ILSVRC 2012 dataset
aria2c https://academictorrents.com/download/a306397ccf9c2ead27155983c254227c0fd938e2.torrent
aria2c https://academictorrents.com/download/5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5.torrent
And here is the link for the full version of ImageNet (1.31TB). The one-liner is:
Note that aria2c will keep seeding the torrent for a bit after the download is complete. You can stop it if you prefer. Otherwise, it will stop after 1 minute and help your fellow researchers by sharing the dataset in the meantime.
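If you would rather set the seeding time explicitly, aria2c has a --seed-time option that takes a number of minutes (0 disables seeding once the download completes). Here is a minimal sketch: download_torrent is a hypothetical helper name of mine, and its default of 1 minute is only an assumption chosen to match the behavior described above.

```shell
#!/usr/bin/env bash
# download_torrent is a hypothetical wrapper, not part of aria2.
# $1: URL of a .torrent file
# $2: minutes to keep seeding after completion
#     (assumed default: 1; pass 0 to disable seeding)
download_torrent() {
    local url="$1"
    local seed_minutes="${2:-1}"
    aria2c --seed-time="$seed_minutes" "$url"
}
```

For example, passing 0 as the second argument after one of the torrent URLs above should download the archive without seeding afterwards.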
To get a torrent client, you can do one of the following:
- If you have sudo rights on Debian or Ubuntu, you can simply run:
sudo apt-get install -y aria2
- To install aria2c through Conda without sudo rights, you can run:
conda install -c bioconda aria2 -y
- Alternatively, if you just have a Python environment, you can run pip install academictorrents and then use their command-line tool:
at-get 564a77c1e1119da199ff32622a1609431b9f1c47
(you don’t need to put the rest of the URL)
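If you are scripting the download on several machines, it can help to check which of these clients is already installed before picking a command. A small sketch of that check, where pick_torrent_client is a hypothetical helper name and the preference for aria2c over at-get is just my assumption:

```shell
#!/usr/bin/env bash
# pick_torrent_client is a hypothetical helper, not part of either tool.
# Prints the first client found on PATH; prints "none" and returns 1 otherwise.
pick_torrent_client() {
    if command -v aria2c >/dev/null 2>&1; then
        echo "aria2c"
    elif command -v at-get >/dev/null 2>&1; then
        echo "at-get"
    else
        echo "none"
        return 1
    fi
}
```

Keep in mind that the two clients take different arguments: aria2c expects the full .torrent URL, while at-get only needs the hash.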
Extracting the images in parallel
Finally, if you need to extract all the archives from the training set or the full version, I recommend using GNU Parallel. Here’s an example with 4 jobs that also deletes the archives after they are extracted (you might want to add the --dry-run option to parallel to check what it does before you run it):
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train && tar -xvf ILSVRC2012_img_train.tar
rm ILSVRC2012_img_train.tar
ls *.tar | sed -e 's/\.tar$//' | parallel -I@ -j4 'mkdir @; tar -xf @.tar --directory @; rm @.tar'
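If GNU Parallel is not available, the same per-class extraction can be done sequentially with a plain loop. This is only a sketch of a slower fallback: extract_class_archives is a hypothetical helper name, and unlike the one-liner above it only deletes an archive when tar reports success.

```shell
#!/usr/bin/env bash
# extract_class_archives is a hypothetical helper (sequential fallback).
# $1: directory containing one .tar archive per class
extract_class_archives() {
    local dir="$1"
    (
        cd "$dir" || exit 1
        for archive in *.tar; do
            # Skip the literal pattern when no archive matches
            [ -e "$archive" ] || continue
            class="${archive%.tar}"
            mkdir -p "$class"
            # Remove the archive only if extraction succeeded
            tar -xf "$archive" -C "$class" && rm "$archive"
        done
    )
}
```

From inside the train directory, calling extract_class_archives . would leave one folder per class and no remaining class archives.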
For the validation set, there are no nested archives so you can just extract images and process them with this script from the PyTorch documentation:
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
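Once everything is extracted, a quick sanity check is to count the class folders and the images. A sketch with two hypothetical helpers, count_classes and count_images; the expected numbers assume the standard ILSVRC 2012 classification data (1,000 classes, 1,281,167 training images, 50,000 validation images) and that images use the .JPEG extension:

```shell
#!/usr/bin/env bash
# count_classes and count_images are hypothetical helper names.
# $1: directory to inspect (e.g. train or val)

# Number of immediate subdirectories, i.e. class folders
count_classes() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Number of .JPEG files anywhere below the directory
count_images() {
    find "$1" -type f -name '*.JPEG' | wc -l | tr -d ' '
}
```

Under those assumptions, count_classes should report 1000 for both train and val, and count_images should report 1281167 for train and 50000 for val.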