File Management
To help us optimize storage space and improve the performance of the /project file system, we kindly ask you to combine your files into archives (e.g., tar files), preferably compressed, rather than storing numerous individual files in the /project file system. Compression reduces the amount of storage used, and archiving simplifies file management (e.g., backup and recovery) and transfer.
File Compression
In order to save space, please compress your files before transferring them into your group's /project file system. The gzip command can be used to compress files, but it uses a single thread on a single core. The pigz command is a parallel implementation of gzip that can run with multiple threads, making use of multiple cores. The unpigz command is equivalent to gunzip and can be used to uncompress gzip'ed files. The .gz files created by pigz are compatible with gzip/gunzip. The pigz command is particularly helpful for compressing large files or a large number of files (or a folder).
The compute nodes on Sapelo2 and on the teaching cluster have pigz installed centrally, so you don't need to load any modules in order to use this command. The help page for this command shows the available options, and it can be viewed with the command
pigz --help
Some simple examples
Compress a file
pigz filename
Compress a file with the best compression ratio (the -9 and --best options are equivalent)
pigz -9 filename
pigz --best filename
Uncompress a file
unpigz filename.gz
We suggest that you run pigz in an interactive job that requests multiple cores, using the '-p num_threads' option to specify the number of threads (num_threads) to use.
For example, start an interactive session with 10 cores and 4GB of RAM with
interact -c 10 --mem=4g
and then run pigz with 10 threads with
pigz --best -p 10 my_big_file
To recursively compress all files in a directory (e.g., one called dirname), use the -r option. Each file is compressed individually into its own .gz file. For example, using 10 threads:
pigz --best -r -p 10 dirname
To recursively uncompress all files in a directory:
unpigz -r dirname
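The thread count given to -p can also be taken from the job instead of being typed by hand. The following is a minimal sketch, assuming the session was started through Slurm with --cpus-per-task (which is what the interact command does in the sample transcript on this page), so that Slurm sets the SLURM_CPUS_PER_TASK environment variable. The helper name compress_with_job_cores is made up for this illustration.

```shell
# Number of cores Slurm granted to this job; defaults to 1 outside a job.
threads="${SLURM_CPUS_PER_TASK:-1}"

# compress_with_job_cores FILE...
# Hypothetical helper: compress each FILE with pigz, one thread per core.
compress_with_job_cores() {
    pigz --best -p "$threads" "$@"
}
```

For example, compress_with_job_cores my_big_file would run pigz with 10 threads inside a session started with interact -c 10.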
Sample timing comparison
This example shows the time required to compress a single 10GB file in an interactive session that has 10 cores available. The gzip command took about 84 seconds and produced a 10MB file, while the pigz command, run with 10 threads and the --best option, took about 22 seconds and produced a 12MB file. Uncompressing the file took about 84 seconds with gunzip and about 37 seconds with unpigz.
[shtsai@ss-sub3]$ interact -c 10
srun --pty --cpus-per-task=10 --job-name=interact --ntasks=1 --nodes=1 --partition=inter_p --time=12:00:00 --mem=2GB /bin/bash -l
[shtsai@c2-17]$ ls -lh 10g.img
-rw-r--r-- 1 shtsai gclab 10G Dec 16 2021 10g.img
[shtsai@c2-17]$ time gzip 10g.img
real 1m24.437s
user 1m19.214s
sys 0m4.932s
[shtsai@c2-17]$ ls -lh 10g.img.gz
-rw-r--r-- 1 shtsai gclab 10M Dec 16 2021 10g.img.gz
[shtsai@c2-17]$ time gunzip 10g.img.gz
real 1m23.511s
user 1m3.506s
sys 0m18.855s
[shtsai@c2-17]$ ls -lh 10g.img
-rw-r--r-- 1 shtsai gclab 10G Dec 16 2021 10g.img
[shtsai@c2-17]$ time pigz --best -p 10 10g.img
real 0m22.028s
user 1m39.639s
sys 0m9.134s
[shtsai@c2-17]$ ls -lh 10g.img.gz
-rw-r--r-- 1 shtsai gclab 12M Dec 16 2021 10g.img.gz
[shtsai@c2-17 shtsai]$ time unpigz 10g.img.gz
real 0m36.700s
user 0m45.593s
sys 0m21.935s
[shtsai@c2-17]$ ls -lh 10g.img
-rw-r--r-- 1 shtsai gclab 10G Dec 16 2021 10g.img
Creating tar files
Having a large number of files in a file system can overload the storage metadata server and delay data recovery from backups, etc. If you need to store a large number of files in your group's /project area, instead of storing a large number of individual files, please first create a tar file with the files or with a directory, and transfer the tar file to /project.
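To judge whether a directory holds enough files to be worth archiving, it can help to count them first. The following is a small sketch using standard find and wc. The helper name count_files is hypothetical, and the count is only approximate if file names contain newline characters.

```shell
# count_files DIR
# Hypothetical helper: print the number of regular files under DIR,
# including files in all of its subdirectories.
count_files() {
    find "$1" -type f | wc -l
}
```

For example, count_files dirname prints a single number for the directory called dirname.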
The tar command can be run in an interactive job on Sapelo2 with
tar cvf dirname.tar dirname
Note: Please do not run tar directly on the Sapelo2 login nodes.
In an interactive session that requested multiple cores, a tar file can be compressed with pigz, using multiple cores, with
pigz dirname.tar
Alternatively, you could use pigz to compress the files in your directory before creating a tar file.
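The two steps (create the tar file, then compress it) can also be combined into one pipeline, which avoids writing an uncompressed .tar file to disk. The following is a minimal sketch, not a site-recommended procedure. The helper name archive_dir is made up for this example, and the default thread count assumes a Slurm session in which SLURM_CPUS_PER_TASK is set.

```shell
# archive_dir DIR [THREADS]
# Hypothetical helper: write DIR.tar.gz in one step, with no intermediate .tar.
archive_dir() {
    local dir="${1%/}"    # drop a trailing slash, if any
    local nthreads="${2:-${SLURM_CPUS_PER_TASK:-1}}"
    # tar writes the archive to standard output (f -); pigz compresses that
    # stream with $nthreads threads and the shell redirects it to the file.
    tar cf - "$dir" | pigz -p "$nthreads" > "${dir}.tar.gz"
}
```

For example, archive_dir dirname 10 creates dirname.tar.gz using 10 threads. The original directory is left in place, so it is still available until the archive has been checked.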
To extract the files from a tar file:
tar xvf dirname.tar
To extract the files from a tar.gz file:
tar zxvf dirname.tar.gz
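Before deleting the original files, it is worth checking that the archive is readable. The following is a small sketch using the standard gzip -t (test integrity) and tar t (list contents) options. The helper name check_archive is hypothetical, and pigz -t could be used in place of gzip -t.

```shell
# check_archive FILE.tar.gz
# Hypothetical helper: succeed only if the gzip stream is intact and tar
# can read the complete listing of the archive.
check_archive() {
    gzip -t "$1" && tar tzf "$1" > /dev/null
}
```

For example, check_archive dirname.tar.gz && echo "archive OK" prints the message only when both checks pass. To see the contents without extracting them, use tar tzvf dirname.tar.gz.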