File Management

To help us optimize storage space and improve the /project file system performance, we kindly ask you to compress your files into archives (e.g., tar files) rather than storing numerous individual files in the /project file system. Compressing files not only reduces the amount of storage used, but also simplifies file management (e.g., file backup and recovery) and transfer.
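For example, a directory of results can be packed into a single compressed archive before it is copied to /project. The sketch below should be run inside an interactive or batch job (see the notes in the sections that follow); the directory name <code>results</code> and the destination <code>/project/mylab</code> are placeholders for your own data and your group's /project area.
<pre class="gcommand">
tar cvf results.tar results/        # pack the directory into a single tar file (placeholder name)
pigz -p 4 results.tar               # compress it in parallel, producing results.tar.gz
cp results.tar.gz /project/mylab/   # placeholder path: use your group's /project directory
</pre>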




==File Compression==
In order to save space, please compress your files before transferring them into your group's /project file system. The <code>gzip</code> command can be used to compress files, but it uses a single thread on a single core.  


The <code>pigz</code> command is a parallel implementation of <code>gzip</code> that can run with multiple threads, making use of multiple cores. The <code>unpigz</code> command is equivalent to <code>gunzip</code> and can be used to uncompress gzipped files. The .gz files created by pigz are compatible with gzip/gunzip. The pigz command is particularly helpful for compressing a large number of files (or a folder) or for compressing large files.


The compute nodes on Sapelo2 and on the teaching cluster have pigz installed centrally, so you don't need to load any modules in order to use this command. The help page for this command shows the available options, and it can be viewed with the command
<pre class="gcommand">
pigz --help
</pre>


<blockquote style="background-color: lightyellow; border: solid thin grey;">
'''Note:''' Please do not run <code>gzip/gunzip/pigz/unpigz</code> commands directly on the Sapelo2 login nodes. Instead, please use either an interactive or a batch job for compressing and uncompressing files. If we detect these commands being run on the login nodes, we may have to cancel them to avoid overloading the login nodes.
</blockquote>
 
===Some simple examples===


Compress a file
<pre class="gcommand">
pigz filename
</pre>

Compress a file with best compression rate
<pre class="gcommand">
pigz -9 filename
pigz --best filename
</pre>

Uncompress a file
<pre class="gcommand">
unpigz filename.gz
</pre>




We suggest that you run <code>pigz</code> in an interactive job that requests multiple cores and run <code>pigz</code> with the '-p num_threads' option to specify the number of threads (num_threads) to use.


For example, start an interactive session with 10 cores and 4GB of RAM with
<pre class="gcommand">
interact -c 10 --mem=4g
</pre>
and then run pigz with 10 threads with
<pre class="gcommand">
pigz --best -p 10 my_big_file
</pre>


To recursively compress all files in a directory (e.g. a directory called dirname), use the -r option. For example, using 10 threads
<pre class="gcommand">
pigz --best -r -p 10 dirname
</pre>


To recursively uncompress all files in a directory:
<pre class="gcommand">
unpigz -r dirname
</pre>
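The same compression can also be submitted as a batch job instead of an interactive session. Below is a minimal sketch of a Slurm submission script; the job name, partition, resource requests, and file name are placeholders to adjust for your own work (in particular, the <code>batch</code> partition name is an assumption).
<pre class="gcommand">
#!/bin/bash
#SBATCH --job-name=compress         # placeholder job name
#SBATCH --partition=batch           # assumption: use the appropriate batch partition
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10          # request 10 cores for pigz
#SBATCH --mem=4G
#SBATCH --time=02:00:00

cd $SLURM_SUBMIT_DIR
pigz --best -p 10 my_big_file       # placeholder file name; match -p to --cpus-per-task
</pre>
Submit the script with <code>sbatch</code>, and make sure the thread count given to -p matches the number of cores requested with --cpus-per-task.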
===Sample timing comparison===
This example shows the time required to compress a single 10GB file in an interactive session with 10 cores available. The gzip command took about 84 seconds to compress the 10GB file down to 10MB, while the pigz command took about 22 seconds to compress it down to 12MB.
<pre class="gcommand">
[shtsai@ss-sub3]$ interact -c 10
srun --pty  --cpus-per-task=10 --job-name=interact --ntasks=1 --nodes=1 --partition=inter_p --time=12:00:00 --mem=2GB /bin/bash -l
[shtsai@c2-17]$ ls -lh 10g.img
-rw-r--r-- 1 shtsai gclab 10G Dec 16  2021 10g.img
[shtsai@c2-17]$ time gzip 10g.img
real 1m24.437s
user 1m19.214s
sys 0m4.932s
[shtsai@c2-17]$ ls -lh 10g.img.gz
-rw-r--r-- 1 shtsai gclab 10M Dec 16  2021 10g.img.gz
[shtsai@c2-17]$ time gunzip 10g.img.gz
real 1m23.511s
user 1m3.506s
sys 0m18.855s
[shtsai@c2-17]$ ls -lh 10g.img
-rw-r--r-- 1 shtsai gclab 10G Dec 16  2021 10g.img
[shtsai@c2-17]$ time pigz --best -p 10 10g.img
real 0m22.028s
user 1m39.639s
sys 0m9.134s
[shtsai@c2-17]$ ls -lh 10g.img.gz
-rw-r--r-- 1 shtsai gclab 12M Dec 16  2021 10g.img.gz
[shtsai@c2-17 shtsai]$ time unpigz 10g.img.gz
real 0m36.700s
user 0m45.593s
sys 0m21.935s
[shtsai@c2-17]$ ls -lh 10g.img
-rw-r--r-- 1 shtsai gclab 10G Dec 16  2021 10g.img
</pre>
===References===
* pigz home page: https://zlib.net/pigz/
* pigz manual page: https://zlib.net/pigz/pigz.pdf
* gzip manual: https://www.gnu.org/software/gzip/manual/gzip.html




==Creating tar files==
Having a large number of files in a file system can overload the storage metadata server and delay data recovery from backups. If you need to store many files in your group's /project area, please first create a tar file from those files (or from a directory) and transfer the tar file to /project, instead of transferring the individual files.


The <code>tar</code> command can be run in an interactive job on Sapelo2 with
<pre class="gcommand">
tar cvf dirname.tar dirname
</pre>
<blockquote style="background-color: lightyellow; border: solid thin grey;">
'''Note:''' Please do not run <code>tar</code> directly on the Sapelo2 login nodes. Instead, please use either an interactive or a batch job for creating or extracting archive files.
</blockquote>


A tar file can be compressed with <code>pigz</code> using multiple cores. In an interactive session that requested multiple cores, run
<pre class="gcommand">
pigz dirname.tar
</pre>
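If you prefer to create the compressed archive in one step, a reasonably recent GNU tar can hand the compression off to pigz through its --use-compress-program option. This is a sketch using the same dirname directory and 10 threads; adjust the thread count to the number of cores you requested.
<pre class="gcommand">
tar --use-compress-program="pigz -p 10" -cvf dirname.tar.gz dirname   # tar and compress in one pass
</pre>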
Alternatively, you could use pigz to compress the files in your directory before creating a tar file.

To extract the files from a tar file:
<pre class="gcommand">
tar xvf dirname.tar
</pre>


To extract the files from a tar.gz file:
<pre class="gcommand">
tar zxvf dirname.tar.gz
</pre>
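Before extracting an archive, you can check its contents without writing any files by using the t (list) option instead of x. A small sketch, assuming the dirname.tar.gz file from the example above (like the other commands on this page, please run it in an interactive or batch job):
<pre class="gcommand">
tar ztvf dirname.tar.gz | head    # list the first few entries of the archive
</pre>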
==File transfer==
The recommended way to transfer files between file systems on the cluster, or between GACRC and external storage systems, is to use Globus. For more information, please see https://wiki.gacrc.uga.edu/wiki/Globus
