 

BioHPC Lab:
User Guide

 


Overview


    The purpose of our backup system is to create and periodically update snapshots of selected directories while retaining, for some time, copies of files that have been deleted or changed.

    In order to use our backup system, the user needs to purchase backup storage (see pricing). For increased data safety, the backup storage servers are located in Weill Hall, separately from the rest of our computational infrastructure.

    After purchasing backup storage, the user specifies one or more directories they wish to back up. Each such directory becomes a backup root. A typical example would be your home directory, although it is also possible to specify other directories, such as a subdirectory of your home directory, or a directory located on one of the hosted servers. Each backup root is backed up entirely (recursively, with all files and subdirectories), except for subdirectories or files that are explicitly excluded.

    When a given directory is backed up for the first time, the entire directory (except exclusions) is copied to the backup server, i.e., its current snapshot is created, reflecting the directory's state at backup time. The next time the backup runs, this current snapshot is updated, i.e., files that the user has removed, added, or changed in the source directory in the meantime are also removed from, added to, or changed in the current snapshot. However, the files that have been removed, as well as previous versions of those that changed, are saved on the backup server in a backup snapshot labeled with the backup date and time. The backup snapshot contains only files that have been changed or removed by the user from the source directory since the previous backup cycle. Subsequent backup cycles update the current snapshot, create new, dated backup snapshots, and remove the older ones. This process is illustrated in the figure below.



    Thus, the backup server will always contain the current snapshot, reflecting the state of the directory as of the latest backup, plus a number of dated backup snapshots containing files changed or removed between previous backup cycles. Multiple snapshots make it possible to retrieve old versions of files whenever needed. Both the maximum age of the backup snapshots to be kept and the backup frequency are configurable by the user.
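
    The update cycle described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual backup software: directories are represented as dictionaries mapping file paths to file contents, and the function name run_backup_cycle and the snapshot labels are hypothetical.

```python
def run_backup_cycle(source, current):
    """Bring `current` in line with `source` and return the backup snapshot.

    Both arguments map relative file paths to file contents. The returned
    dictionary holds the previous versions of files that were changed or
    removed since the last cycle (the content of a dated backup snapshot).
    """
    backup_snapshot = {}
    for path, old_content in current.items():
        if path not in source or source[path] != old_content:
            # File was removed or changed: save its previous version.
            backup_snapshot[path] = old_content
    # The current snapshot now mirrors the source directory.
    current.clear()
    current.update(source)
    return backup_snapshot


# Illustrative run with made-up file names and snapshot labels.
current = {}
dated_snapshots = {}

# First backup: everything is copied, nothing old to save.
dated_snapshots["bak_cycle1"] = run_backup_cycle(
    {"a.txt": "v1", "b.txt": "v1"}, current)

# Second backup: a.txt changed, b.txt removed, c.txt added.
dated_snapshots["bak_cycle2"] = run_backup_cycle(
    {"a.txt": "v2", "c.txt": "v1"}, current)
```

    After the second cycle in this model, the current snapshot holds the new a.txt and c.txt, while the second dated snapshot holds the old a.txt and the removed b.txt.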

Parameters controlling backup


    Backup is controlled by three parameters, set individually for each backup root directory:

  • Retention: age (in days) of the oldest version of the backup root directory to be kept
  • Frequency: backup frequency (e.g., setting this to 3 means backup of this directory will be run every 3 days)
  • MinSave: minimum number of old versions of the directory to be saved always, regardless of age (prevents a possibility of all previous versions being erased if the original directory is not changed for longer than Retention days)

    Besides setting these parameters, the user can also specify exclusions: files and/or subdirectories of the backup root to be omitted from the backup process.
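
    One way to read the interplay of the three parameters is sketched below. This is an assumption-laden illustration rather than the actual implementation: the function names are hypothetical, and whether a snapshot exactly Retention days old is kept or dropped is a guess.

```python
from datetime import date


def backup_due(last_backup, today, frequency):
    """Frequency: a backup runs once every `frequency` days."""
    return (today - last_backup).days >= frequency


def snapshots_to_keep(snapshot_dates, today, retention, min_save):
    """Return the snapshot dates that survive pruning, newest first.

    Snapshots older than `retention` days are dropped, except that the
    `min_save` newest snapshots are always kept regardless of age.
    """
    newest_first = sorted(snapshot_dates, reverse=True)
    keep = [d for d in newest_first if (today - d).days <= retention]
    if len(keep) < min_save:
        # MinSave overrides Retention when too few snapshots are recent.
        keep = newest_first[:min_save]
    return keep


today = date(2024, 3, 1)
dates = [date(2023, 6, 1), date(2023, 12, 1), date(2024, 2, 1), date(2024, 2, 28)]

# Retention of 30 days keeps only the two February snapshots.
recent = snapshots_to_keep(dates, today, retention=30, min_save=1)

# With Retention of 1 day nothing is recent enough, but MinSave=3
# still preserves the three newest snapshots.
protected = snapshots_to_keep(dates, today, retention=1, min_save=3)
```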

Typical backup scenarios


    Depending on your needs, you may consider two basic backup strategies: back up most, exclude some and back up some, exclude most.

  • back up most, exclude some: Specify some top-level directory (such as your home directory) as backup root, possibly with a few exclusions. The advantage is that all changes you make to this directory (except excluded parts) will be reflected in the backup without any extra effort on your part. However, if you add some large files which you did not really intend to back up but forget to exclude them, they will be copied to the backup server and you will be charged for the space-time they occupy.
  • back up some, exclude most: Back up only one (or more) individual subdirectory of your home directory, the content of which you consider most important. To do this, you need to specify this subdirectory (rather than your entire home directory) as backup root. The advantage is that changes you make outside of the backup root will not clutter the backup. However, if any of these changes are important but you forget to copy or move them into the backup root, these changes will not be reflected in the backup.

How to purchase backup storage





  • First-time users must start by purchasing backup storage by clicking on the Purchase Backup Credit button at the bottom of the My Storage page.
  • Backup storage is purchased in 1 TB-year increments, as is our main storage. How long your purchased storage will last depends on the backup size; this is similar to the storage-quota relation (see the bottom of the main storage page for details). For example, if you purchase 1 TB-year of backup storage and your backup size is 0.5 TB, then your 1 TB-year of purchased backup storage will last 2 years. If your backup size is 2 TB, then your 1 TB-year purchase will expire after 6 months.
  • Backup storage used to date is calculated daily and reported on your My Storage page. The remaining backup storage is recomputed accordingly.
  • A default name is given to your new Backup Credit Account after you accept the purchase and an invoice is created (the name may be changed after the purchase from the status table on the My Storage page).
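
    The arithmetic behind the TB-year examples above can be written as a one-line calculation. The sketch below assumes the backup size stays constant over time, which is a simplification; the function name is hypothetical.

```python
def credit_duration_years(purchased_tb_years, backup_size_tb):
    """Years a purchase lasts if the backup size stays constant."""
    if backup_size_tb <= 0:
        raise ValueError("backup size must be positive")
    return purchased_tb_years / backup_size_tb


half_tb = credit_duration_years(1, 0.5)  # 2.0 years
two_tb = credit_duration_years(1, 2)     # 0.5 years, i.e. 6 months
```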


Specify backup root directories


  • Click on the Manage Backup button at the bottom of the My Storage page.



  • Enter the directory you want to back up (backup root) in the text box and click on the Add Directory to Backup button.
  • Use the default Server: Network Storage to specify directories which start with /home.
  • You may change "Network Storage" in the Server text box to the name of any specific server which you can access and where you have files that require backup. This typically applies to BioHPC Lab hosted servers.
  • Once a backup root directory has been added, you will be able to edit the Retention, Frequency and MinSave parameters (click the Edit button) and add or remove exclusions (click Manage Excludes). Click Stop Backup to stop backing up this directory and remove it from your list of backup roots (this operation does not affect the source directory in any way).


  • Repeat the above steps for all directories you would like backed up.

Exclude directories and/or files from backup roots


  • Click on the Manage Excludes button to list the content of the backup root directory
  • Click on the Exclude checkbox to exclude a file or a subdirectory from backup
  • Enter a subdirectory by clicking on its link, then exclude files and/or subdirectories within it, etc.
  • Exclusions can be removed by clicking on the Remove Exclude button, or by un-checking a box on the directory listing.



Checking the status of your backup account


    Once the backup root directories are configured, the My Storage page will contain the summary of your backup storage account, updated daily. Check this page regularly. You will be notified by e-mail when your purchased backup storage is about to run out. If needed, purchase more credit, or reduce the backup size by adding more exclusions or removing backup roots you no longer need (Manage Backup button).



Accessing your backup


    Backup directories are exported from the backup server and mounted on our login nodes, cbsulogin.tc.cornell.edu and cbsulogin2.tc.cornell.edu. Each user-specified backup root has a corresponding location under /backups/backup1 on both login nodes. This location reflects the owner, source server, and backup root. The picture below shows three examples, with different parts of the path color-coded for clarity.



    The first backup root is one user's home directory located on Network Storage. The second is an example of a storage group space, also located on Network Storage under /home. The last backup root is a directory located on a hosted server cbsubscb02.

    Each of these locations is, in turn, organized in current snapshot and backup snapshot directories. For example, listing the content of the first of the directories above will show output similar to



    The directory current contains the current snapshot, whereas the bak_* directories (each marked with the date) contain files changed or deleted between the date of the directory and the backup cycle preceding it. The current directory and each of the bak_* directories contain the actual files and directories being backed up; in the example above - the directory home/bukowski and its backed up content.

    The files on the backup mounts can be listed, browsed, and inspected using regular Linux commands (cd, less, cat, text editors) or the graphical File Manager tool (if connected to cbsulogin or cbsulogin2 via VNC). Access permissions are the same as those on the source directories, except that the write permission is always revoked. The same tools can be used to retrieve files from backup (just copy the files you want from the backup directories to wherever you need them).
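
    If you need to locate every stored version of a particular file, a small script can walk the current and bak_* directories for you. The sketch below is a hypothetical helper, not a tool provided by BioHPC: you pass it the backup location of your backup root (the directory containing current and the bak_* directories) and the file's path relative to a snapshot directory.

```python
from pathlib import Path


def find_versions(backup_location, relative_path):
    """List stored copies of a file: `current` first, then bak_* snapshots.

    The bak_* directories are visited in descending name order, which is
    newest first only if the date labels sort chronologically by name.
    """
    location = Path(backup_location)
    snapshots = [location / "current"]
    snapshots += sorted(location.glob("bak_*"), reverse=True)
    return [snap / relative_path
            for snap in snapshots
            if (snap / relative_path).exists()]
```

    Any of the returned paths can then be copied to a writable location, for example with shutil.copy2 or the regular cp command.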

Space considerations


    How much disk space will your backup take on the backup server? It depends on how much your source directory changes in each backup cycle. If changes (i.e., the size of files being added, deleted, or changed) are small, the bulk of the backup will be concentrated in the current snapshot, while the bak_* snapshots (each containing only changes) will be negligibly small. In such a case your backup size will be close to the size of the source directory. On the other hand, if a lot of changes are made every day, the size of each of the bak_* directories may become close to the size of the current snapshot, in which case the total size of your backup will be about [size of the source directory] × (Retention/Frequency + 1). In practice, of course, the actual size will be somewhere between these two extremes.
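
    The two extremes described above can be computed directly. The sketch below simply evaluates the formula from the text; the function name is hypothetical, and the result is a rough bound, not a billing calculation.

```python
def backup_size_bounds(source_size, retention, frequency):
    """Return (lower, upper) estimates of total backup size.

    lower: almost nothing changes, so only the current snapshot matters.
    upper: every cycle changes nearly everything, so each bak_* snapshot
           is about as large as the current snapshot.
    """
    lower = source_size
    upper = source_size * (retention / frequency + 1)
    return lower, upper


# A 1 TB directory, Retention of 30 days, backup every 3 days:
lower, upper = backup_size_bounds(1.0, retention=30, frequency=3)
# lower is 1.0 TB, upper is 11.0 TB
```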

A few words of caution


  • Avoid triggering big backup events. A big (and mostly unnecessary) backup event may happen if one or more large files are moved between subdirectories or just renamed. If a large file within the backup root is moved or renamed, it will be transferred (with the new name) to the current snapshot on the backup server, while its previous copy (with the old name, but otherwise identical) will be saved in a bak_* snapshot. The result: extra network traffic during backup and a doubled backup storage charge for that file.
  • Avoid backing up the same directory multiple times. This may happen, for example, when you back up your directory located within your lab's storage group which itself is already backed up entirely (i.e., without exclusions) by your lab manager.
  • The backup does not follow symlinks. If a directory you are backing up contains symbolic links (shortcuts) to files located elsewhere, these files will not be backed up unless they are included explicitly in some backup root directory.
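
    To check whether a directory contains symbolic links that point outside of it (and whose targets would therefore be missing from the backup), something like the following sketch may help. It is a hypothetical helper that uses only the Python standard library; run it on the source directory, not on the backup.

```python
import os


def external_symlinks(backup_root):
    """Return (link, target) pairs for symlinks pointing outside the root."""
    root = os.path.realpath(backup_root)
    found = []
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them in `dirnames`.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                target = os.path.realpath(path)
                if os.path.commonpath([root, target]) != root:
                    found.append((path, target))
    return found
```

    Each reported target would need to be covered by some backup root of its own in order to be backed up.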

My Storage page

 
