1. Client Configuration and Backup

1.1. Overview

The sysadmin controls the configuration of the system to-be-backed-up, which is termed generically as the storage server. This guide assumes that you have completed the client installation and have your organization’s customized aws-settings.yaml in the appropriate location

Overview of Tasks

  1. Localize your server so that it properly can refer to AWS-defined elements

  2. Configure jobs.yaml to reflect your backup jobs

  3. Initial testing of your backup configuration

  4. Install cron entries to regularly perform sync and top-up backups

  5. Start the initial seeding of your backup

1.2. Localize the Storage Server

This step needs to be done in lockstep with the cloudadmin. The cloudadmin on boards and transmits critical username and password information. The cloudadmin and sysadmin need to have identical copies of aws-settings.yaml.

Systems are defined by:
owner
system name

Attention

The sysadmin and cloudadmin must use the identical words for owner and system name. For the examples, this guide will use panteater as the owner and labstorage as the system name.

To enroll your system to backup into AWS, the sysadmin uses the localize.py script.

cd $RCS3_ROOT/rcs3/POC/sysadmin
./localize.py panteater labstorage
Enter AWS Access Key:
Enter AWS Secret Access Key:

The cloudadmin will provide you the AWS Access Key and Secret Access Key. These are generated specifically for your server when the cloudadmin on boards your server.

The following four files are written only if they do not already exist. You can redo the localization by removing any subset of these files for which you need to re-localize:

$RCS3_ROOT/rcs3/POC/config/credentials
$RCS3_ROOT/rcs3/POC/config/rclone.conf
$RCS3_ROOT/rcs3/POC/config/weekly-backup
$RCS3_ROOT/rcs3/POC/config/daily-backup

The credentials file holds long-term username/password so that rclone can interact with your server-specific backup-bucket. Both credentials and rclone.conf have permissions changed so that only the owner (usually root) can access them.

Note

The cloudadmin can regenerate credentials for the specific AWS service account that performs the backup. If these credentials are lost (or compromised), the backup can still be made accessible.

Note

Credentials are rotated automatically after the completion of every backup job. In this sense long-term credentials are valid from the conclusion of the previous backup job through the completion of the current backup job. These structure prevents credentials from expiring during an active backup session.

1.3. Create jobs.yaml

While rclone is the workhorse software that performs that backup, the program gen-backup.py is used to handle some of the more arcane command-line parameters, create rclone filters to select and exclude files/directories from backup, runs rclone itself, and optionally notifies the sysadmin of start/completion of sync backup jobs.

Backup jobs are defined in the file config/jobs.yaml, which does NOT exist on a first time install. The very first step is to copy a template jobs.yaml file and then edit to reflect your specific server configuration:

cd $RCS3_ROOT/rcs3/POC
cp templates/jobs.yaml config/jobs.yaml

The file config/jobs.yaml (or just jobs.yaml) is excluded from git so that your local changes can never be overwritten by a git pull (update of RCS3 itself). The following template is an example file:

## This is a sample jobs.yaml the defines four backup jobs on a single path
## Job names must be distinct (not checked, yet)
## Path is common to the jobs relative to the path
---
srcpaths:
- path: /datadir
  ## Local decision to exclude .git subdirs
  exclude_global:
    - .git/**

  ## Patterns from a file to exclude
  exclude_file: common_excludes.yaml

  jobs:
    - name: backup1
      subdirectories:
        - DataImages

    - name: backup2 
      subdirectories:
        - commondata 

jobs.yaml is yaml-formatted with all of the specialized-formatting requirements. There should be no tabs in the file and indentation is very specific.

Let’s describe the major portions of the file

path: /

The indicates the jobs defined below have included directories relative to this path.

exclude_global:

A list of rclone-compatible filter specifications that will be excluded from every backup job. In the case, a local decision is made to ignore the contents of all .git subdirectories.

exclude_file: common_excludes.yaml

This file in config/common_excludes.yaml is list of rclone filters to excludes common patterns that should never be backed up. This file can be updated by a git pull. If you would like your own version of common_excludes.yaml, copy to a new file name in the config/ directory and then change jobs.yaml to reflect your customized version.

jobs:

relative the path above, you can create multiple backup jobs. Job names (the name: key) need to be unique among all backup jobs for this server. There are sound reasons to define multiple backup jobs. For example, the file system has many files and practicality demands breaking-up the backup into more manageable chunks.

subdirectories:

This is a bullet list of subdirectories in include. In the sample /volume1 is being backed up as job backup1

excludes:

This is a bullet list of patterns (defined as rclone filters) to not backup.

1.4. Initial Testing of jobs.yaml

list

It’s always a good to test if jobs.yaml is syntactically correct and looks reasonable. The list command in gen-backup.py will provide the set of jobs that will be run:

$RCS3_ROOT/rcs3/POC/sysadmin/gen-backup.py list
rcs3config /.rcs3/rcs3/POC/config
backup1 /
This indicates two backup jobs:
rcs3config
backup1

The first job is implicit so that the jobs.yaml file is recorded in AWS. The second job (backup1) is the name of the job defined explicitly and indicates that the path to be backed up is /. Note the details of what will included in backup1 are not included in this brief listing.

detail

This command gives the full detail of the rclone filter that will be applied and the rclone command that would be executed:

$RCS3_ROOT/rcs3/POC/sysadmin/gen-backup.py detail
rcs3config /.rcs3/rcs3/POC/config
== filter contents (output to: /tmp/rcs3config.filter) ==
+ jobs.yaml
- **
== command ==
rclone --config /.rcs3/rcs3/POC/config/rclone.conf \
--s3-shared-credentials-file /.rcs3/rcs3/POC/config/credentials \
--metadata --links --transfers 2 --checkers 32 --log-level INFO \
--log-file /tmp/rcs3config.log --filter-from /tmp/rcs3config.filter sync \
/.rcs3/rcs3/POC/config s3-backup:rcs3config/.rcs3/rcs3/POC/config
=============
backup1 /
== filter contents (output to: /tmp/backup1.filter) ==
- .git/**
- .zfs/**
- .snapshot/**
- .vscode/**
- .DS_Store/**
- #snapshot/**
- #recycle/**
- @eaDir/**
- .plist/**
- .strings
- .cprestoretmp.*
- .part
- .tmp
- .cache/**
- .Trash*/**
- Google/Chrome/.*cache.*
- Google/Chrome/Safe Browsing.*
- iPhoto Library/iPod Photo Cache/**
- Mozilla/Firefox/.*cache.*
- Music/Subscription/.*
+ volume1/**
- **
== command ==
rclone --config /.rcs3/rcs3/POC/config/rclone.conf \
--s3-shared-credentials-file /.rcs3/rcs3/POC/config/credentials \
--metadata --links --transfers 2 --checkers 32 --log-level INFO \
--log-file /tmp/backup1.log --filter-from /tmp/backup1.filter \
sync / s3-backup:backup1/
=============

There are some key items to take note:

  • /tmp/backup1.filter - is the generated rclone filter file, and its contents are displayed.

  • + volume1/** - is the entry that shows path volume1 to include under /.

  • - ** - is the entry that specifies all other top-level files and directories under / to be excluded.

  • - .tmp - are all other exclusions to apply to any level in the selected folders.

  • == command == - is the

  • == command == - is the final stanza shows the full rclone command that would be executed when the backup job backup1 actually runs.

    For the backup1 job, sync / s3-backup:backup1/ is the command given to rclone where

    • / - is the source path

    • s3-backup:backup1/ is the destination

    • s3-backup is an rclone remote defined when localize.py was executed.

Note

rclone’s log of when it runs is shown with the –log-file (e.g. /tmp/backup1.log) argument.

1.5. Install Cron Entries

The templates/crond.sample is starting point that should be customized to your desires a sample below with lines broken up for readability:

# Run a full sync Sunday (Day 0) and 1am
0 1 * * 0 /.rcs3/rcs3/POC/sysadmin/weekly-backup &

# run top syncs M-Sa (Days 1-6)at 1am
0 1 * * 1-6 /.rcs3/rcs3/POC/sysadmin/daily-backup &

This is the first exposure to sync vs top-up backups and the difference is critical to containing cost and improving performance. When localize.py was executed the weekly and daily backup scripts were created. The scripts can be edited to modify parameters to gen-backup.py

Weekly-backup:

The weekly-backup contains a line very similar to:

/.rcs3/rcs3/POC/sysadmin/gen-backup.py --threads=2 --checkers=32 \
 --owner=panteater --system=labstorage run > /var/log/gen-backup.log 2>&1

The weekly backup is a sync

sync

This compares the contents of the server and the backup. Updated/New files are uploaded. Deleted files are removed from the backup. This translates to an AWS API call (head) for every single file in the backup. For time efficiency, rclone can have multiple outstanding “check requests” in flight. That number is governed by the –checkers. For backups with tens of millions of files, setting to larger number –checkers=128 results in roughly 2000 file checks/second (about 1M checks/10 minutes)

Daily-backup:

The daily-backup looks very similar but one important difference

/.rcs3/rcs3/POC/sysadmin/gen-backup.py --threads=2 --checkers=32 \
--owner=panteater --system=labstorage --top-up=24h run >> /var/log/gen-backup.log 2>&1

Daily adds the parameter –top-up:

top-up

This scans the local file system only for any new/changed files in the top-up window (24 hours in the example) Deleted files are NOT removed from the backup. This is inexpensive because (1) only new data is uploaded (2) the head API call of sync is not made on all existing files.

The two sample cron entries have comments as to day and time-of-day that sync (once per week) and top-up (6 days/week) will run. A simple lock file is written to ensure that two versions of gen-backup.py do not run at the same time.

Note that the following two items in the sync entry should have been changed when you localized the storage server

  • –owner=panteater. Change panteater to the owner of the system being backed up

  • –system=labstorage. Change labstorage to the name of the system being backed up

1.5.1. Install the Crontab

Execute the command

crontab -e

and paste the contents of the sample crontab setup. You can change time and days of the week that your backups run. Please see the crontab man page for more details on the format and meaning of cron entries.

1.6. Run the Initial Backup

You could stop at step 5.1 above and simply wait until cron performed its first sync, but that is not recommended. Instead, run the sync version of the backup

/.rcs3/rcs3/POC/config/weekly-backup &

You can follow the progress of the backup by tailing rclone’s log file, e.g:

tail -f /tmp/backup1.log

Attention

It can take days to weeks to seed the first backup. The length of time depends on file system performance, network connectivity to AWS, total volume of data, and total number of files to backup. The rclone log file shows transfer performance every minute. You can use this to estimate expected duration.

1.7. Advanced Options

In this section we describe two advanced options: using rclone directly and client-side encryption.

1.7.1. Using Rclone Directly

gen-backup.py ultimately spins off rclone via python’s subprocess module. Calling gen-backup.py with the rclone argument will print out rclone command and all flags utilized. E.g.:

./gen-backup.py rclone
rclone --config /.rcs3/rcs3/POC/config/rclone.conf \
--s3-shared-credentials-file /.rcs3/rcs3/POC/config/credentials \
--metadata --links --transfers 2 --checkers 32

You could cut and paste this directly, but a more convenient method is shown in the example below where rclone’s listremotes command is used:

$(./gen-backup.py rclone) listremotes
s3-backup:
s3-crypt:
s3-inventory:
s3-native:

You can now use any rclone command but should only limit to commands that make no changes. A particularly convenient command is serve http so that you could use a web browser to view what is stored in the backup.

It is recommended that you only serve to localhost and use an alternate port. An example of serving data to localhost over port 8080 in a read-only manner:

$(./gen-backup.py rclone) serve http --read-only --addr localhost:8080 s3-backup:

Point your browser to http://localhost:8080 to view the contents of your backup. If you are familiar with ssh tunneling, it’s not difficult to view remotely.

Windows

The windows installation uses fully-localized versions of git, python, aws, and rclone. RCS3 provides the wrapper Powershell script rclone.ps1. The listremotes example above in Powershell looks like:

./rclone.ps1 listremotes
s3-backup:
s3-crypt:
s3-inventory:
s3-native:

In the examples above, you can replace $(./gen-backup.py rclone) with ./rclone.ps1

1.7.2. Client Side Encryption

Data is stored in S3 in such a way that cloudadmins can view (through download) the contents of any file stored in S3. At most universities, existing policy bars them from doing that without the data owners knowledge or consent. Some data use agreements might demand that data be encrypted on the backup so that only the encryption key holder could view the unencrypted contents.

To address concerns that might arise from the above (or any other rationale for storing data in an encrypted form in the backup), the sysadmin can secure all file content data prior to transmission by defining a private encryption key. There are some important facts when using a private encryption key:

  • The private key is only known to the holder. If it is lost, no one can assist in recovery. The private key is not known by cloudadmins. Private key owners must backup their key

  • The private key cannot be rotated without re-uploading new versions of files. Rclone will encrypt a file prior to uploading it into S3.

  • Encrypted and unencrypted files can co-exist in the same backup bucket

Setup is a few steps:

In the examples below use your system specific method for invoking rclone directly. The examples, when appropriate show the Linux variant. The assumption is that you are in the sysadmin directory.

1. Define encryption key

Use rclone natively to define an encryption key on the s3-crypt endpoint.

$(./gen-backup.py rclone) config update s3-crypt --all
Take defaults for all questions, have rclone generate the password and the salt,
do NOT edit advanced configuration. The rclone page on crypted remotes provides details.
Remember to record both the generated password and salt password

Warning

Save the passwords that were generated in a safe place like Bitwarden or 1Password.
If you lose this password, no one can restore your data.
2. Recommended: Rename your backup job.

Edit jobs.yaml and adjust the job name to reflect that a particular backup job is encrypted. Choose a name like backup1-encrypted.

The following shows the changed contents of an example jobs.yaml to rename the existing backup job backup1 to backup1-encrypted:

Before editing

After editing

jobs:
  - name: backup1
    subdirectories:
      - Users/phili/Documents
jobs:
  - name: backup1-encrypted
    subdirectories:
      - Users/phili/Documents
3. Change the remote that gen-backup.py uses.

Add the --endpoint=s3-crypt argument to your invocation of gen-backup-py command to override the default of s3-backup. Don’t forget to update your crontab (or Windows Scheduled Task) entries.

With the steps above, your data will be encrypted at the source with the passwords that were generated.

Here’s an example session:

$(./gen-backup.py rclone) lsd s3-backup:             (1)
        0 1999-12-31 16:00:00        -1 backup1
        0 1999-12-31 16:00:00        -1 rcs3config
gen-backup.py  --endpoint=s3-crypt run               (2)
=== rcs3config sync started at 2024-04-18 15:43:54.933541
=== backup1-encrypted sync started at 2024-04-18 15:43:54.933541
=== rcs3config completed at 2024-04-18 15:43:56.029525
=== backup1-encrypted completed at 2024-04-18 15:44:42.001963
All tasks completed.
$(./gen-backup.py rclone) lsd s3-backup:             (3)
        0 1999-12-31 16:00:00        -1 backup1-encrypted
        0 1999-12-31 16:00:00        -1 backup
        0 1999-12-31 16:00:00        -1 rcs3config
(1) Listing the top-level directories of the backup prior to performing an encrypted backup
(2) Backup data in an encrypted manner
(3) Listing the top-level directories of the backup after the performing the encrypted backup

Note

Please notice the dates of 1999-12-31 on the directories. This is an artifact of S3 in that directories are not objects but are just string prefixes with no metadata. Rclone is building support for full metadata on directories at the expense of storing another object.