Rclone Storage bucket sync using Cloud Scheduler and Cloud Run

2020-11-10

Basic tutorial ty synchronize contents of two Object Storage buckets using Rclone on Google Cloud.

The specific implementation here uses GCS Buckets as the source and destination but it would be relatively simple extension to other backend storage providers. For example, an AWS S3 configuration as source bucket is shown in the Appendix section.

This repository is NOT supported by Google.

This utility has not been tested at scale. Since the synchronization iterates files, it will not scale with very large numbers of objects or where each file transfer exceeds the default timeout of Cloud Run (15mins). However, there are techniques you can use to shard the iteration using Cloud Tasks across various Cloud Run instances. This is described at the end of the tutorial.


You can find the source here

Rclone Storage bucket sync using Cloud Scheduler and Cloud Run

Architecture

This tutorial will setup Rclone source and destination configurations on Cloud Run.

Cloud Scheduler is configured make authenticated http triggers on Cloud Run to initiate synchronization

images/arch.png

For more information, see Automatic OIDC: Using Cloud Scheduler, Tasks, and PubSub to make authenticated calls to Cloud Run, Cloud Functions or your Server

The Cloud Run instance uses GCS as the only Rclone FileSystem provider, in server.go:

	// Configure the source and destination
	fs.ConfigFileSet("gcs-src", "type", "google cloud storage")
	fs.ConfigFileSet("gcs-src", "bucket_policy_only", "true")

As mentioned, you can configure other source and destinations Rclone supports as well.

Setup

  1. Configure Service Accounts Cloud Run and Cloud Scheduler:
export PROJECT_ID=`gcloud config get-value core/project`
export PROJECT_NUMBER=`gcloud projects describe $PROJECT_ID --format="value(projectNumber)"`
export REGION=us-central1

export RSYNC_SERVER_SERVICE_ACCOUNT=rsync-sa@$PROJECT_ID.iam.gserviceaccount.com
export RSYNC_SRC=$PROJECT_ID-rsync-src
export RSYNC_DEST=$PROJECT_ID-rsync-dest

gcloud iam service-accounts create rsync-sa --display-name "RSYNC Service Account" --project $PROJECT_ID

export SCHEDULER_SERVER_SERVICE_ACCOUNT=rsync-scheduler@$PROJECT_ID.iam.gserviceaccount.com

gcloud iam service-accounts create rsync-scheduler --display-name "RSYNC Scheduler Account" --project $PROJECT_ID
  1. Configure source and destination GCS Buckets

Configure Uniform Bucket Access Policy

gsutil mb gs://$RSYNC_SRC
gsutil mb gs://$RSYNC_DEST

gsutil iam ch serviceAccount:$RSYNC_SERVER_SERVICE_ACCOUNT:objectViewer gs://$RSYNC_SRC
gsutil iam ch serviceAccount:$RSYNC_SERVER_SERVICE_ACCOUNT:objectViewer,objectCreator,objectAdmin gs://$RSYNC_DEST
  1. Build and deploy Cloud Run image

The server.go as an extra secondary check for the audience value that the Cloud Scheduler sends. This is not a necessary step since Cloud Run checks the audience value by itself automatically (see Authenticating service-to-service).

This secondary check is left in to accommodate running the service on any other platform.

To deploy, we first need to find out the URL for the Cloud Run instance.

First build and deploy the cloud run instance (dont’ worry about the AUDIENCE value below)

docker build -t gcr.io/$PROJECT_ID/rsync  .

docker push gcr.io/$PROJECT_ID/rsync

gcloud beta run deploy rsync  --image gcr.io/$PROJECT_ID/rsync \
  --set-env-vars AUDIENCE="https://rsync-random-uc.a.run.app" \
  --set-env-vars GCS_SRC=$RSYNC_SRC \
  --set-env-vars GCS_DEST=$RSYNC_DEST \
  --region $REGION  --platform=managed \
  --no-allow-unauthenticated \
  --service-account $RSYNC_SERVER_SERVICE_ACCOUNT

Get the URL and redeploy

export AUDIENCE=`gcloud beta run services describe rsync --platform=managed --region=$REGION --format="value(status.address.url)"`

gcloud beta run deploy rsync  --image gcr.io/$PROJECT_ID/rsync \
  --set-env-vars AUDIENCE="$AUDIENCE" \
  --set-env-vars GCS_SRC=$RSYNC_SRC \
  --set-env-vars GCS_DEST=$RSYNC_DEST \
  --region $REGION  --platform=managed \
  --no-allow-unauthenticated \
  --service-account $RSYNC_SERVER_SERVICE_ACCOUNT

Configure IAM permissions for the Scheduler to invoke Cloud Run:

gcloud run services add-iam-policy-binding rsync --region $REGION  --platform=managed \
  --member=serviceAccount:$SCHEDULER_SERVER_SERVICE_ACCOUNT \
  --role=roles/run.invoker
  1. Deploy Cloud Scheduler

First allow Cloud Scheduler to assume its own service accounts OIDC Token:

envsubst < "bindings.tmpl" > "bindings.json"

Where the bindings file will have the root service account for Cloud Scheduler:

  • bindings.tmpl:
{
  "bindings": [
    {
      "members": [
        "serviceAccount:service-$PROJECT_NUMBER@gcp-sa-cloudscheduler.iam.gserviceaccount.com"
      ],
      "role": "roles/cloudscheduler.serviceAgent"
    }    
  ],
}

Assign the IAM permission and schedule the JOB to execute every 5mins:

gcloud iam service-accounts set-iam-policy $SCHEDULER_SERVER_SERVICE_ACCOUNT  bindings.json  -q

gcloud beta scheduler jobs create http rsync-schedule  --schedule "*/5 * * * *" --http-method=GET \
  --uri=$AUDIENCE \
  --oidc-service-account-email=$SCHEDULER_SERVER_SERVICE_ACCOUNT   \
  --oidc-token-audience=$AUDIENCE

Verify

Test by uploading a set of files to RSYNC_SRC bucket and in 5minutes check RSYNC_DEST for the same files. For the impatient, you can click “Run Now” button on Cloud Scheduler Cloud Console screen.


Appendix

The following section details some options you have with respect to embedding Secrets for using other providers like AWS S3

Filtered and Prefix Transfer

Rclone configurations can be configured for path-based filters on source and destinations.

For example, the following Rclone CLI will transfer files of type to a destination prefix:

Sharded Transfers with Filters

Filtered transfer can also be employed to shard the input file sequences. That is, if you have 100K objects to synchronize which are lexically evenly distributed, you can setup prefixed transfers based on the distribution range of the files (i.e., transfer files with prefix “a*”, then “b*” and so on). You can dispatch each shard on its own new Cloud Run service through Cloud Tasks.

In this mode, the architecture is

Cloud Scheduler -> Cloud Run (dispatch to N Shards) -> N(Cloud Tasks) -> N(Cloud Run(prefix) -> N(Read Prefix->Write Prefix)

You can find a similar Sharding technique here: Yet another image file converter on GCP

Embedding Secrets

If you need to embed Storage object secrets like AWS ACCESS_KEY_ID values within Cloud Run, consider using Google Cloud Secret Manager. You will need to first upload the secret, give IAM permissions to read it to Cloud Run.

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

gcloud beta secrets create access_key_id \
  --replication-policy=automatic \
  --data-file=$AWS_ACCESS_KEY_ID --project $PROJECT_ID -q

gcloud beta secrets add-iam-policy-binding access_key_id \
  --member=serviceAccount:$RSYNC_SERVER_SERVICE_ACCOUNT \
  --role=roles/secretmanager.secretAccessor  \
  --project $PROJECT_ID -q  


gcloud beta secrets create secret_access_key \
  --replication-policy=automatic \
  --data-file=$AWS_SECRET_ACCESS_KEY --project $PROJECT_ID -q

gcloud beta secrets add-iam-policy-binding secret_access_key \
  --member=serviceAccount:$RSYNC_SERVER_SERVICE_ACCOUNT \
  --role=roles/secretmanager.secretAccessor  \
  --project $PROJECT_ID -q  

Event-based File Synchronizing

The default configuration in this repo uses time based synchronization based on triggers from Cloud Scheduler. You can also re-factor server.go to be event driven.

For example you can trigger synchronization if a file in the Source Bucket has been modified:

Via A) GCS Object Change Notification

or

B) A source system which emits Cloud Events

To use Cloud Run with a system that emits Cloud Events, you may need to “convert” the inbound request to an Event.

The following describes GCP PubSub messages and raw AuditLogs parsing as both HTTP handlers or HTTP Handlers to Cloud Events:

Configure AWS Source/Destinations

To configure Rclone for S3, see Amazon S3 Storage Provider.

Specifically, configure the startup of the rclone cloud run instance to use the AWS S3 access_key_id and secret_access_key.

You can bootstrap the Cloud Run instance using GCP Secrets

For example, if you have an AWS Bucket, first verify that you can access the bucket using a sample before deploying to Cloud Run

  • rclone.conf
[gcs-src]
type = google cloud storage
bucket_policy_only = true

[aws-src]
type = s3
provider = AWS
env_auth = false
access_key_id = ...
secret_access_key = ...
region = us-east-2
$ ./rclone ls gcs-src:mineral-minutia-820-rsync-dest
$

$ ./rclone ls aws-src:mineral-minutia
      411 README.md

$ ./rclone sync  aws-src:mineral-minutia gcs-src:mineral-minutia-820-rsync-dest

$ ./rclone ls gcs-src:mineral-minutia-820-rsync-dest
      411 README.md

Once you have verified the above, map the rclone.conf file settings into server.go

This site supports webmentions. Send me a mention via this form.