gcp

BlobZapper: Deleting 1M files on GCS in 20 minutes

2022-03-11

This is a small golang script that uses the Storage Batch API to delete files or update an object’s metadata in GCS.

Instead of processing each file individually, the batch API allows you to combine up to 100 separate operations on different objects into one API call.
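
For context, a batch call is a single multipart/mixed POST where each part is itself a miniature HTTP request. Roughly (the bucket name, boundary and Content-IDs below are made up for illustration), a two-object delete looks like this:

POST /batch/storage/v1 HTTP/1.1
Host: storage.googleapis.com
Content-Type: multipart/mixed; boundary=batch_foo

--batch_foo
Content-Type: application/http
Content-ID: <item1>

DELETE /storage/v1/b/some-bucket/o/1 HTTP/1.1

--batch_foo
Content-Type: application/http
Content-ID: <item2>

DELETE /storage/v1/b/some-bucket/o/2 HTTP/1.1

--batch_foo--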

The Storage batch API client library is only available in Python and Java (see com.google.cloud.storage.StorageBatch), but in this example we will adapt jfcote87@'s Go implementation of the genericBatch API.

Note: Google has officially announced it is Discontinuing support for JSON-RPC and global HTTP batch endpoints, but it seems the Storage API may be exempt(?). This is just my speculation as of 3/11/22 after reading:

A batch request is homogenous if the inner requests are addressed to the same API, even if addressed to different methods of the same API. 

Homogenous batching will still be supported but through API specific batch endpoints.

I’m guessing this is ok since GCS uses the following service-specific endpoint "https://storage.googleapis.com/batch/storage/v1"

Who knows…this is just a demo, unsupported… caveat emptor

you know where this is from; i read the red color was an allusion to communist paranoia back then..


Anyway, this script intercepts the normal GCS outbound HTTP calls using a custom http.RoundTripper.

The RoundTripper reads the headers and body of the request the client intends to send to GCS, but instead of actually putting anything on the wire, it writes that request into a separate batch request’s multipart payload.
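
The repo does this with an adaptation of jfcote87@'s batch code; purely to show the shape of the trick, here is a minimal sketch of such a RoundTripper (i’m keeping the wrapped name used in the snippet below, but the field names, the canned 200 response and the multipart wiring here are my assumptions, not the actual implementation):

import (
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "net/textproto"
    "strings"
)

// wrapped is a sketch of an intercepting RoundTripper: it never sends the
// request it is handed; it copies that request into the multipart body of
// an outer batch request and returns a canned 200 so the GCS client
// library believes the per-object call succeeded.
type wrapped struct {
    next   http.RoundTripper // real transport, used later for the actual batch POST
    writer *multipart.Writer // accumulates the multipart/mixed batch payload
}

func (w wrapped) RoundTrip(req *http.Request) (*http.Response, error) {
    // each intercepted call becomes one "application/http" part
    part, err := w.writer.CreatePart(textproto.MIMEHeader{
        "Content-Type": {"application/http"},
    })
    if err != nil {
        return nil, err
    }
    // serialize the inner request line; a fuller version would also copy headers
    fmt.Fprintf(part, "%s %s HTTP/1.1\r\n\r\n", req.Method, req.URL.RequestURI())
    if req.Body != nil {
        if _, err := io.Copy(part, req.Body); err != nil {
            return nil, err
        }
    }
    // fake a success so obj.Delete()/obj.Update() return without error
    return &http.Response{
        Status:     "200 OK",
        StatusCode: http.StatusOK,
        Proto:      "HTTP/1.1",
        ProtoMajor: 1,
        ProtoMinor: 1,
        Header:     http.Header{"Content-Type": {"application/json"}},
        Body:       io.NopCloser(strings.NewReader("{}")),
        Request:    req,
    }, nil
}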

What that means is that to delete a file, you initialize the client and call the API as normal. You do not need to provide real credentials to this client (hence the foo token below):

hc, err := google.DefaultClient(ctx)
if err != nil {
    fmt.Printf("could not create default client: %v", err)
    return
}
// swap in the intercepting RoundTripper so every GCS call lands in the batch payload
hc.Transport = wrapped{hc.Transport, writer}
storageClient, err := storage.NewClient(ctx,
    option.WithHTTPClient(hc),
    option.WithTokenSource(oauth2.StaticTokenSource(&oauth2.Token{
        AccessToken: "foo",
        Expiry:      time.Now().Add(24 * time.Hour),
    })))
if err != nil {
    fmt.Printf("could not create storage client: %v", err)
    return
}
defer storageClient.Close()

Once you configure a client, just call the delete function:

Delete Objects

// to delete: each obj.Delete() is intercepted by the RoundTripper and
// appended to the current batch instead of being sent on its own
bkt := storageClient.Bucket(*bucketName)
for _, filename := range set {
    obj := bkt.Object(filename)
    err := obj.Delete(ctx)
    if err != nil {
        fmt.Printf("error fake delete %v\n", err)
        return
    }
}

Update Metadata

Similarly, to update metadata, just call the library function as normal:

// to update metadata
objectAttrsToUpdate := storage.ObjectAttrsToUpdate{
	Metadata: map[string]string{
		"k1": "v1",
	},
}
if _, err := obj.Update(ctx, objectAttrsToUpdate); err != nil {
	fmt.Printf("error updating local handler: %v\n", err)
	return
}

The per-object DELETE or PATCH request is intercepted and written as one of the parts of the batch multipart request.

All this happens in goroutines, in increments that try to stay under the batch API’s limit (100 individual requests per batch) and the suggested default maximum write rate per bucket (1000 object writes per second).
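
As a rough sketch of that pacing idea (the constants, the sendBatch helper and deleteAll here are my own illustration, not the repo’s actual main.go):

import (
    "context"
    "time"

    "golang.org/x/sync/errgroup"
)

const (
    batchSize    = 100  // batch API limit: max operations per batch request
    writesPerSec = 1000 // suggested default per-bucket object write rate
)

// deleteAll chunks the object names into batches of <=100, fires each chunk
// as one batch call in its own goroutine, and paces submissions so the
// overall rate stays near writesPerSec deletions per second.
func deleteAll(ctx context.Context, names []string, sendBatch func(context.Context, []string) error) error {
    // one 100-object batch every ~100ms works out to ~1000 object writes/second
    ticker := time.NewTicker(time.Second * batchSize / writesPerSec)
    defer ticker.Stop()

    g, ctx := errgroup.WithContext(ctx)
    for start := 0; start < len(names); start += batchSize {
        end := start + batchSize
        if end > len(names) {
            end = len(names)
        }
        chunk := names[start:end]
        <-ticker.C // pace before launching the next batch
        g.Go(func() error { return sendBatch(ctx, chunk) })
    }
    return g.Wait()
}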

So…

Setup

Create a GCS bucket and 1M files

what? you don’t have a million empty files just lying around?

export PROJECT_ID=`gcloud config get-value core/project`
gsutil mb -l US-CENTRAL1 gs://$PROJECT_ID-batch-regional-us

mkdir files
cd files
for i in `seq 1 1000000`; do echo "" > ./$i; done

(better would be to create files as sha256($i) to help prevent any potential hotspotting…but i’m lazy)

Create upload.sh and upload in increments of 10K …this’ll take quite some time…

like…really a lot of time..you could start it off, see it go to 15%, then go to this winery on your day off, get asked for your id there to prove you’re over 21, share a bottle (pretty good merlot), then come back, and still see it at 65%

…i must add:

…i don’t think you or i or anyone has comprehension of large numbers

…i think that comes from our nature where we only need to comprehend the numbers at the scale of seeing predators/prey and apples, bananas..not anything beyond…it becomes abstract, meaningless

…just watch the inserts scroll by

…then i wander to think of the numbers as $ and think of them as disparity you’d see all around if social justice is your thing.

…or if math/science is your thing, just try watching Graham’s number. it’s painful, humbling and existential (both the previous line and this are)…

i digress…

back down to earth, you could also shard the upload into N scripts in parallel but it’s still subject to the default 1000 writes/second limit on a bucket…i’m lazy

#!/usr/bin/bash

# upload in chunks of 10,000 filenames per gsutil -m cp invocation
arrVar=()
for n in {1..1000000}
do
    arrVar+=($n)
    if (( $n % 10000 == 0 ))
    then
        gsutil -m cp ${arrVar[*]} gs://$PROJECT_ID-batch-regional-us
        arrVar=()
    fi
done

Wipeout

Edit main.go, specify your bucket name and go scorched earth

go run main.go --bucketName=$PROJECT_ID-batch-regional-us

Benchmarks

I ran the code on 1M files and saw

  • start: Fri Mar 11 08:49:43 PM EST 2022
  • end: Fri Mar 11 09:08:13 PM EST 2022

like 20 minutes (18m30s, if you’re counting)

gsutil ls gs://$PROJECT_ID-batch-regional-us

# empty... yeah, that's what you expect

but not quite…did you really expect that much precision out of 1M

this is what i got

$ gsutil ls gs://$PROJECT_ID-batch-regional-us | wc -l
66

so… a 66/1000000 (~0.0066%) error rate in 20 mins, right?

not bad..

if you really wanted to, you can catch, log and retry the failures offline in code..
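
for instance, if the batch-response parser stashed the object names from any non-2xx parts into a failed slice (an assumption on my part — the demo only logs them), a second, un-wrapped client could sweep them up one at a time:

// retry the leftovers with an ordinary storage client, i.e. one whose
// Transport is NOT the intercepting RoundTripper, so these deletes really
// do go out individually
plain, err := storage.NewClient(ctx)
if err != nil {
    log.Fatalf("could not create retry client: %v", err)
}
defer plain.Close()

bkt := plain.Bucket(*bucketName)
for _, name := range failed { // failed: names collected from non-2xx batch parts
    if err := bkt.Object(name).Delete(ctx); err != nil && err != storage.ErrObjectNotExist {
        log.Printf("retry delete %s: %v", name, err)
    }
}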

fwiw, the failures were GCS server-side errors: Error status code We encountered an internal error. Please try again.

eg:

batch [7796], Response code: 200
Deleting filesset batch 7806
batch [7797], Response code: 200
Deleting filesset batch 7807
batch [7701], Response code: 200
Error status code We encountered an internal error. Please try again.
batch [7703], Response code: 200
batch [7798], Response code: 200
Error status code We encountered an internal error. Please try again.
batch [7707], Response code: 200
Error status code We encountered an internal error. Please try again.
Deleting filesset batch 7808
batch [7760], Response code: 200
batch [7799], Response code: 200
Deleting filesset batch 7809
batch [7711], Response code: 200
Error status code We encountered an internal error. Please try again.
batch [7800], Response code: 200
batch [7727], Response code: 200
Error status code We encountered an internal error. Please try again.
batch [7718], Response code: 200
Error status code We encountered an internal error. Please try again.
batch [7740], Response code: 200
Error status code We encountered an internal error. Please try again.
batch [7756], Response code: 200
Deleting filesset batch 7810
batch [7801], Response code: 200
batch [7779], Response code: 200
batch [7763], Response code: 200
Deleting filesset batch 7811
batch [7762], Response code: 200
Deleting filesset batch 7812


# and as you'd expect, the leftovers cluster within the failed batches

$ gsutil ls gs://$PROJECT_ID-batch-regional-us
gs://$PROJECT_ID-batch-regional-us/770078
gs://$PROJECT_ID-batch-regional-us/770265
gs://$PROJECT_ID-batch-regional-us/770583
gs://$PROJECT_ID-batch-regional-us/770610
gs://$PROJECT_ID-batch-regional-us/770718
gs://$PROJECT_ID-batch-regional-us/770770
gs://$PROJECT_ID-batch-regional-us/770860
gs://$PROJECT_ID-batch-regional-us/770869
gs://$PROJECT_ID-batch-regional-us/771042
gs://$PROJECT_ID-batch-regional-us/771295
gs://$PROJECT_ID-batch-regional-us/771298
gs://$PROJECT_ID-batch-regional-us/771630
gs://$PROJECT_ID-batch-regional-us/771660
gs://$PROJECT_ID-batch-regional-us/771683
gs://$PROJECT_ID-batch-regional-us/771712
gs://$PROJECT_ID-batch-regional-us/772209
gs://$PROJECT_ID-batch-regional-us/772281
gs://$PROJECT_ID-batch-regional-us/772334
gs://$PROJECT_ID-batch-regional-us/772358
gs://$PROJECT_ID-batch-regional-us/772367
gs://$PROJECT_ID-batch-regional-us/772682
gs://$PROJECT_ID-batch-regional-us/772723
gs://$PROJECT_ID-batch-regional-us/772906
gs://$PROJECT_ID-batch-regional-us/773962
gs://$PROJECT_ID-batch-regional-us/774617
gs://$PROJECT_ID-batch-regional-us/774642
...

You can see each delete if you happen to have GCS audit logs enabled.


Appendix

Benchmark with just 10K files

Note, i just used 1000000 because i wanted to test with 1M files. If all you want to do is test with, say, 10000 files, just create and upload 10K files. You will also want to edit main.go and set totalFiles = 10000:

for i in `seq 1 10000`; do echo "" > ./$i; done
gsutil -m cp * gs://$PROJECT_ID-batch-regional-us

# edit main.go and set totalFiles = 10000
go run main.go --bucketName=$PROJECT_ID-batch-regional-us

Using predictable filenames

Some Notes… i only tested the deletes and, in a way, cheated: i seeded the GCS bucket with predictable filenames: [1, 2, 3, 4, 5, ..., 999999, 1000000]. This allowed me to just iterate over the filenames and issue delete operations easily (i.e. without listing the existing objects).
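
e.g., the set of names to delete can be built without a single list call (totalFiles being the constant mentioned above):

// build the delete set directly: the object names are literally "1".."1000000"
set := make([]string, 0, totalFiles)
for i := 1; i <= totalFiles; i++ {
    set = append(set, strconv.Itoa(i))
}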


this is what i did on my day off…
