gcp

BigQuery Remote Functions in Go

2022-02-28

A BQ Remote Function that just does something simple:

for each batch of rows, a Cloud Function will calculate the HMAC-SHA256 using a passphrase provided by the function caller.

While its simple, the important thing to note is you can run pretty much any transformation on the source BigQuery data (decrypt, encrypt, encode, convert to XML, etc) using code that runs in cloud functions.

Just be careful that at BQ scale that the functions are relatively simple and quick (i,e minimize external dependencies and services inside the function!).

this feature is in private preview (as of 2/28/22); running this tutorial requires allowlist/enablement.

bq remote functions

Setup


You can find the source here


First create the cloud function and test

export GCLOUD_USER=`gcloud config get-value core/account`
export PROJECT_ID=`gcloud config get-value core/project`
export PROJECT_NUMBER=`gcloud projects describe $PROJECT_ID --format='value(projectNumber)'`

gcloud functions deploy hmac_sha256 \
  --runtime go116 \
  --entry-point HMAC_SHA256 \
  --region=us-central1 \
  --trigger-http
# Allow unauthenticated invocations of new function [remote_concat]? (y/N)?  N

gcloud functions add-iam-policy-binding hmac_sha256 \
  --member="user:$GCLOUD_USER" \
  --role="roles/cloudfunctions.invoker"

Now test directly using curl just to validate the function works.

Specifically, the expected HMAC base64 encoded value with the following plaintext and key will be

export key="key"
export plaintext="The quick brown fox jumps over the lazy dog"
echo -n  "$plaintext" | openssl dgst -sha256 -hmac "$key"  -hex -r | xxd -r -p | base64

97yD9DBThCSxMpjmqm+xQ+9NWaFJRhdZl0edvC0aPNg=

so, if we POST the following payload that mimics what BQ would send, we’ll see the appropriate rows hashed to what we expect:

{
    "requestId": "124ab1c",
    "caller": "//bigquery.googleapis.com/projects/myproject/jobs/myproject:US.bquxjob_5b4c112c_17961fafeaf",
    "sessionUser": "test-user@test-company.com",
    "userDefinedContext": {
     "key1": "value1",
     "key2": "v2"
    },
    "calls": [
     ["The quick brown fox jumps over the lazy dog", "key"],
     ["The quick brown fox jumps over the lazy dog", "bar"],
     ["foo", "bar"],
     ["bar", "bar"],
     ["foo", "bar"],     
     ["The quick brown fox jumps over the lazy dog", "key"],
     ["The quick brown fox jumps over the lazy dog", "bar"]
    ]
}

the output will be in order of the rows sent in (i.e., we expect the hashes of (row1, row6) to match as well as (row2, row7) and finally (row3, row5)

curl -s  -X POST  \
  -H "Authorization: Bearer `gcloud auth print-identity-token`" \
  -H "Content-Type: application/json"  \
  -d @req.json \
  "https://us-central1-$PROJECT_ID.cloudfunctions.net/hmac_sha256" | jq '.'


{
  "replies": [
    "97yD9DBThCSxMpjmqm+xQ+9NWaFJRhdZl0edvC0aPNg=",
    "4Gwg9rSBeSWMLz7yltK5ujS3fXz6C1eSottgyOMajjY=",
    "FHkzIYqqvAuLEKKzpcNGhMjZQ0G88QpHNtxycPd0GFE=",
    "ltlEFLFUq3pBoeLIS9b6WIdM1N6LFrFrFNsuMZDVMww=",
    "FHkzIYqqvAuLEKKzpcNGhMjZQ0G88QpHNtxycPd0GFE=",
    "97yD9DBThCSxMpjmqm+xQ+9NWaFJRhdZl0edvC0aPNg=",
    "4Gwg9rSBeSWMLz7yltK5ujS3fXz6C1eSottgyOMajjY="
  ]
}

…so now back to bigquery..

Insert a bigquery table with the key and plaintext

printf '%s\n' 'The quick brown fox jumps over the lazy dog' > /tmp/file.csv
gsutil mb -l us-central1  gs://$PROJECT_ID-bq
gsutil cp /tmp/file.csv gs://$PROJECT_ID-bq/

bq mk -d --data_location=US mydataset1

bq load     --autodetect     --source_format=CSV     mydataset1.mytable     gs://$PROJECT_ID-bq/file.csv plaintext:STRING

bq query --nouse_legacy_sql  "
SELECT
  plaintext
FROM
  $PROJECT_ID.mydataset1.mytable
"

bq mk --connection --display_name='myconn' --connection_type=CLOUD_RESOURCE \
        --project_id=$PROJECT_ID --location=US my-connection

bq show --location=US --connection  my-connection


export BQ_CONN_SVC_ACCOUNT=`bq show --format=json --location=US --connection  my-connection  | jq -r '.cloudResource.serviceAccountId'`
echo $BQ_CONN_SVC_ACCOUNT

# now allow the service account to call the function
gcloud functions add-iam-policy-binding hmac_sha256 \
  --member="serviceAccount:$BQ_CONN_SVC_ACCOUNT" \
  --role="roles/cloudfunctions.invoker"

# create the connection
 bq --format=json query --dataset_id=$PROJECT_ID:mydataset1 --location=US --nouse_legacy_sql  "
  CREATE FUNCTION hmac_sha256(plaintext STRING, key STRING) RETURNS STRING 
    REMOTE WITH CONNECTION \`$PROJECT_ID.us.my-connection\`
    OPTIONS (endpoint = 'https://us-central1-$PROJECT_ID.cloudfunctions.net/hmac_sha256',  user_defined_context = [('mode', 'hash')] )
"

# now use the function
bq query --nouse_legacy_sql  "
SELECT
  plaintext, mydataset1.hmac_sha256(plaintext,'key') as hmac_sha256
FROM
  $PROJECT_ID.mydataset1.mytable
"
+---------------------------------------------+----------------------------------------------+
|                  plaintext                  |                 hmac_sha256                  |
+---------------------------------------------+----------------------------------------------+
| The quick brown fox jumps over the lazy dog | 97yD9DBThCSxMpjmqm+xQ+9NWaFJRhdZl0edvC0aPNg= |
+---------------------------------------------+----------------------------------------------+

You can see the invocation in cloud functions logs directly:

images/deid.png

For reference on javascript UDF, see BigQuery HMAC-SHA256 User-defined Function


Also see


This site supports webmentions. Send me a mention via this form.