Using BigQuery for Billing Alerts
Like almost any cloud user these days, we were facing the well-known issue: how to get warned when expenses run higher than expected. Of course, you can always use the Budgets & alerts feature in the billing dashboard, but those are mostly reactive: by the time something bad is happening, the notification arrives too late. We were looking for something more sensitive. We wanted to be notified even about a slight anomaly, just to be sure the bill wouldn’t run too high. That’s why this experiment happened, and why we created a Terraform module containing the code that sends us billing alerts whenever the data stops following the overall trend.
Also, we did not need a notification for every new SKU entry in the billing; getting notified once a day was enough for us. But even that posed its challenges: not all SKUs are present at the time the evaluation runs. Some of them land in the billing export with a delay of a day, depending on the unit and how it’s billed to the customer.
Our issues were almost always caused by a SKU going above its expected value and staying there for a few days until it hit the budget limits. A check comparing the long-term average with the most recent values would do the trick. We therefore decided to compute a 14-day average for each SKU and compare it with the average of yesterday and the day before yesterday. The comparison runs every morning, so we wouldn’t lose much even if a SKU rocketed to the moon the day before.
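To make that rule concrete, the check boils down to something like the sketch below. The two-standard-deviation threshold and the function name are illustrative assumptions, not the exact code from the module:
import math

# Illustrative sketch: flag a SKU when the average cost of the last two days
# exceeds the 14-day average by more than a chosen number of standard
# deviations (the factor of 2 is an assumption, not the module's exact value).
def sku_exceeds_trend(avg_14d, var_14d, recent_costs, sigma_factor=2.0):
    recent_avg = sum(recent_costs) / len(recent_costs)
    return recent_avg > avg_14d + sigma_factor * math.sqrt(var_14d)

# Example: a SKU averaging $17.67/day with variance 2.19 that suddenly costs
# ~$25/day over the last two days would trigger the alert.
print(sku_exceeds_trend(17.67, 2.19, [24.8, 25.3]))  # True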
Computing averages over billing data exported to BigQuery is surprisingly easy with analytic functions. The exported data is already partitioned by day, so restricting the query to a date range keeps the amount of processed (and billed) data down. The SQL statement ended up looking like this:
SELECT DISTINCT sku_description,
  VAR_SAMP(sku_cost) OVER (windows_sku_description) AS var_sku_cost,
  AVG(sku_cost) OVER (windows_sku_description) AS avg_sku_cost,
  MAX(sku_cost) OVER (windows_sku_description) AS max_sku_cost,
  MIN(sku_cost) OVER (windows_sku_description) AS min_sku_cost
FROM (
  SELECT sku.description AS sku_description,
    SUM(cost) OVER (PARTITION BY sku.id, _PARTITIONTIME) AS sku_cost
  FROM `{TABLE_WITH_BILLING}`
  WHERE DATE(_PARTITIONTIME) > "{past_two_weeks}"
    AND DATE(_PARTITIONTIME) < "{yesterday}"
    AND project.id = "{GCP_PROJECT_ID}"
)
WINDOW windows_sku_description AS (
  PARTITION BY sku_description
)
ORDER BY avg_sku_cost DESC
The inner query computes the sum of costs for each SKU and day:
[
{
"sku_description": "Network Inter Region Egress from Frankfurt to Americas",
"sku_cost": "0.059226999999999995"
},
{
"sku_description": "Network Inter Region Egress from Frankfurt to Americas",
"sku_cost": "0.059226999999999995"
},
{
"sku_description": "Network Inter Region Egress from Frankfurt to Americas",
"sku_cost": "0.059226999999999995"
},
...
The problem is that each row is repeated for every underlying SKU entry, because the window SUM defined by the PARTITION BY clause doesn’t collapse rows the way GROUP BY would. That’s why the outer SELECT DISTINCT computes the variance, average, maximum and minimum for each SKU:
[
{
"sku_description": "N1 Predefined Instance Core running in Frankfurt",
"var_sku_cost": "2.185961637128035",
"avg_sku_cost": "17.668332591494845",
"max_sku_cost": "19.780804",
"min_sku_cost": "14.276247999999999"
},
{
"sku_description": "vCPU Time Streaming Frankfurt",
"var_sku_cost": "0.3603937414788865",
"avg_sku_cost": "9.798346789414413",
"max_sku_cost": "10.607520999999998",
"min_sku_cost": "8.719346"
},...
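Executing the statement from Python is straightforward with the official google-cloud-bigquery client; a minimal helper along these lines is enough (fetch_sku_stats is an illustrative name, not the exact code from the module):
from google.cloud import bigquery

def fetch_sku_stats(sql):
    """Run the statistics query shown above and return its rows as plain dicts."""
    client = bigquery.Client()  # picks up the Cloud Function's default credentials
    return [dict(row) for row in client.query(sql).result()]

# Each returned dict then carries sku_description, var_sku_cost, avg_sku_cost,
# max_sku_cost and min_sku_cost for one SKU.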
With the SQL in place, everything got packaged into Python code and deployed to Cloud Functions. For warnings we went with a simple Slack channel; with a Slack webhook configured, sending a message to it is a one-liner:
requests.post(SLACK_HOOK, json={'text': error_msg})
For a few projects, we prefer to notify support directly. The whole on-call stack is managed by OpsGenie, so we created a new OpsGenie API integration and submit the messages like this:
requests.post(
    OPSGENIE_ENDPOINT,
    headers={
        'Content-Type': "application/json",
        'Authorization': f'GenieKey {OPSGENIE_TOKEN}'
    },
    json={
        "message": f'Billing alert on {GCP_PROJECT_ID}',
        "description": error_msg,
        "priority": "P3"
    }
)
Setting up a cron job in GCP can be done with Cloud Scheduler. Don’t forget that the call to the Cloud Function should be authorized; the HTTP call from the scheduler can use an OIDC token:
resource "google_cloud_scheduler_job" "job" {
...
http_target {
http_method = "GET"
uri \= google_cloudfunctions_function.function.https_trigger_url
oidc_token {
service_account_email = google_service_account.invoker.email
}
}
}
The whole Terraform module is available on GitHub. In case you are wondering about the Python code, it’s part of the module as well.
What could be done in the future
As mentioned in the repository, the SA key is just an input variable: Terraform uploads the key file directly into the Cloud Function source code. That means almost anyone with access to the Cloud Functions in the project can steal the key file. The thing is that we couldn’t do it differently: the customer did not grant permissions to our SA, they just gave us the key JSON file. The next step should be to at least load the key from Secret Manager.
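For illustration, loading the key from Secret Manager could look roughly like the sketch below; the secret name, the helper and the way the credentials are built are assumptions, not something the module does today:
import json

from google.cloud import secretmanager
from google.oauth2 import service_account

def load_sa_credentials(project_id, secret_id):
    """Fetch the SA key JSON from Secret Manager instead of bundling it (sketch)."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    payload = client.access_secret_version(request={"name": name}).payload.data
    return service_account.Credentials.from_service_account_info(json.loads(payload))

The resulting credentials object could then be passed to clients such as bigquery.Client(credentials=...).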
We are aware that a Cloud Function checking the billing is not the smartest way to get alerted. The better approach would probably be to classify outliers and work with those. BigQuery also supports machine learning functions, so we could predict future values with regression and get the warning much sooner. But for our use case, a simple Cloud Function is enough.
Hopefully you found something interesting here. In case I messed up something I haven’t understood correctly, please let me know.