Deploying a Worker
Things to know before deploying an Analytics Worker
Once you add pg_track_events triggers to your Postgres instance, it will begin storing row changes in an outbox table called `schema_pg_track_events.event_log`. A worker then picks up these row changes, transforms them into analytics events, sends them to your destinations, and deletes them from the log.
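For example, once the triggers are installed you can check how many row changes are waiting in the outbox with an ordinary SQL query. This is just a quick sketch; only the table name comes from this page, and the connection details are up to your environment:

```bash
# Count unprocessed row changes sitting in the outbox table.
psql "$DATABASE_URL" -c "SELECT count(*) FROM schema_pg_track_events.event_log;"
```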
These workers are easy to deploy and run on any infrastructure that supports Docker. Running `pg_track_events init` creates a simple Dockerfile that pulls the image and copies in your `pg_track_events.config.yaml` file. The base image is published to GitHub Container Registry: https://github.com/tight-eng/pg_track_events/pkgs/container/pg_track_events%2Fagent

You can build and run the image locally to test it out.
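A minimal local smoke test might look like the following. The image tag and the destination API key variable name are placeholders for illustration, not names defined by this project; `DATABASE_URL` is the one variable this page requires:

```bash
# Build the worker image from the generated Dockerfile
# (it pulls the base image from GitHub Container Registry and copies in your config).
docker build -t pg-track-events-worker .

# Run it against your database. DATABASE_URL is required; the destination API key
# variable below is a placeholder -- use whatever your destinations expect.
docker run --rm \
  -e DATABASE_URL="postgres://user:pass@host:5432/mydb" \
  -e POSTHOG_API_KEY="phc_example" \
  pg-track-events-worker
```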
Important things to know
- The container will fail to start if you do not provide a `DATABASE_URL` and the API keys for the destinations you set up.
- You handle the networking. The container assumes it can reach the Postgres instance and the destinations.
- Adding a new destination won’t trigger a backfill.
- You probably only need one worker, but having more than one running won’t break anything or lead to duplicate events.
- Downtime while redeploying the container won’t cause any events to be missed. Unprocessed events remain in the outbox until a worker processes them.
- Destination outages or delivery errors prevent events from leaving the outbox, so you will not lose data. Delivery errors are tracked in the outbox, and the worker retries events with exponential backoff up to a maximum of 60 minutes; after that, events continue to be retried hourly.
- Destinations without event deduplication logic (currently just BigQuery and S3) may occasionally see duplicate records when a batch is retried and some of its events had already been written successfully. When consuming data from BigQuery or S3, deduplicate as you query or read data out: use the event name and ID for processed events, or just the ID for raw database change events. A rough sketch follows this list.
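As a rough illustration of that deduplication when reading from BigQuery, you could keep one row per event ID. The dataset, table, and column names below are placeholders for whatever your export actually contains, not names defined by pg_track_events:

```bash
# Keep one row per event id; table and column names are placeholders.
bq query --nouse_legacy_sql '
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
  FROM `my_dataset.pg_track_events`
)
WHERE rn = 1'
```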
Need something more scalable?
We're working on a worker that can stream off the WAL. It doesn't require the outbox table and will be more performant and more reliable. If you'd like to contribute or share your use case, please add a comment to this issue.