Deploying a Worker
Things to know before deploying an Analytics Worker
Once you add pg_track_events triggers to your Postgres instance, it will begin storing row changes in an outbox table called `schema_pg_track_events.event_log`. A worker then picks up these row changes, transforms them into analytics events, sends them to your destinations, and deletes them from the log.
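For example, once the triggers are installed you can check how many row changes are waiting in the outbox with an ordinary SQL query. This is just a quick sketch; only the table name comes from this page, and the connection details are up to your environment:

```bash
# Count unprocessed row changes sitting in the outbox table.
psql "$DATABASE_URL" -c "SELECT count(*) FROM schema_pg_track_events.event_log;"
```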
These workers are easy to deploy and run on any infrastructure that supports Docker. Running `pg_track_events init` creates a simple Dockerfile that pulls the image and copies in your `pg_track_events.config.yaml` file. The base image is published to GitHub Container Registry: https://github.com/tight-eng/pg_track_events/pkgs/container/pg_track_events%2Fagent

You can build and run the image locally to test it out.
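A minimal local smoke test might look like the following. The image tag and the destination API key variable name are placeholders for illustration, not names defined by this project; `DATABASE_URL` is the one variable this page requires:

```bash
# Build the worker image from the generated Dockerfile
# (it pulls the base image from GitHub Container Registry and copies in your config).
docker build -t pg-track-events-worker .

# Run it against your database. DATABASE_URL is required; the destination API key
# variable below is a placeholder -- use whatever your destinations expect.
docker run --rm \
  -e DATABASE_URL="postgres://user:pass@host:5432/mydb" \
  -e POSTHOG_API_KEY="phc_example" \
  pg-track-events-worker
```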
Important things to know
- The container will fail to start if you do not provide a `DATABASE_URL` and the API keys for the destinations you set up.
- You handle the networking. The container assumes it can reach the Postgres instance and the destinations.
- Adding a new destination won’t trigger a backfill.
- You probably only need one worker, but having more than one running won’t break anything or lead to duplicate events.
- Downtime while redeploying the container won’t cause any events to be missed. Unprocessed events remain in the outbox until a worker processes them.
- Destination outages or delivery errors prevent events from leaving the outbox, so you will not lose data. Delivery errors are tracked in the outbox, and the worker retries events with exponential backoff up to a maximum of 60 minutes; after that, events continue to be retried hourly.
- Destinations without event deduplication logic (currently just BigQuery and S3) may occasionally see duplicate records when a batch is retried and some of its events had already been written successfully. When consuming data from BigQuery or S3, deduplicate as you query or read data out: use the event name and ID for processed events, or just the ID for raw database change events. A rough sketch follows this list.
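As a rough illustration of that deduplication when reading from BigQuery, you could keep one row per event ID. The dataset, table, and column names below are placeholders for whatever your export actually contains, not names defined by pg_track_events:

```bash
# Keep one row per event id; table and column names are placeholders.
bq query --nouse_legacy_sql '
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
  FROM `my_dataset.pg_track_events`
)
WHERE rn = 1'
```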
Need something more scalable?
We're working on a worker that can stream off the WAL. It doesn't require the outbox table and will be more performant and more reliable. If you'd like to contribute or share your use case, please add a comment to this issue.