
Announcing Cainophile

Posted at — Jul 25, 2019

Today I’m announcing Cainophile, a library to assist you in building change data capture (CDC) systems in Elixir. With Cainophile, you can quickly and easily stream every change made to your PostgreSQL database, with no plugins, Java, or Zookeeper required.
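To give a feel for what that looks like, here is a rough sketch of wiring the Postgres adapter into an application's supervision tree. The module and option names below are assumptions for illustration, so check the README for the real configuration; the point is that the whole setup is a single child spec pointing at your database and a publication:

```elixir
# In your application's supervision tree (e.g. lib/my_app/application.ex).
# NOTE: the option names here are illustrative assumptions -- consult the
# Cainophile README for the exact configuration keys.
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      {Cainophile.Adapters.Postgres,
       # name to register the replication process under (assumption)
       register: MyApp.CainophilePublisher,
       # connection settings passed through to the underlying Postgres client
       epgsql: %{host: 'localhost', username: "postgres", database: "my_app", password: "secret"},
       # logical replication slot and publication to stream from
       slot: "cainophile",
       publications: ["my_app_changes"]}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```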

Why Cainophile

Cainophile was created with small companies and startups in mind. These companies often don’t have the engineering time needed to operate a proper change data capture system built on a popular tool like Debezium, which typically involves standing up a Zookeeper cluster for configuration and a Kafka cluster for storing and publishing the messages. On top of the complexity of those tools, you often need to compile a plugin like wal2json from source and make it available as a shared library on your database server, which typically means running your own database server and giving up your cloud provider’s hosted database offering. Finally, Elixir (and Erlang) excel at dealing with streaming, event-based data, so a native CDC library seemed like a useful tool for the community to have.

Why capture changes?

Capturing changes may be a little new to readers who don’t have experience with data engineering or enterprise systems, but there are some really good reasons to do it.

Auditing

One of the reasons most relevant to smaller companies is auditing. With CDC (and Cainophile), you can easily keep track of everything a user changes in your system and store it forever. Every record touch, every login time (assuming you track “last logged in”), every name change, and so on is captured and available for archiving or real-time monitoring. This is useful for security and forensic purposes, but it can also help with debugging or with recovering from a software bug that corrupts data.
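As a sketch of the idea, the process below appends every transaction it receives to an append-only log file. I’m assuming here that the adapter delivers each committed transaction to subscribed processes as a single message; the exact subscription call and struct names are in the Cainophile docs, and a real system would write to durable storage rather than a local file:

```elixir
defmodule MyApp.AuditLogger do
  @moduledoc """
  Minimal audit-trail sketch: every transaction message this process
  receives is appended, one line per transaction, to a local log file.
  """
  use GenServer

  @log_path "audit.log"

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # Open the audit log in append mode so history is never overwritten.
    {:ok, file} = File.open(@log_path, [:append, :utf8])
    # NOTE (assumption): subscribe this process to the replication adapter here,
    # using whatever subscription function Cainophile exposes.
    {:ok, %{file: file}}
  end

  @impl true
  def handle_info(transaction, state) do
    # Each message is assumed to be one committed transaction; timestamp it
    # and append its full contents for later forensics or debugging.
    line = "#{DateTime.to_iso8601(DateTime.utc_now())} #{inspect(transaction)}\n"
    IO.write(state.file, line)
    {:noreply, state}
  end
end
```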

Analytics

While a discrete event log is the preferred way to do event-based analytics, it’s not always feasible to instrument your app that way, whether because of time constraints or because you don’t control the code producing the data. And unless you’ve built your app around event sourcing, you will lose any changes or events you didn’t think to track ahead of time. Using change data capture for analytics lets you rest easy, knowing that if the business asks a question about anything that changes in the database, you can answer it.

Indexing

Changes can also be used to maintain a secondary index or cache of your data, such as an Elasticsearch cluster. With Cainophile, you can write some simple integration code that transforms and sends data to Elasticsearch in real time, or use it to maintain an ETS table that a Phoenix channel queries to push real-time updates over WebSockets, as sketched below.
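For the ETS case, a sketch might look like the following. I’m assuming each change carries the table it belongs to and a map of column values including an "id"; adjust the pattern matches to the structs Cainophile actually emits:

```elixir
defmodule MyApp.RecordCache do
  @moduledoc """
  Keeps an ETS table with the latest version of each row, keyed by
  {table, id}. A Phoenix channel (or anything else) can call get/2
  to serve real-time views.
  """
  use GenServer

  @table :record_cache

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Public read API, usable directly from a Phoenix channel process.
  def get(table, id) do
    case :ets.lookup(@table, {table, id}) do
      [{_key, record}] -> record
      [] -> nil
    end
  end

  @impl true
  def init(_opts) do
    # Public, read-concurrent table so readers don't go through this process.
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    # NOTE (assumption): subscribe this process to Cainophile's change stream here.
    {:ok, %{}}
  end

  @impl true
  def handle_info(%{changes: changes}, state) do
    # Assumed shape: each change has a :table name and a :record map of columns.
    Enum.each(changes, fn
      %{table: table, record: %{"id" => id} = record} ->
        :ets.insert(@table, {{table, id}, record})

      _other ->
        :ok
    end)

    {:noreply, state}
  end

  def handle_info(_other, state), do: {:noreply, state}
end
```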

What’s next

Cainophile is brand new and untested. It needs better documentation on using it with deployments other than localhost, and all of the bug fixes that come with time in production. I also want to add type-casting capabilities: currently the column values are just strings, because that is how Postgres sends them over the wire. This is more involved than it might seem, as it will require porting and maintaining Postgres’s text-to-native type conversion code, or calling it directly via a C extension. Once Cainophile is in a production-ready state, I plan on publishing some common connectors, such as one that publishes changes directly to a BigQuery table or pushes them to a PubSub exchange. Finally, I would like to investigate using native Erlang/OTP clustering to provide high availability, similar to what Debezium achieves with Zookeeper.
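To illustrate what that type casting might look like on a small scale, here is a hypothetical helper (my own sketch, not code in the library) that converts a handful of common Postgres text representations into Elixir terms and passes everything else through unchanged. The hard part, of course, is doing this correctly for every type Postgres supports:

```elixir
defmodule MyApp.TypeCast do
  @moduledoc """
  Hypothetical sketch of casting Postgres text-format column values
  (as delivered over logical replication) into native Elixir terms.
  Only a few common types are handled; unknown types stay as strings.
  """

  # NULLs are assumed to arrive as nil regardless of type.
  def cast(_type, nil), do: nil

  # Integer types: int2/int4/int8 are plain decimal text.
  def cast(type, value) when type in ["int2", "int4", "int8"], do: String.to_integer(value)

  # Booleans are sent as "t" / "f".
  def cast("bool", "t"), do: true
  def cast("bool", "f"), do: false

  # Floating point types; values like "NaN" fall back to the raw string.
  def cast(type, value) when type in ["float4", "float8"] do
    case Float.parse(value) do
      {float, _rest} -> float
      :error -> value
    end
  end

  # Timestamps come through as e.g. "2019-07-25 12:34:56", which
  # NaiveDateTime.from_iso8601/1 accepts.
  def cast("timestamp", value) do
    case NaiveDateTime.from_iso8601(value) do
      {:ok, ndt} -> ndt
      _error -> value
    end
  end

  # Everything else is left as the raw string Postgres sent.
  def cast(_type, value), do: value
end

# Examples:
# MyApp.TypeCast.cast("int4", "42")                        #=> 42
# MyApp.TypeCast.cast("bool", "t")                         #=> true
# MyApp.TypeCast.cast("timestamp", "2019-07-25 12:00:00")  #=> ~N[2019-07-25 12:00:00]
```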
