
Batch vs. event-driven operational analytics

Sarah Krasnik, November 09, 2021

Sarah Krasnik is a data consultant by day and data blogger, advisor, and generally curious data person by night. Previously a data engineer at Perpay, she has since transitioned to an advisory board member for Census and Superconductive. Philadelphia, Pennsylvania, United States

In this article, you'll learn the ins and outs of deciding between batch vs. event-driven operational analytics. I'll break down:

  • Scheduled data upkeep vs action-driven events
  • Considerations when using batch vs event orchestration systems
  • How to prioritize stakeholder needs in your architecture

When you send your data directly to third-party tools, you allow business users (read: your marketers, salespeople, etc.) to use information directly where they spend time day-to-day. This is a really powerful data flow and one that democratizes the use of consistent, high-quality data.

If you’re already on board with this idea, you’ve probably identified how you’ll collect click and pageview event data. You’ve also scoped out the reverse ETL tool you’ll use to load batch data into a destination tool. Now you’re left with an important architectural decision that will shape the way business users engage with your data: How and when to send your data.

The million-dollar question: What does this decision between batch and event systems entail, and what are its impacts? Let’s discuss.

Scheduled data upkeep vs. action-driven events

Workflow orchestrators like Airflow, Dagster, and Prefect have become integral to modern data stacks. These tools run processes such as dbt queries or reverse ETL syncs on a set schedule, anywhere from once per day to every 15 minutes.
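To make the scheduling half concrete, here's a minimal sketch of what a batch pipeline might look like in Airflow. The DAG name, the dbt selector, and the trigger_sync() placeholder for kicking off a reverse ETL sync are illustrative assumptions, not a specific vendor integration.

```python
# A minimal, illustrative Airflow DAG: rebuild dbt models, then trigger a
# reverse ETL sync. All names here are assumptions for the sake of example.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def trigger_sync() -> None:
    # Placeholder: call your reverse ETL tool's API here to kick off a sync.
    print("triggering reverse ETL sync")


with DAG(
    dag_id="batch_operational_analytics",
    start_date=datetime(2021, 11, 1),
    schedule_interval="*/15 * * * *",  # every 15 minutes; could be daily instead
    catchup=False,
) as dag:
    build_models = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --select crm_leads",
    )
    sync_to_crm = PythonOperator(
        task_id="trigger_reverse_etl_sync",
        python_callable=trigger_sync,
    )

    build_models >> sync_to_crm  # only sync once the models are rebuilt
```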

Here’s the main difference between batch data upkeep and event-driven systems:

  • In batch data upkeep, syncs are run on a schedule to keep a resulting dataset up to date.
  • Event-driven systems maintain a stream of events that occur over time.

It always comes down to formatting. The fundamental difference between batch and event-driven systems is the format of the data in the target tool: a dynamic dataset updated in place with changes (batch) versus an ever-growing historical timeline of events that is only appended to (event-driven).
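As a rough illustration (the field names here are made up), the same lead might look like this in each format:

```python
# Batch: one record per lead, kept up to date in place on every scheduled sync.
lead_record = {
    "email": "jane@example.com",
    "phone": "+1-555-0100",            # overwritten if the lead changes it
    "last_site_interaction": "2021-11-08",
    "ad_source": "Google",
}

# Event-driven: an append-only timeline; rows are only ever added, never updated.
lead_events = [
    {"event": "Page Viewed", "page": "/pricing", "timestamp": "2021-11-07T14:02:00Z"},
    {"event": "Page Viewed", "page": "/demo", "timestamp": "2021-11-08T09:15:00Z"},
    {"event": "Link Clicked", "url": "/signup", "timestamp": "2021-11-08T09:16:30Z"},
]
```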

Batch systems tend to be built on top of common data tools, while event orchestration occurs client-side on a website and therefore must be fairly ingrained in the engineering tech stack. Common tools for event orchestration include Segment, Amplitude, and Snowplow, among many others.

For illustration purposes, consider a typical CRM tool like Salesforce or Hubspot. A CRM contains lead contact information like name, email, and phone number. Additional data could include the last date this lead interacted with your site and the ad platform (Facebook, Google, etc.) the lead was sourced from. For example, if a lead changed their phone number, the next data sync would update this crucial piece of information.

By contrast, event-driven systems will send the specific event occurrence to the target tool. Continuing with the same example, a CRM tool likely also contains all pages viewed by the lead. Is the person interested in your paid or free offering? Is the person engaging with your product or blog pages? Has the lead contacted you in the past, and what did they say? If a person clicks on a link, the click event would (fairly quickly) show up in Salesforce.
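For a sense of what emitting those events looks like, here's a sketch using Segment's Python library (the client-side case described above would use analytics.js in the browser instead). The write key, user ID, and the downstream mapping into Salesforce are assumptions for this example.

```python
# Sketch only: each track() call appends one event to the lead's timeline,
# which the destination tool (e.g. a CRM) can then surface or act on.
import analytics  # Segment's analytics-python library

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"  # hypothetical key

analytics.track("lead_123", "Page Viewed", {"path": "/demo"})
analytics.track("lead_123", "Link Clicked", {"url": "/signup"})
```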

This CRM example is just one of many use cases where the issue of batch vs. event-driven systems comes into play. Let’s dive into what business use cases would be most applicable in each system design, as well as the resources needed to implement it.

Considerations when using batch vs. event orchestration systems

Generally speaking, three primary considerations determine whether you should go the batch or the event-driven route:

  1. The team within your organization
  2. A desire to update historical data (or not)
  3. Business use case

Let me break down the considerations and which system is best for each.

The team within your organization

Event orchestration is usually integrated within a web application, while batch uploads are easily configured against a data warehouse or data lake. Each system requires its own expertise. Ask yourself: Is your company full of data engineers or front-end developers? If the former, batch uploads may be the easiest route; if the latter, you may want to go with event orchestration.

However, event orchestration gets tricky once you try to account for all the caveats. For instance, anyone implementing or using the data should be aware of ad blockers and their effect on what client-side tracking actually captures.

With batch jobs, there’s more stress on the underlying data than on the actual reverse ETL implementation. This means your analytics team is tasked with formatting and understanding what data needs to be sent.

A desire to update historical data (or not)

If an organization’s concept of a customer changes, they might need a bulk update to make sure the leads in the CRM are the contacts the sales team wants to focus on. In a batch system, any update to underlying data would bulk update the destination tool; this type of change would be a walk in the park.

However, by definition, event data is append-only. In non-tech speak, this means historical events never go away. If a lead viewed the /blog page before it was renamed to /community, the sales team would have to account for both cases when targeting a group of leads.
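A small sketch of what that means in practice, using the page names from the example above:

```python
# Because event data is append-only, a page rename lives on in history forever:
# downstream logic has to match both the old and the new path.
events = [
    {"event": "Page Viewed", "page": "/blog"},       # recorded before the rename
    {"event": "Page Viewed", "page": "/community"},  # recorded after the rename
]

community_readers = [e for e in events if e["page"] in ("/blog", "/community")]

# In a batch system, the underlying model would simply be rewritten and the
# destination rows bulk-updated on the next scheduled sync.
```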

Consider how often existing data changes. If it changes often, batch updates will be easier to sustain automatically.

It all comes down to the use case.

Event data makes triggering emails on a particular action extremely straightforward. For example, consider a sales strategy that automatically triggers an email three hours after someone views a demo page, with the hope of engaging the lead while their interest is high. Both approaches make this easy to implement: the event data just needs to contain a /demo page view event, and the batch data a last_demo_page_viewed_time field.

However, what if the sales strategy changes to three hours after someone views the community page instead of the demo page? With event data, sales can update this logic directly in Salesforce, since all the events are readily available. In the batch scenario, the analytics team would first need to add a field for the last time the community page was viewed to make this possible.
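Here's a rough sketch of the difference; send_email() and the data shapes are hypothetical placeholders, not any particular tool's API:

```python
from datetime import datetime, timedelta, timezone

FOLLOW_UP_DELAY = timedelta(hours=3)
now = datetime.now(timezone.utc)


def send_email(to: str, template: str) -> None:
    print(f"sending {template} to {to}")  # stand-in for the real email tool


# Event-driven: the page of interest is just a string the sales team can edit.
events = [{"page": "/demo", "email": "jane@example.com", "viewed_at": now - timedelta(hours=4)}]
for event in events:
    if event["page"] == "/demo" and now - event["viewed_at"] >= FOLLOW_UP_DELAY:
        send_email(event["email"], template="follow_up")

# Batch: the same rule depends on a precomputed column, so switching the trigger
# to the community page means the analytics team first adds and backfills a new
# field (e.g. last_community_page_viewed_time).
lead = {"email": "jane@example.com", "last_demo_page_viewed_time": now - timedelta(hours=4)}
if now - lead["last_demo_page_viewed_time"] >= FOLLOW_UP_DELAY:
    send_email(lead["email"], template="follow_up")
```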

When triggering actions based on particular events, sending those same events directly to the target tool will almost always be easier for the stakeholder.

At the end of the day, the decision comes down to who uses your data (and how)

With continuous syncing, the decision is less a question of how quickly the data can be updated and more a question of how fresh end users realistically need the data to be. Both reverse ETL tools like Census and event orchestration tools like Segment can operate with near real-time updates or on a delay.

Let’s recall the reason we’re syncing to destination tools: to make stakeholders’ jobs easier.

Keeping that in mind, talk to your stakeholders about the workflows they want to build. Ask your sales team whether they want to send emails on a schedule to a population of users meeting criteria at a single point in time (easily handled by a batch system), or on a rolling basis triggered by a particular action (better suited to events).

The last thing you want to do is format the data in a way that's fundamentally unusable by the stakeholder, in which case they'll just continue to make manual updates. That's not convenient for anyone.

Piqued your interest? Check out the Census Airflow Provider and sign up for a free trial to start experimenting.

Have any other questions about event orchestration systems? I’m happy to chat on Twitter or LinkedIn.
