When we started Census in 2018, we didn't just want to build a great data integration tool. We wanted to bring the power of software engineering practices to the world of ops & data. I think of the evolution of the data ecosystem through this lens, whether that's our industry building new tools or our customers discovering new processes and techniques. What's obvious is that despite how much we write about data here, it's still early for our ecosystem, which is growing every day.

This theme wove its way through the first episode of The Sequel Show, a conversation about data on Clubhouse that I’m going to host & publish on a regular basis going forward. Each episode, I’ll host some of the most interesting people in the data and analytics world to nerd out on fun data problems, cool products, and general trends.

I couldn’t think of a better way to start the series than to host the CEOs of the companies that make up the modern data stack:

In this first episode, we discussed why we built our companies, the past, present, and future of data warehouses, what data teams will look like in the future, and how all of this will affect what we do every day.

Like I said earlier, we kept coming back to how much has changed in a very short amount of time over the past few years, and how much change is ahead. One of the most interesting topics we covered is how the way data teams work has to change the more you consume data. I often use the analogy of DevOps here.

“When you start using something like Census, that means that data is going to be live in a lot more systems, so it’s bad if you screw up. There’s a lot more responsibility there, and data teams will be responsible for testing, validating and making sure it's of high quality. That's what software teams do relatively well. We've learned so much about how to ship systems on the internet, and we have to learn how to do the same for data.”

George said this change – and Fivetran helping to enable it – was one of the most inspiring parts of his job.

“My favorite thing about the work that we're all doing is that we're expanding the scope of what you can do if you know SQL,” he said. “We're allowing people who know SQL to do a lot of the things that previously you could have only done if you were a software engineer. I think that's a really great mission – bringing that power of software engineering to a larger group of people.”

Be sure to follow me on Twitter where I'll post show related news or follow @borisj on Clubhouse to join the next live session so you can drop in, listen and ask questions.

You can listen to the full initial episode embedded below.

And with that, I think we just launched a podcast too? Follow us on:


Full Episode Transcript

Tristan Handy:
Why did we invite the streaming guy?

Boris Jabes:
That's true. I mean, he did muscle his way into this conversation. So Arjun, yeah, I think in the Mensa quiz of what doesn't fit in the set, you are the outlier.

George Fraser:
[crosstalk 00:00:19]. We've offended him. Sorry.

Arjun Narayan:
No.

Boris Jabes:
That was fast George. That was really fast.

Arjun Narayan:
There was supposed to be some tough love and it ended up just being tough. Oh, man.

Boris Jabes:
People that don't know, George can be actually quite cutting. So don't worry. That's just par for the course. It's because he's the most grizzled entrepreneur in the bunch. That's not necessarily a compliment, George. You just have more scars, I think, than all of us.

George Fraser:
I don't know. Tristan's got some serious scars. Tristan's got some RJMetrics and Stitch scars. I think I might've given Tristan a couple of the scars.

Tristan Handy:
That's true. That is totally fair. I was trying to think of the last time I woke up in the middle of the night sweating and thinking about work, and it actually wasn't that long ago. It was the day that the Enterprise Tech 30 came out, and I was just like, "This is the last thing I need right now."

Boris Jabes:
Yeah, that was definitely an interesting... I did get a great question. This is just a good realization of what we live in, in tech versus the real world. Old family friends, people that knew me when I was born, emailed my parents asking, "Does this mean I can buy shares in Census?" And that's not a question I ever expected to get. That's what happens when they put a picture of you on a billboard in New York.

Tristan Handy:
Yeah, exactly.

Boris Jabes:
Arjun, are you there? Okay. All right. Well, Arjun, you are-

Arjun Narayan:
I'm back. I was having some internet issues, but I heard George making fun of me and then it dropped.

George Fraser:
I was just teasing you Arjun, you know I love Materialize.

Arjun Narayan:
We'll see. Should we get going?

Boris Jabes:
... yeah, I think we should.

Arjun Narayan:
I want to do a quick round of intros. We may have... By naming it the way we named it, inadvertently got some folks in here who may not quite know what they've gotten themselves into.

Boris Jabes:
That's true. This is the successor to the Ones Best Show, and now it's the Sequel to that show. Boris Jabes is how you say my last name. I'm the founder at a company called Census that loves to move data around and make it more useful. And I get the pleasure of working across the metaphorical aisle from Tristan and George every day.

Tristan Handy:
Hi, I'm Tristan. I'm the CEO and founder of Fishtown Analytics. We make a product called dbt, and we sit smack in the middle of the modern data stack diagram that folks love to draw.

George Fraser:
And I'm George. I am co-founder and CEO of a company called Fivetran. We are the bottom layer of the modern data stack. We are the plumbers who pipe all the data from all of your systems of record, like your Salesforce, your NetSuite, your SQL database, all those zillions of things you have inside of your business, into your data warehouse, where you do amazing things with it, with products from companies like Census and Fishtown Analytics.

Arjun Narayan:
Hi everyone, I'm Arjun Narayan. I'm the co-founder and CEO of Materialize. Materialize simplifies application development with streaming data. While traditional data warehouses work with batch data that is updated relatively infrequently, Materialize does similar things with SQL, the same standard SQL that one would write against your data warehouse. The difference is Materialize takes in the data as it is changing, live, and maintains those query results as the underlying input data changes.

Tristan Handy:
Hey, Arjun, I'm really curious, maybe it's because we're thinking about our own product marketing stuff right now. But you used a phrase right there, you said "simplifies application development." And I just know that if somebody asked me what Materialize is, and I think I do more marketing for you folks than most humans on the planet these days [crosstalk 00:05:01], I would've never said "simplifies application development." Why do you use those words? I would've said "makes modern analytics faster," things like that.

Arjun Narayan:
Yeah, this is great. I mean, honestly, I would like you to go first. This is an interesting tack of questioning. Can you just say a little bit more?

Tristan Handy:
Sure. Okay. So let's say that I was going to talk about Materialize. I would say Materialize transforms the modern data stack into a streaming paradigm, or I would talk about operational, real-time analytics. But it sounds like you're thinking about it from a software engineer's or application developer's perspective, and maybe I'm getting your persona wrong.

Arjun Narayan:
Yeah. I mean, I think there's this inherent tension in Materialize that I think is coming out here, which is fundamentally what we technically do is we enable users, developers, to get the answers to OLAP queries on OLTP timescales. So traditionally, if you said OLAP query, everyone has already made the assumption that they're going to have to walk away, grab lunch, grab a cup of coffee, and then come back, and maybe they'll have their answer. And those are the shapes of the queries that we fundamentally process. We go towards those really complex queries, the gnarly six-way joins, the 10-way joins with a sub-query, the nasty side of SQL.

Arjun Narayan:
And then we keep those up to date as the underlying data is changing, on single- to double-digit millisecond, maybe triple-digit millisecond if it's really gnarly, timescales. Which allows you to take these answers that previously were being computed in an analytics pipeline, and put them in your application so that they're live-updated. So one of the personas of people who are very excited about Materialize are the ones who use something like an analytics system, whichever one you want to use, and then they compute some answer, and then they put that answer into some Redis cache, which gets hit in the application pipeline.

Arjun Narayan:
And so in some sense, that's what I was going for with that sentence... Although of course, we also have users and customers who are more "I have analytics and I want it faster, and I want it more real-time." We have some folks who are using Materialize that way too. But the real enemy to me, the thing I'm really going after, is cache invalidation. That, to me, is the enemy. You're trying to figure out: what is the data in your Redis cache? When was the last time it was updated? When did that pipeline last run? Do I have to go rerun a pipeline? Do I have to do cache purging on a regular basis so that it will go fetch it back from the source of truth and update the cache? That, to me, is the windmill that I'm tilting at. But [crosstalk 00:08:53].
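The cache-invalidation pattern Arjun is describing can be sketched in a few lines. This is a toy illustration, not Materialize's API; the data and names are made up. The first pattern batch-computes an answer and caches it, which goes silently stale; the second maintains the answer on every change, so there is never a stale copy to invalidate:

```python
# Toy sketch of the cache-invalidation problem: a precomputed answer
# goes stale the moment the underlying data changes, unless something
# maintains it incrementally.

orders = [("acme", 100), ("acme", 250), ("globex", 75)]

def revenue_by_customer(rows):
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0) + amount
    return totals

# Pattern 1: batch-compute an answer and stash it in a cache.
cache = {"revenue": revenue_by_customer(orders)}

# New data arrives; the cached answer is now silently stale.
orders.append(("acme", 500))
assert cache["revenue"]["acme"] == 350  # stale: the true total is 850

# Pattern 2: incrementally maintain the answer as each change arrives,
# so there is no stale cached copy to purge or invalidate.
maintained = revenue_by_customer(orders)

def on_insert(customer, amount):
    maintained[customer] = maintained.get(customer, 0) + amount

on_insert("acme", 125)
assert maintained["acme"] == 975  # 100 + 250 + 500 + 125
```

Maintaining complex multi-way joins this way, rather than a single dictionary, is the hard part that a system like Materialize takes on.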

Tristan Handy:
Absolutely. That's actually really interesting. There's so many problems that we think about all day, every day, in terms of low leverage engineering tasks that we want to have just completely go away with dbt. And I feel like cache invalidation is yours. It's like, no one should ever think about cache invalidation again.

Arjun Narayan:
Yeah, absolutely.

Boris Jabes:
I mean, I think if you zoom out on why those problems occur, we're all racing towards putting data in more places, in more uses, in more ways. And so, I think Arjun's point becomes more pressing over time. I can see how the idea of making analytics faster is not where all the simplicity is gained; having less variance on the same information actually makes the world simpler.

George Fraser:
Yeah, I think it's also part of a larger story of declining end-to-end latency of the analytic database stack. Data warehouses, which is the world we all live in: historically, 24 hours was the update frequency. The classic way that you would configure your Teradata data warehouse is, you would ingest all the new data at night when no one was querying it. And so it was expected that everything would be 24 hours old. And you can do a lot with 24-hour-old data. Don't get me wrong, but there's a lot of things that you can't do with 24-hour-old data.

George Fraser:
And from Fivetran's perspective, we're the ones feeding these data warehouses. We came into existence when Redshift came into existence. And we were pulling data for all these customers from systems like Salesforce, our very first connector, and systems like Stripe. And the traditional enterprise data warehouse data pipeline just copies all the data every night, because anything else is too complicated, too error-prone. So you just refresh the entire thing every night. And we immediately encountered that that was not going to be feasible. These SaaS tools' APIs are too slow to pull the entire data set even once a day. And so we built our connectors around change data capture from day one, and we did it because we had to.

George Fraser:
But then when we went and built database connectors, we also built them around change data capture, even though it was harder. We're a product company, we said we'll tackle that. And it had the side effect that if you configured the update frequency down to one hour, or 30 minutes, or 15 minutes, or one minute, you had much more up-to-date data. And suddenly there were all these other things that you could do with it. And every other component of the stack has to change in order to enable that. Every incremental 2X reduction in end-to-end latency requires every single component of the stack to make a major improvement. But I do think that is a unifying story across the entire analytic data stack, is just that latency going down from 24 hours, to one hour, to one minute, to below that. And as you go through that journey, there's more and more things that you can do with that tool set.
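The contrast George draws between nightly full refreshes and change data capture can be sketched with a cursor on an `updated_at` column. This is a minimal illustration, not Fivetran's implementation; the table and timestamps are invented for the example:

```python
import sqlite3

# A toy "source system" with an updated_at column we can watermark against.
src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
src.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                [(1, "Acme", 100), (2, "Globex", 105), (3, "Initech", 110)])

def full_refresh():
    # Traditional enterprise pipeline: copy everything, every night.
    return src.execute("SELECT id, name FROM accounts").fetchall()

def incremental_sync(cursor):
    # CDC-style pipeline: fetch only rows changed since the saved cursor,
    # then advance the cursor to the newest updated_at we saw.
    rows = src.execute(
        "SELECT id, name, updated_at FROM accounts WHERE updated_at > ?",
        (cursor,)).fetchall()
    new_cursor = max((r[2] for r in rows), default=cursor)
    return [(r[0], r[1]) for r in rows], new_cursor

changed, cursor = incremental_sync(0)       # first sync sees all three rows
src.execute(
    "UPDATE accounts SET name = 'Acme Corp', updated_at = 120 WHERE id = 1")
changed, cursor = incremental_sync(cursor)  # second sync sees only one row
```

Real CDC against a database typically tails the transaction log rather than polling a timestamp column, which also captures deletes; the watermark idea is the same.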

Boris Jabes:
George, do you think there's a weird... You know how with chips, you get to the point where electrons will start to melt across lines or whatever it is. Do you think there's a weird threshold beyond a certain point, where you can't just turn the crank anymore? You have to flip things.

George Fraser:
Yeah. I think when you cross the boundary from micro-batch systems to true streaming systems that actually process data transactionally, one row at a time, that is a hard barrier. It's just much more expensive. It's almost like physics. You have to do certain things in order to process data a single row at a time. There's all of these efficiencies you gain with micro-batch systems. And that boundary is at about one second. To go below one second, you have to pay a significantly higher cost. And there's just so much that you can do with latencies in the seconds range. I think that is an important stopping point. I often talk about that at Fivetran, that there is a limit.

George Fraser:
And all systems have latency. It always annoys me when people say real time or zero latency. What does that mean? You're looking at this, it takes about... I used to be a neuroscientist. It takes a couple hundred milliseconds for this to hit your visual cortex.

Boris Jabes:
Arjun, did you write this down? It took about 14 minutes for George to bust out his neuroscience background.

Tristan Handy:
Hey, Boris, George was talking about the early days of Fivetran and the change data capture thing. I learned something about Census, about your entire category recently. I read this blog post by... I think I'm going to screw up her name. Astasia Myers, I think, her name is. She wrote the primer on your space. I did not realize that the products that had come before you did not have change data capture for reverse ETL back to the systems of record. And this is something that you folks have built in from the ground up as well, is that right?

Boris Jabes:
Yeah. I mean, when we started Census in 2018, this really wasn't a thing. I think there's this hangover, to George's point, from Teradata batch warehouses that were a day behind. Not only that, they were also very expensive once upon a time. And the idea that those could be a source for anything was weird. And so the warehouse, I think, was by definition a sink, to use distributed systems terminology. And we just felt like, well, this is where all the interesting data modeling is happening. So we need to surface this out to the rest of the world. And it turns out if you build your product from the ground up for this, you can do it actually relatively efficiently. And then there's efficiency gains to be had on the other side.

Boris Jabes:
So George talked about this earlier too. These SaaS products are not great. Many of them are really not great, and some of them are pretty good. But they benefit from receiving data in the most batched way possible, because otherwise the network latency cost you're paying per record is just crazy. And so, that was the other difference. Every tool that had been built before was very much around moving events around in the business. So everyone in this room has probably used something like Zapier. And those are really, really easy to use, really fun. But they're effectively operating at an event at a time, a row at a time.

Boris Jabes:
And the overhead of that for sending real amounts of data is just really bad. And so, we found that you got a lot of benefit if you could leverage what the data team was building in terms of models, and then move that up into SaaS tools where work gets done. And that was my biggest... Look, George and I have known each other a very long time. One lesson I learned almost a decade ago is that people live in their pane of glass. In a company, people have their preferred tool. They just do. A salesperson likes to live in their sales tool, and a product manager likes to live in some kind of Jira-like product, probably. And so, you have to bring the data to them; you really have to bring the insights to them.
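Boris's point about per-record overhead is easy to see with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions, not measurements of any particular API:

```python
# Assumed figures, for illustration only: a 50 ms network round trip per
# API call, a bulk endpoint that accepts 200 records per call, and
# 10,000 rows to push into a SaaS tool.
ROUND_TRIP_MS = 50
BATCH_SIZE = 200
rows = 10_000

per_record_calls = rows                 # one API call per row
batched_calls = -(-rows // BATCH_SIZE)  # ceiling division: 50 calls

per_record_latency_s = per_record_calls * ROUND_TRIP_MS / 1000  # 500.0 s
batched_latency_s = batched_calls * ROUND_TRIP_MS / 1000        # 2.5 s
```

Under these assumptions, batching turns roughly eight minutes of serialized round trips into a couple of seconds, which is why event-at-a-time tools struggle with "real amounts of data."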

Arjun Narayan:
So you hit upon an interesting point, which is that the data warehouse just turned into this place where all the interesting modeling was happening. And Tristan, I know you spent quite a while building dbt. And for a while you were just helping analytics teams adopt best practices, even before Fishtown was a real company, if I understand that right. I'd love to hear, Tristan, how over the last half decade to a decade, this modeling energy, this modeling effort, has progressed, centered around the data warehouse.

Tristan Handy:
People thinking about data modeling inside of the data warehouse?

Arjun Narayan:
Yeah. I mean, so to Boris' point, it just so happens that all the interesting modeling was happening in the data warehouse. And even though folks were thinking of it as a sink, it's quite useful to use those resulting models, those views of the data, as a source for a different set of systems.

Tristan Handy:
Okay. So this is a big topic and I could go anywhere with it, but I'll just start saying some words. There's this whole Kimball versus Inmon data warehousing religious war that's been going on since whatever, the '90s or something and-

George Fraser:
'80s.

Tristan Handy:
... '80s, longer than I go back. And I think that the funny thing was that that religious war did not carry over into the modern data warehouse for a long time. Because it was only early adopters on Redshift back in the day. And so, you have a bunch of people using Redshift and Fivetran and Mode or Looker, and people are just doing fishing expeditions. There's all this data inside of Redshift, and then they're like, "I want to find out some answers." And then you go all the way down to the raw data and you try to figure out something, and then you come all the way back up to the surface. And you're like, "Hey, I came back with a report. And what do you know? It took me literally a full week to produce this one report and I forgot everything I learned in the process, but I got the report still."

Tristan Handy:
And it was only a couple of years into that process where people started actually saying... And this was a big thing for me, because I was looking at this code that people were writing and just saying, this is literally abysmal, and we cannot continue to do this as an industry. We need to structure this better and have these successive layers of meaning, which is how software code is organized. And so, we reinvented it ourselves. But then after a couple of years of doing it, we realized, yeah, people have been doing this before. Maybe we should borrow some of the things that they were doing.

Tristan Handy:
But it was neat to have an environment where you could reinvent things from first principles in the new context of the modern data warehouse, because actually there are different constraints. And the original data modeling methodologies, some of it's good, but you don't need to take all of it with you.

George Fraser:
Totally. This is an important part of the story. When Fivetran came into existence, all of our initial customers were technology startups in the Bay Area using Redshift. And none of them had these preconceived notions about what was the correct way to use a data warehouse. I mean, there are people out there who say that loading data more often than once every 24 hours is wrong, and you shouldn't do that. And that was really important. It was this pool of customers who were open-minded, and that created room, as much as the technology did, for new thinking and new ideas. And not all these things needed to be revolutionized.

George Fraser:
We came up with some new ideas, all of us who were involved in this space, we came up with some new ideas that broke some of the old rules, but they were good ideas. But then we also just rediscovered some of the same old ideas that were good all along and didn't need to be changed. My favorite example of this at Fivetran is history tables. So one of the things you sometimes want to do when you create a data warehouse is you don't actually want to replicate the data. You want to replicate a history of the data. And this problem has been around forever. When we implemented this, we implemented what is called, and has been called for 20 years, a type two slowly changing dimension history table. And we got basically the instructions for exactly how you should do this from a guy at a conference who is just a long time data warehouse guy.

George Fraser:
And he was like, "This is the way to do it. I've seen all the ways." I don't even know who this guy is. I should try to figure out who he was. But basically, me and our VP of product, we were sitting there like, [inaudible 00:22:38]. And then you have to subtract an epsilon from the timestamp. Why is that? Good point. And that is the thing we shipped six months ago, after we finally finished all the underpinnings that were necessary to do that. So a huge part of the story of all of our companies was the existence of this pool of customers who were willing to try new things. But then also, some old things were good things that just got carried on.
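The type 2 slowly changing dimension George describes can be sketched in a few lines. This is a rough illustration in the spirit of that design, not Fivetran's implementation; the column names and timestamps are invented. Instead of overwriting a row, each change closes out the old version and opens a new one, and subtracting a tiny epsilon from the closed row's end timestamp keeps the validity intervals from overlapping at the changeover instant:

```python
# Toy type 2 slowly changing dimension (SCD2) history table.
EPSILON = 0.001          # keeps valid_from/valid_to intervals disjoint
FOREVER = float("inf")   # sentinel end time for the current version

history = []  # rows: {"id", "value", "valid_from", "valid_to"}

def apply_change(row_id, new_value, at_time):
    # Close the currently open version for this id, if any...
    for row in history:
        if row["id"] == row_id and row["valid_to"] == FOREVER:
            row["valid_to"] = at_time - EPSILON
    # ...then open a new version instead of overwriting.
    history.append({"id": row_id, "value": new_value,
                    "valid_from": at_time, "valid_to": FOREVER})

def value_as_of(row_id, t):
    # Point-in-time lookup: which version was valid at time t?
    for row in history:
        if row["id"] == row_id and row["valid_from"] <= t <= row["valid_to"]:
            return row["value"]
    return None

apply_change(1, "bronze", at_time=10)
apply_change(1, "gold", at_time=20)

assert value_as_of(1, 15) == "bronze"  # the historical version survives
assert value_as_of(1, 25) == "gold"
```

The payoff is the point-in-time query: plain replication can only tell you what a row looks like now, while the history table can answer what it looked like at any past moment.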

Boris Jabes:
Tristan, you said something that I think is really subtle and fundamental about what you saw, and George, you have some history with this yourself, as I recall. It reminds me of big spreadsheets. There reaches a point where the content exceeds the form. You can have a spreadsheet that has a ton of stuff, but ultimately only one human being can even reason about it. I once worked at Microsoft, and I would see these people who were the custodians of the mega spreadsheet. And they're like, "Don't touch it. It computes the perfect number somewhere in there. Good luck ever being able to iterate on it."

Boris Jabes:
And your point about... You'd see these people do these deep spelunking sessions, come up with a report, and then that would disappear the next day; that entire knowledge that was built up is just gone. That, to me, is why the side effect of what dbt has caused is so great. Because now there's an artifact, and you can collaborate, and iterate, and actually treat it like something that is shared and everyone can learn from, which I think didn't exist before.

Tristan Handy:
I think that's true. We think a lot about... I can't remember. It was either Drew or Connor that introduced this metaphor to me. But have you ever put a screen protector on an iPhone, and then you realize there's a little bubble, an air bubble?

Boris Jabes:
Yeah.

Tristan Handy:
It's maddening. And so you get out your credit card and you slowly try to push the bubble out to the edge. I mean, you can eventually make the bubble disappear. But problems like this... The Excel spreadsheet is the bubble right in the middle. And then you're like, "Okay, now it's in SQL, it's in code." But it's still in this 250-line report. That's slightly closer to the edge, but still not that much progress. And then now with dbt, yes, we've made a lot of progress here, but we still feel like we're pushing that bubble out to the edge. In software engineering systems, you have tech debt. Where before, you didn't really have tech debt, because it was so dysfunctional you couldn't even call it technology.

Tristan Handy:
Now we have these systems that people are trying to treat as production systems, and yet there's not always the governance in place to know where's the stuff to deprecate, where's the stuff that doesn't have good test coverage. There's all these classes of problems that our ecosystem is not yet mature around, and we feel a lot of pain around them.

Boris Jabes:
Yeah, I think we're at the very... I hate to use sports analogies or whatever, but this is the first inning of data turning into a software artifact. The analogy I've used over the last couple of years is... When you plug in something like Census, that means the data is going to become live in a lot more systems than just reports, which means the downside is really bad. So if you screw up, you have a lot more responsibility, and that means you have to start treating this artifact as... You have to test it and validate it and make sure it's of high quality, while you also modify it as you go along. And that's what software teams do relatively well.

Boris Jabes:
And so, I think everything that we've learned about how to ship systems on the internet, and do that while being agile, is the same thing that we have to learn for data. And DevOps is the term for this, the term of art. And I think every aspect of that is entering into the data profession, and it's just really fun to watch, because we get to catalyze that as much as we can.

Tristan Handy:
George, how long ago did you go to that conference and have the person tell you how to make a type two slowly changing dimension?

George Fraser:
I mean, it was two and a half years ago. History tables turned out to be astonishingly complicated to implement correctly. So there was quite a lag between when we knew what we wanted to do and when we did it.

Tristan Handy:
But it wasn't six years.

George Fraser:
Yeah.

Tristan Handy:
See you think that's a small number.

George Fraser:
Yeah. It takes time.

Boris Jabes:
By the way, Arjun, to send it back to you a little bit. The tighter you make the loop, the faster Fivetran gets the data in, the faster you can process data in micro-batches, let's bring it down to seconds. It's interesting because now you also have even less room for error. So is that a problem? Are people ready for data having to be correct every step of the way if it's going to be so live?

Arjun Narayan:
I actually think this is a huge reason why people have been leaning so hard on data warehouses for, essentially, what I would otherwise term application use cases. The thing that SQL databases and SQL data warehouses have been very, very good at, that other folks have played fast and loose with, is correctness: data consistency in the ACID sense of the term, the foreign key checks, making sure that the tables that you're joining can actually be joined correctly against each other, that the slowly changing dimensions are handled properly. And data warehouses are the place where this has been done very well.

Arjun Narayan:
I mean, we've all at some point in the past few years gone on LinkedIn and seen that it says you have 300 messages, and you click it, and there's two of them there, and one of them you read yesterday. And it shows up even after you've read the messages and reloaded the page. Because fundamentally, that frustrating cache invalidation problem is very difficult to solve anywhere at all. But the data warehouse has done a very good job of it. And so as these loops get tighter and start demanding lower latency, we're going to see pressure for data warehouses to lower the latency at which they can update their data. I mean, that's fundamentally a bet that Materialize, the company, makes as this plays out in the future.

Arjun Narayan:
But if you aren't using a data warehouse as a place for joining your data together and merging all of the data from dozens of SaaS applications, across which this customer data is spread, where else can you do it in a way where you can actually stay sane? I don't think there's any other system that gives you the correctness guarantees the data warehouse does.

George Fraser:
This is a slight sidebar, but this reference to systems that have maddening consistency bugs... I have to call out that one of my personal favorite life accomplishments, this really might be number one, is that Arjun and I wrote a blog post, and if you Google search "is Kafka a database?", the top result is our blog post saying Kafka is not a database, which enumerates exactly why you need transactional systems for many use cases. Kafka is a very good system that has many useful roles, but it is not a database. And some people think it is. And we wrote this blog post, which has gotten some solid traction. And it makes me happy every time I think about it. I want you to know that, Arjun.

Arjun Narayan:
I do like, George, that a neuroscientist software engineer is most excited by their marketing accomplishment.

George Fraser:
Once you've been running a company for a few years, at some point it's all about pipeline. That's all you think about.

Arjun Narayan:
I had hoped the chat would be a little bit more subtle about calling out LinkedIn's consistency errors, George, but you had to draw that out [crosstalk 00:31:45].

George Fraser:
I'm not a subtle guy, Arjun.

Tristan Handy:
I'm really excited about the potential that you folks have for increasing the criticality of the stack. There's a lot of behavioral things that we really care about pushing people on, like test coverage. I don't know what the number is right now; it's somewhere between a third and a half of dbt projects that run tests at all. Which means that somewhere between half and two thirds of projects are completely untested, and yet have 100-plus models. And it's scary to me.

Boris Jabes:
It's super scary.

Tristan Handy:
So, the thing that I really want is for these models to feed data into production systems. And then I want to use that and say: are you really going to allow incorrect data in front of your sales reps, kicking off automated emails or whatever? It increases the level of criticality of doing things the right way.

Boris Jabes:
Yeah. I think of Census as a catalyst for that. Because once the data's out, we're the last bastion, after which people are going to do crazy things. And what I find people do... What I try to teach people is: start small. Don't try to make 100 models, each with dozens of attributes, live on day one. Build the feedback loop and start with a smaller set and test it. Be confident in it, and then expand as you go. The reason I use the DevOps analogy is I feel like people should treat these models, this data and the logic that they build, as something that iterates over time. And so what we have to do is give you the confidence to iterate. And you're right, it's amazing how early we still are in terms of testing and monitoring data. We're just at the beginning, really barely at the beginning.

Arjun Narayan:
I think that just goes back to the fact that historically, data warehouses were used for reporting, and the expectation was that human beings would check the correctness of these reports and go fix the SQL queries if something was wrong. It's just the same way that software engineering started with manual testing. Testing meant that before you cut a release, a bunch of people would click every button and make sure that it works, and it evolved to automated testing. It's the exact same evolution with data. As you go from releasing a report every quarter or every week, which is checked by a person before it goes out, to powering an operational system every second, you can't do manual testing anymore. You have to do automated testing. And so you have to adopt all these best practices. The nice thing is that software engineering came first.
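The automated checks being discussed look roughly like this in practice. This is a toy version, not dbt's implementation, in the spirit of dbt's `not_null` and `unique` generic tests; the table and column names are made up:

```python
# Toy data-quality checks that run on every refresh, instead of a
# person eyeballing a report before it ships.

def check_not_null(rows, column):
    # Return the rows where the column is missing a value.
    return [r for r in rows if r.get(column) is None]

def check_unique(rows, column):
    # Return any values that appear more than once in the column.
    seen, dupes = set(), []
    for r in rows:
        v = r.get(column)
        if v in seen:
            dupes.append(v)
        seen.add(v)
    return dupes

users = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},             # fails the not_null check
    {"id": 2, "email": "c@example.com"},  # fails the unique check on id
]

null_failures = check_not_null(users, "email")
dupe_failures = check_unique(users, "id")
# A pipeline would refuse to publish downstream while either list is
# non-empty, rather than letting bad rows reach an operational system.
```

The shift from reporting to operational use is exactly the shift from "a human notices the number looks wrong" to "the pipeline blocks the release," which is why test coverage becomes non-optional.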

Boris Jabes:
Yes, so we have all the cheat codes. We literally have all the cheat codes.

Arjun Narayan:
We can just copy, no need to innovate.

Boris Jabes:
No need to innovate.

Arjun Narayan:
Just do what they did.

Boris Jabes:
Yeah. For whatever venture capitalists are in the room, that's your roadmap: just insert every segment of the software ecosystem into data and you're good to go. But it's true. I think we're at the beginning of it. And to your point, there was no need to be automated in how you test data until now. I've been curious about... I think it's scary to people to put data in the critical path. It's one of those things where they're excited and they're scared. That's one of the things I've noticed with our product: you're taking on this responsibility that you didn't have before. Have you seen that happen? Have you found interesting ways to shepherd people through that, or is this just something that-

George Fraser:
I mean, I can tell you some stories about that, about data being in the critical path in ways that scared me. When we started, when we were a much smaller company, our customers were mostly using Fivetran for reporting. I thought that's what they were all doing with Fivetran. And Fivetran, when we were a much smaller company, was not as reliable as it is now. Sometimes something would change about the source API, and Fivetran would be down for a few hours or sometimes for a few days. But if you were doing reporting, your reports were a little stale. It was not good, but it was not the end of the world. But then we started getting bug reports. If Fivetran was down for just a couple hours, people would freak out.

George Fraser:
And we were like, "Why is this such a big problem for you? Okay, it's not great that we're down for six hours today, but it feels like it shouldn't be the end of the world. Your dashboards are six hours behind." And then we found out what people were doing with the data. People were running payroll off of the data. People were billing their customers off of the data. They were like, "We can't pay our employees until Fivetran is back online." And I was like, "Oh, my God! I did not realize you were doing that with the data." And ultimately, it was a challenge that we just had to rise to. We had to say, "Okay, we are in part of an operational system now. We need to have the kind of reliability that operational systems have." But realize that is a foreign concept to the data warehousing stack.

George Fraser:
That was not a thing that people expected in the old days. But you have these users who look at Redshift or Snowflake or BigQuery and say, "This is the database, I can do whatever I want with the database," and they don't know these rules, they just start doing stuff. And sometimes it pushes you in a great direction.

Boris Jabes:
... by the way, Arjun, that's a good piece of wisdom for Materialize. And I remember, George, hitting this issue a couple of months ago. And we've hit this with some of the best warehouses in the business.

George Fraser:
Let's name and shame, guys. Which warehouse? Which of the three is actually good for production use cases?

Boris Jabes:
I think it's more that they're not necessarily great at providing information about outages, because they themselves are also not thinking about the fact that they are truly powering live operations. And I think it doesn't matter that your technology can compute new records with sub-second latency if you don't also give customers the trust that your system is super trustworthy in terms of uptime. And that is hard.

Arjun Narayan:
Well, it's not just about uptime. These data warehouses have built in this model that you can take things down and put them back up. But in the operational path, you have all these considerations around operational features that are required, like, say, online schema changes. So schema changes, adding a column in a way that can be done without having to essentially kill everything and recompute it from scratch. It's quite surprising how late in the game this shows up, even in OLTP databases that have been built to have very, very high uptimes and the ability to do things like add columns. I'm curious... George, have you seen some of these data warehouses rise to the occasion on that front, adding in features that make them suitable for applications?

George Fraser:
I mean, we see the backend of all these data warehouses, and we have so many customers syncing all kinds of different data sources into these data warehouses. We see them at their worst. We see everything that happens. And I will tell you, in the early days of Fivetran, our whole concept of an automated data pipeline that would just propagate changes, particularly schema changes as you call out, through to the data warehouse barely worked. In the beginning, we were just supporting Redshift as a target. And Redshift, when we first started, would crash a lot when we would do schema changes. Because they had assumed that all schema changes were going to be done by a person.
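The "online" schema change Arjun asked about is, at its simplest, an `ALTER TABLE ... ADD COLUMN` that takes effect without rewriting or recomputing the existing data. A sketch with sqlite3 (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])

# An online schema change: the new column appears immediately, and
# existing rows read back with the default instead of being rewritten.
conn.execute("ALTER TABLE events ADD COLUMN source TEXT DEFAULT 'unknown'")

rows = conn.execute("SELECT id, payload, source FROM events ORDER BY id").fetchall()
print(rows)  # existing rows are readable with the new column backfilled by the default
```

The hard part in a warehouse is doing this while queries are in flight and without invalidating dependent objects, which is exactly where the early systems fell over.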

George Fraser:
So they'd do it at night. They'd do a handful of... Every Sunday night, you'd see a handful of schema changes. But because this is an operational system now, we have to just pass through whatever we see. We have customers who do a million schema changes a year. They get into crazy numbers. And fortunately, it got better. We saw that happen. We saw the data warehouses improve, because we were not the only ones doing this. So our code for talking to the data warehouses, for example, if you look at it, got significantly simpler over the years, because we didn't need as many failure conditions and retry loops. They were just getting more reliable, and there were fewer bugs in the data warehouses over time.
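The "failure conditions and retry loops" George mentions are a standard defensive pattern around a flaky destination. A hypothetical sketch, with the unreliable warehouse simulated by a function that fails twice before succeeding:

```python
import time

class TransientWarehouseError(Exception):
    """Stand-in for a warehouse crash or timeout that is safe to retry."""

def with_retries(op, attempts=3, base_delay=0.01):
    """Run a warehouse operation, retrying with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientWarehouseError:
            if attempt == attempts - 1:
                raise  # out of retries, surface the failure
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying

# Simulate a destination that fails twice, then succeeds.
calls = {"n": 0}
def apply_schema_change():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientWarehouseError("warehouse crashed on ALTER TABLE")
    return "schema change applied"

print(with_retries(apply_schema_change))  # succeeds on the third attempt
```

As the destinations got more reliable, layers like this could be stripped out, which is the simplification George describes.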

Boris Jabes:
I'll be happy to tell you, and George knows this: the SaaS products are not quite there yet.

Tristan Handy:
No. I do an onboarding talk for new folks that join our team. And I love to talk about those early days too. So one of the stories that I tell is that in dbt you can materialize your datasets primarily as views or tables. Originally, we were dealing with reasonably small clients who had small data sets. And so views were basically always preferable to tables, because whatever, it's always real time if it's a view. But we found that Redshift would experience these really horrible compiler errors that returned unintelligible error codes when we stacked 10 views on top of one another.

Tristan Handy:
And so the whole reason that we would use the table materialization back then was, when we ran into a Redshift compiler error, we were like, "Yeah, put a table in there." And that was literally the reason for that dbt feature at the beginning.
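The difference between the two materializations is just what the database stores. A sketch of the idea with sqlite3 (the model names are invented; in dbt these would be models configured as `view` or `table`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 10.0), (2, 5.0)])

# View materialization: nothing is stored; the query reruns on every read,
# so stacking many of these pushes all the work onto the query planner at once.
conn.execute(
    "CREATE VIEW orders_enriched AS "
    "SELECT id, amount * 1.1 AS amount_with_tax FROM raw_orders"
)

# Table materialization: the result is written out once; downstream models
# read plain rows. This is the escape hatch when stacked views break.
conn.execute("CREATE TABLE orders_snapshot AS SELECT * FROM orders_enriched")

print(conn.execute("SELECT COUNT(*) FROM orders_snapshot").fetchone()[0])
```

Putting a table in the middle of a deep view stack resets the planner's work, which is why it was the workaround for those Redshift compiler errors.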

Boris Jabes:
Well, I mean, come on, there's a lot of... I can try to think about what features have been left out of our product. Having real users and real systems that depend on you really gets in the way of the perfect cathedral.

Tristan Handy:
Yeah.

George Fraser:
I'm sorry to tell you, Boris, that... Because we face both sides. We talk to APIs as sources and warehouses as destinations. The warehouses have gotten better, but the APIs aren't getting any better, Boris.

Boris Jabes:
I agree.

George Fraser:
It's the same thing from here on out.

Boris Jabes:
No, it's fine.

George Fraser:
I like to tell people that there's this funny principle we observe, which is that the most successful companies tend to have the worst APIs. It really is true. And I swear it's because if you have the perfect API, it means you were focused on the wrong thing. You weren't focused enough on sales. And so what you see is these mega successful companies have just terrible, crazy APIs. The one exception is Stripe, because their product is their API. They are an extremely successful company with an API that is actually good. But otherwise, in general, the more successful [crosstalk 00:43:45].

Boris Jabes:
By the way, George, the API quality is good, that's right. But believe it or not, it's not that impressive as a receiver of data.

George Fraser:
No. We only read. So that's [crosstalk 00:43:57].

Boris Jabes:
I was actually somewhat surprised. But I think that's a really good point. Famously, by the way, Expensify at one point was just like, "Screw it, we're canceling the API." They just stopped supporting it. They just turned it off. They just decided, "We don't care anymore as a business to deal with API clients," which is really something. George and I have known each other a very long time, but in a past life I had to automate employee identity management and log-ins to thousands of SaaS apps. And that's also a rude awakening in how inconsistent products and companies are here. And again, I think George is right. If you're working on that, it's because you're not working on the things that matter to your core business. These are nice-to-haves.

Tristan Handy:
We were riffing on the whole idea of software practices, and bringing that to data. The thing that I am really curious about is what you folks think about the people involved in this process. My hope is that there's this new class of humans who, at one point in time, we would have called data analysts, that are now empowered to essentially orchestrate production workloads, not just analytical workloads but operational workloads too. But they're not classically what you would think of as software engineers. But maybe as this stuff goes deeper and deeper towards production systems that have more and more criticality, maybe you get software engineers much more back in the mix. I don't know. I'm curious [crosstalk 00:45:53].

Boris Jabes:
Tristan, I think this is a really good question. And by the way, I think this is a good note to open things up on. If anyone wants to raise their hand or whatever at this point and talk about this, this is probably a good place to let people chime in. But, Tristan, I think about this a lot. dbt has really leveled up the practitioner. And then because of the transitions in the infrastructure, now we have to think about, I even think about, what's the composition of the team? Where does it sit in the organization, and what is in it? And for sure, there are emerging, I feel like, almost three layers of people within a data team. There's low-level data eng, which is what you and I would call capital-E Engineering, software engineering, people that may even own things like George's favorite product, Kafka.

Boris Jabes:
And then you've got your analytics engineering organization at this point, or practitioners who are maybe building core models. And then you still have a data science team that is doing ad hoc discovery, predictions, analysis, et cetera. I don't even know exactly where to... I feel like AI and ML fit into that mix somewhere. And then where does that report up to? I mean, I have seen BI teams to this day that report up to a CFO. That's pretty fricking common. And if you're going to become more engineering-minded, or you're going to mix in with engineers, you really can't report up to a CFO. That doesn't make sense. And so, is that a dotted-line relationship? Actually, I'm not sure.

Tristan Handy:
Most analytics engineering teams today don't have things like on-call rotations. They don't use PagerDuty.

Boris Jabes:
Right. It's just a matter of time.

Tristan Handy:
Maybe it is. But also that could be a rude awakening, when you say, "Hey, you added the term engineer to your title and welcome to weekend PagerDuty."

Boris Jabes:
Yeah. Look, one of our early customers, she started using Census and it was amazing. And she started pushing data into her marketing tools, and sales tools, and all these things. And then she had this realization, it hit her three months later. She was like, "Now it's my fault if it's wrong. And please, Census, don't fall down." But more importantly, she didn't make the leap after that to, "Should I be on call?" That didn't exist in the mind frame yet. And they might have to be. And then who's the leadership in charge of that? Are we going to popularize a chief data officer, or is that going to report up to the CTO, or what? I actually think it's in the air, it's up in the air.

Arjun Narayan:
I think there are a lot of things that are changing here, but this is really one of my favorite things about the work that we're all doing: we're expanding the scope of what you can do if you know SQL. We're allowing people who know SQL to do a lot of the things that previously you would have had to be a software engineer to do. And I think that's a really great mission, because that's just such a larger set of people for whom that's a realistic goal. And there's something really great about that, about bringing the power of software engineering to a larger group of people.

Boris Jabes:
I agree. And selfishly, when I "go to work" in the morning by going from one room to the other room in my house, that's what's most exciting: watching people have that epiphany that they're now becoming a... They don't even know to call it an engineer, but they're becoming this leveraged, more empowered person, knowing only the language that they had already learned. It's really quite something to see.

Arjun Narayan:
So Matt has a question for us. Matt, please introduce yourself.

Matt:
Hey, guys. Hey, Arjun. Hey, Tristan.

Tristan Handy:
Hey.

Matt:
So I'm Matt. I lead the, I guess, analytics engineering team for merchant-facing models at Shopify. So all the merchant-facing reporting, the reports we show merchants, my team builds the Spark and streaming models that do that. And I was just going to weigh in on the analytics engineer conversation. So, traditionally, I guess my team even today is very much part of the Shopify data organization, and we do report up to the CFO still. But more and more, we want Shopify to be more and more data-powered, where every area of Shopify operationalizes the data in some way.

Matt:
And so the vision that we see, or that we're trying to work towards over the next year, say, is where more and more building of data experiences and data pipelines is just as much a responsibility of backend teams, or part of what the engineering org does, rather than something that's just part of the data craft. I would imagine that in the future, our data org would continue to have what we call production engineering, which is that big-E Engineering as you referred to it, as well as the data science role. But more and more, I think we're seeing that migration into the engineering part of the business.

Matt:
And part of that is so that we can have vertical teams, so that it's not just one team that captures the data and another team that models it, but where the team that produces the operational data is also the team that models it. And so there's more resiliency there as well. In these vertical sections, the products group at Shopify, the orders group at Shopify, they do both the operational and the analytical.

Boris Jabes:
Matt, that's super interesting. I mean, it's also nice to have a center of excellence. And you guys [inaudible 00:52:03] look crazy.

Matt:
This is exactly the kind of things that we're trying to figure out, and trying to figure out that balance. And I don't think we have an answer yet, but just thought it might be interesting for folks to hear that's what we're also struggling with.

Boris Jabes:
Absolutely. And I want to make... Tristan, I'll let you go, sorry, one sec. I would say what you just said, Matt, is that you're at what I would call the extreme of operationalizing data, because you're operationalizing it in your application, in the product that is Shopify.

Matt:
Absolutely.

Boris Jabes:
That's the most live ammo you get. I call that the final form of operationalizing.

Tristan Handy:
Matt, one of the things that I find interesting is that... And I wonder if you've experienced this at all, or if you folks have pushed further on this cultural dimension. If you look at software engineers in the '90s, most software engineers wouldn't have had much exposure to things like design thinking and how you design user-centric applications. And I think that a lot of software engineers today don't actually have a lot of focus around data products. If you ask, how would this data be shaped in a way that makes it usable for downstream data consumers? I don't know that that's an answer a lot of software engineers today know how to give. Do you folks think at all about this?

Boris Jabes:
I mean, testing used to be something that engineers in the '90s didn't want to do.

Tristan Handy:
Well, that's true too.

Matt:
Yeah. I'd say that it's a craft that we're developing. So this, let's call it analytics engineer, let's put it in the engineering group, let's say that you specialize in things more like Spark, and Kafka, and Parquet, and Flink maybe, these sorts of things, Materialize, Bigtable. And maybe it's somebody who previously was a Ruby on Rails engineer and they just want to do a bit more data stuff. Currently we don't have... We want a culture of really strong data principles: here are my facts, here are my dimensions, really good BI 101, your Kimball, Inmon kind of thing. But that's definitely a culture within engineering that we're going to need to build, and we haven't done that yet.

Boris Jabes:
By the way, Tristan, talk about a responsibility. We as a group here are up-leveling people who just know SQL and endowing them with the title of engineer, and just wait until the gatekeepers with CS degrees in the CTO's office are like, "Who are these people, and why are we giving them the same access that we have?" It'll be very interesting.

Tristan Handy:
"And what are their compensation bands again?"

Boris Jabes:
That's good. That's fine. They could use some competition. That's good. Ben, you had a question?

Matt:
I'll step down. But if anyone is interested in those problems, feel free to reach out to me.

Arjun Narayan:
[crosstalk 00:55:15]. Thank you so much, Matt.

George Fraser:
I got to do it. It's my job.

Ben:
Hey, guys. Thanks for bringing me up. I am on the go-to-market operations team at Culture Amp, and I'm just getting into SQL, and Fivetran, and dbt, playing around and growing my skills. And so the conversation earlier about empowering more and more people with SQL was interesting to me. I wanted to ask: is empowering people like me, who are just learning SQL and just getting into data analytics, driving your product direction, or are you moving your products towards more specialized roles?

Boris Jabes:
I mean, I definitely think about it. I think of it as, how do I increase your leverage? When I started working on Census a couple of years back, I just found that people, whether in BI or operations, were just doing a lot of repetitive work that was not at scale. And if you know SQL and can affect your business with those skills in ways that are actually magnified, that's amazing. My goal is to try to magnify your impact on a company, which is what engineers... That's why, I think, Tristan endowed this persona with the title of engineer, because engineers are the ones who are always thinking in this way.

George Fraser:
Yeah. On the Fivetran side, we're doing the thing that we've done for years, which is pipelining data, including four years for Culture Amp, a long-time Fivetran customer. Thank you. And we're trying to make the uptime higher, the latency better, more sources, more complete schemas. We're also trying to help with the data modeling side of it, primarily by embracing dbt. We're trying to be the biggest supporter of dbt other than Fishtown Analytics. And that takes the form of shipping pre-written dbt packages for our connectors, and thinking about scheduling and orchestration. That's the other vector we're on these days: trying to figure out how we just get on the dbt train and make that the standard way of modeling data in the data warehouse.

Boris Jabes:
Plus one on that, obviously. And you just have more humans to throw at it right now, George. I don't want this to be the Tristan lovefest.

George Fraser:
I'm not contributing to this. Tristan and I used to be enemies because he was at Stitch Data long ago. This is one of the funny things that happens in technology: it's this small cast of characters, and it's the same characters year after year. And you find your past enemies are your new friends. It's funny how that works. [inaudible 00:58:16] and Jeff Bezos combating each other over space next. It's a new game, same players. There's this Lakehouse thing that's happening, and it's actually bridging the worlds of SQL users and more traditional data engineers who might use Python or other languages.

George Fraser:
I for a long time had been this big "SQL or die," [inaudible 00:58:53] on my leather motorcycle jacket. But I increasingly think that the reason SQL is important for me is because it has scaled incredibly nicely along with the cloud data warehouse. And so it takes away all the operational overhead, which is such a waste of time. And if Snowflake or Databricks or whoever can give me other languages that have these same operational properties, I'm excited to grow the community of people who are participating in the modern data stack. I wonder if others feel the same way.

Tristan Handy:
Yeah. I mean, I think the two companies you see making big moves in this area are Databricks, coming from the direction of having really good support for procedural languages like Scala and Python and making their SQL better and better, and then Snowflake coming from the opposite direction, being best in class for SQL workloads. And now there's Snowpark, which is basically a way to do procedural programming in Snowflake, and which is still in private preview with a few customers. But everyone will-

George Fraser:
Have you used it?

Tristan Handy:
... I think, have access to it soon. No, I have not used it. I've talked [crosstalk 01:00:08]-

George Fraser:
I haven't either.

Tristan Handy:
... are working on that, but it basically represents.

Arjun Narayan:
George, so I want to push back a little bit, in the sense that this is [inaudible 01:00:19], and I've seen some grumbling about this from folks I respect who are like, "We're just working our way back to a world in which all of our business logic was encoded in PL/pgSQL," and the use of triggers that form some ludicrous set of cyclic dependencies that nobody can understand, and nobody even knows what's going on. And what's different this time? How is this not going to end badly the same way, if we're just dumping everything on a single database or data warehouse and layering on a set of triggers?

Arjun Narayan:
One of the things I like about SQL, what I liked about view materialization as a concept, is that it scales pretty well. If you had 30 views that were all in this complicated directed acyclic graph, you can still make sense of it in a way that you can conceptualize. You can have views defined and provided as a service by one team that another team can build another view on top of. But once you start mixing procedural languages, and fundamentally triggers, on top of this, it becomes ugly very quickly. How do we stay sane in that world?

George Fraser:
Arjun, I agree with you. I wrote some of that... What was it? PL/pgSQL. That was what I did in the early 2000s. But I think the thing that I'm excited about here is, you can still do this well and create a positive construct, I think. So imagine something similar to a pandas DataFrame, where your transformations are applied by chaining multiple operators one after another. But still, fundamentally what you're doing is describing a dataset in code. It's not super procedural. It's just using different syntax to describe transformations. I think you can appeal to a new audience who's more comfortable expressing the construct that way. I don't know. Certainly it opens the door to doing a bunch of stuff very poorly.
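The chaining style George describes can be sketched in plain Python. The tiny `Dataset` class below is invented for illustration (a stand-in for a pandas DataFrame): each method returns a new object describing the transformed data, so the code reads as a declarative pipeline rather than trigger-style mutation.

```python
class Dataset:
    """A toy, pandas-like chainable dataset of dicts."""
    def __init__(self, rows):
        self.rows = list(rows)

    def where(self, pred):
        # Filter rows, like a SQL WHERE clause; returns a new Dataset.
        return Dataset(r for r in self.rows if pred(r))

    def derive(self, name, fn):
        # Add a derived column without mutating the input rows.
        return Dataset({**r, name: fn(r)} for r in self.rows)

    def group_sum(self, key, value):
        # Aggregate, like GROUP BY + SUM; terminates the chain.
        out = {}
        for r in self.rows:
            out[r[key]] = out.get(r[key], 0.0) + r[value]
        return out

raw = [
    {"user": "a", "amount": 10.0},
    {"user": "b", "amount": -5.0},
    {"user": "a", "amount": 30.0},
    {"user": "c", "amount": 20.0},
]

summary = (
    Dataset(raw)
    .where(lambda r: r["amount"] > 0)
    .derive("with_tax", lambda r: r["amount"] * 1.1)
    .group_sum("user", "with_tax")
)
print(summary)
```

Nothing in the chain mutates shared state, which is what keeps this style closer to SQL views than to a tangle of triggers.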

Boris Jabes:
Yeah. I think, Arjun, there's risk, of course, in everyone coming to this area and potentially writing bad code, or code that can't be reasoned about at some scale. What I've seen, though, is that if you expand your universe to include what I look at, which is entire business processes, it's the wild west out there. The only thing that saves people is that there's some amount of separation, so that some functional units can work outside of other functional units. But the logic in the business is really haphazard: ad hoc pointing and clicking in some UI, and a lot of imperative code. They don't realize they're writing imperative code because... It's called a zap, but it's imperative code.

Boris Jabes:
And I do think if we can move people towards what I think of as functional programming, which SQL is a great example of, where you're actually refining the data, putting it through this refinery process, it should be better than what we have. That's my view. But you're right, I think there's always a danger of that.

Arjun Narayan:
I think the short answer to the question I posed is: we will repeat the errors. The nightmares of the stored procedures, we'll write them again. The hard thing here is that there is a community of people that does not like SQL, that prefers to do their work in a procedural programming language like Scala, or Python, or something like that. And it's hard to say no to them, to say you can just go use Kafka; if you want to use our system, you have to write everything in SQL. So I think it's inevitable that we're going to have this support for procedural languages as well. And it's also inevitable that we're probably going to repeat some of the errors of the past.

Boris Jabes:
That's [crosstalk 01:04:54].

George Fraser:
And it's going to be a lot of consultants making a lot of money fixing them.

Boris Jabes:
I guess that's how you know we'll have made it, right? If we have a consultant ecosystem that makes a ton of money fixing the problems we have wrought. Arjun, that's a good... By the way, I should say this: thanks, everyone, for sticking around. I'm Boris over here, and we've got Tristan and George and Arjun, for anyone who joined a little later. If you have, really at this point, arbitrary questions or things you want us to cover next time, just raise your hands. This is a friendly audience.

Arjun Narayan:
Sorry. I don't know how to vamp.

Tristan Handy:
George, you've had a take on this. This is something I've been interested in. I have a question of my own. There is a question from the audience, so I'll let them go first.

Boris Jabes:
Look at that. Let's hear from Josh. Hey, Josh.

Josh:
Hey, everybody. Can you hear me?

Boris Jabes:
Yes.

Josh:
Okay, great. Well, first of all, very informative discussion. Thank you all so much. I have a two-part question, and I think it's primarily directed to Arjun and Materialize. Tristan, you were talking a bit before about the problems many years ago with stacking views in Redshift. I don't know, for me, on BigQuery, that was a problem three months ago. So I managed to break [inaudible 01:06:23] views, because I was trying to do this streaming processing, essentially, and some unpivoting and some other things. Obviously, some of this might get a bit into the secret sauce of Materialize, but I'm just trying to understand.

Josh:
At some point, what you're doing is essentially stacking the equivalent of a bunch of materialized views on each other. And if so, it seems like on a lot of the platforms, like Snowflake and others, the challenge with materialized views has been, you look at the details and it's like, well, you can't do window functions, and you're limited in your aggregations, and you can't do joins. And you're really stuck there. So I'm not phrasing this as the best question, but I guess I'm just trying to understand better how it would work in Materialize and how you would get around some of that. Because if you could really combine all those worlds, it just seems like it would be amazing. Does that make sense?

Arjun Narayan:
Yes.

Tristan Handy:
Arjun, how often do you get the "it's too good to be true"?

Boris Jabes:
Yeah. That was a plant, right? Arjun, you paid Josh for this, right?

Arjun Narayan:
We get that a lot. Thanks, Josh. I think you'll see this on Hacker News when we announced the dbt Materialize-

Boris Jabes:
The check is in the mail.

Arjun Narayan:
... there's some of that sentiment too. I mean, the real answer is that materialized views have been this longstanding desire. Pretty much all materialized view implementations that I'm familiar with are just extremely hamstrung in their capabilities. The biggest one being that you can't do joins, but there are all sorts of fun ones. You can do sums. You can do max. But if anyone ever deletes a row, then the materialized view is marked as invalid and must be fully refreshed, because if you remove the max, then nobody knows what to do. And as a result, you either have very, very small, limited implementations, or people don't actually use materialized views at all, because it's just a risk, and the DBA says, "This is too expensive. I can't really understand the performance characteristics of this thing. So don't do it."
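The asymmetry Arjun describes falls out of simple arithmetic: an incrementally maintained SUM can absorb a delete by subtracting, but a MAX whose maximum is deleted has to rescan everything. A sketch (the update functions are invented for illustration):

```python
# Incrementally maintaining SUM: a delete is just a subtraction,
# so the materialized value never needs a full refresh.
running_sum = 0.0
def sum_insert(x):
    global running_sum
    running_sum += x
def sum_delete(x):
    global running_sum
    running_sum -= x

# Incrementally maintaining MAX: deleting the current max leaves
# no way to know the new max without rescanning the remaining rows.
rows = []
running_max = None
def max_insert(x):
    global running_max
    rows.append(x)
    running_max = x if running_max is None else max(running_max, x)
def max_delete(x):
    global running_max
    rows.remove(x)
    if x == running_max:
        running_max = max(rows) if rows else None  # the dreaded full rescan

for v in (3.0, 9.0, 5.0):
    sum_insert(v)
    max_insert(v)
sum_delete(9.0)  # sum updates in O(1)
max_delete(9.0)  # max has to recompute from what's left
print(running_sum, running_max)
```

This is exactly the case that got classic materialized views "marked as invalid": rather than keep the extra state needed to answer a delete, the engine just gave up and demanded a full refresh.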

Arjun Narayan:
The founding of Materialize is a very interesting story, in that we didn't start by saying we wanted to make materialized views at all. Frank, who could probably tell you some of the story of the original research if he's in the audience, was building this next-generation stream processor called Timely Dataflow. And on top of that, he had built this fully incremental dataflow system called Differential Dataflow, which had all the basic relational building blocks and many more, and would basically always do incremental compute. So it wasn't really about stream processing as much as it was about incremental compute. The Timely Dataflow layer was this amazing next-generation stream processor. But on top of that, he had built an incremental compute engine.

Arjun Narayan:
And I looked at it and I was like, "Oh, my God, this is the thing that finally makes incremental computation for arbitrary classes of computation possible," and harassed him over several years into reluctantly commercializing the underlying technology. I think there's something George has been talking about: even when you don't need that millisecond latency, it's the fact that you frame the computation so the work is done incrementally that's very powerful. So even if you're doing it minute by minute, if you're only doing work proportional to the rows that have changed in the past minute, you can get meaningful efficiency improvements, and meaningful latency improvements beyond that update frequency, and you end up being able to do more computation, more interesting computation.

Arjun Narayan:
And so we packaged it up as materialized views. But the original story behind the underlying technology was that the technology came first, and it took us several years before we realized... Or I took the stance that the best way to package this for users, so that they would understand the value of incremental computation, was to package it up as materialized views that, for the first time, were arbitrarily capable. And there are a couple of caveats. We haven't solved all of it. I think the biggest one that we still haven't really cleaned up is window functions. So we do the 12-way joins, we do the subqueries, correlated, uncorrelated, all sorts of subqueries. Window functions ended up being a little tricky, simply because they can create a ton of intermediate state. Fundamentally, if you have a row number for each row and you remove a row, every row after it flaps about, creating a ton of intermediate state. We could naively support that, and then it would just...

Arjun Narayan:
They would have that same frustrating property that would eventually get a DBA to shut it down and say, "Please don't write that code. It destabilizes the whole system." So I've gone off on my soapbox a little bit about the history of Materialize. Does that help, Josh?

Josh:
Yeah, it does. I mean, it's funny, because I think the thing I did that broke the stacked views on BigQuery was a bunch of window functions, which BigQuery is also notorious for having issues with. There've been some other recent Slack threads about that, but that's very helpful. I also always appreciate when someone doesn't say, "It does all the things magically," but actually gives an in-depth engineer's response of, well, it does most of the things, and here are the one or two problems, because that actually makes me much more confident that it really is a transparent view of what's going on. So, that was very helpful. Thank you.

Boris Jabes:
TJ. Thanks very much.

TJ:
Okay, cool. Now that the sales pitch is over... it could have been summarized as "it will just work, and the rest is details." No, just kidding. So I want to ask this question. I'm TJ, I'm the head of data for the largest streaming site you've never heard of; two of the moderators know what that means. My question is: you all make this problem worse, and so I wanted to ask about it. How much is data provenance an actual issue that will need to be solved in a world where everyone has adopted the modern data stack, where data goes from Postgres, to Fivetran moving it to the data warehouse, to dbt transforming it into something else, to then getting exported by Census to Salesforce or whatever else? How much is data provenance a real issue that people will be solving, or is it just a problem people have invented?

Boris Jabes:
I think provenance is a fancy problem. I think the main problem I encounter, or that I see in the wild, is: do you have overlapping readers and writers of the same information? That tends to be where your problems are. It's more that if you could just write those down, you'd probably be okay, and then try to limit where the overlap is. But I haven't had... I don't know. I might be with you that provenance of data is not... I think at the end system, people want to know. The more the system is distributed and it's unclear how things get where, the more it's a problem. People have asked us this: Census sends data to Salesforce, and they say, "Hey, there are other things sending data into my Salesforce. Can you tell me what those are?" Because they don't even know, because it's these random things that have been plugged into it.

Boris Jabes:
So the more you create a single, let's call it end-to-end pipeline, the less you have that problem. And then I think it's more about what your coalescing strategy is for things where there are multiple writers. But I don't know that provenance [inaudible 01:14:51].

TJ:
Okay. Tell me, George.

George Fraser:
Well, I would define provenance as this problem: you're looking at a number and you want to know, where did this come from? And in the example of what Census does, sending data to places like Salesforce and Marketo and stuff like that, at the moment this problem is hopeless. These concepts don't even exist in those systems, so there's nothing to be done. I think that in BI tools and within data warehouses, there are things to be done here. It is true that systems like Fivetran and dbt, by making it easy to get data from so many places and write so many transformations stacked upon one another, make this problem worse. You get to do lots of good stuff, but now trying to figure out where a number came from is very complicated.

George Fraser:
There is hope, though, because these systems are based on code and they're automated. You can, in principle, trace this a lot better than you could with the previous generation of tools, which were a lot of Python scripts loading CSVs from here and there. So I think there's a path to solving this. This is something interesting that I talk about a lot: how do we connect the dots here? Fivetran knows all these things about where the data came from. dbt knows all these things about how the data was transformed from stage to stage. How do we connect these dots in an automated way so that as a user, you don't have to do anything, but you can, getting back to bringing the concepts of software engineering to data, find ancestors of [crosstalk 01:16:37].
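The "stack trace for a column" George is gesturing at reduces to a graph problem: if each tool records which upstream columns feed each downstream column, provenance is just a traversal. A minimal sketch, with an entirely hypothetical lineage format and column names:

```python
# Hypothetical column-lineage graph: each downstream column maps to the
# upstream columns it was derived from (what Fivetran + dbt metadata
# could in principle provide, stitched together).
lineage = {
    "analytics.revenue.customer": ["staging.stripe_charges.customer_id"],
    "staging.stripe_charges.customer_id": ["salesforce.Account.Id"],
}

def ancestors(column, graph):
    """Walk the lineage graph back toward the original source columns."""
    out = []
    for parent in graph.get(column, []):
        out.append(parent)
        out.extend(ancestors(parent, graph))
    return out

print(ancestors("analytics.revenue.customer", lineage))
# ['staging.stripe_charges.customer_id', 'salesforce.Account.Id']
```

The hard part in practice is not the traversal but getting every tool in the pipeline to emit edges like these automatically, which is exactly the "connect the dots" problem being discussed.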

Boris Jabes:
George, hold on. I'm going to push back. When you use a piece of software, you don't need to know... The software presents some information to you, and you just take it as correct. You don't need to know why that answer is that answer. When you're debugging it, sure. But as a consumer of it, do I really care?

George Fraser:
Yeah. There is a point at which you forget all of this, and you just have the number, and it had better be right. And maybe for Census, sending data back out to operational systems, Census just needs to work correctly. But other than that, you're past that threshold. We're now outside the gate of user-written code, and so this problem no longer exists. But when you're inside the data warehouse, it would be really [crosstalk 01:17:19].

Boris Jabes:
Yes. I think you're right. Inside the software. Sure.

George Fraser:
If I could look at a column and say, please tell me where this column came from, going all the way back to Salesforce or whatever it came from.

Boris Jabes:
Yeah. You want a stack trace. I'm with you. I agree. That is nice.

George Fraser:
I [inaudible 01:17:32] you. If you're a software engineer and you come into this world, you're like, "Oh, my God! I can't actually do code navigation. None of these things exist here. How does anyone ever debug anything?" But I think there's hope, because the more we have automated systems like Fivetran and dbt that are principled in the way they're built, the more it actually becomes possible to solve this problem.

Tristan Handy:
The thing I worry about when it comes to provenance is that people want it to be very... it's one thing or the other. So, you think about, okay, where did this... There's a process in modeling that's frequently called stitching. You take a concept, and frequently it's the concept of a person, a user, or a customer, or something, and you stitch together the information you have from multiple different systems. And sometimes you actually have to create a pseudo customer ID from three different systems. We actually do this; unfortunately, we have three different payment processing systems internally at this very moment.

Tristan Handy:
And so when you ask the question, where does this customer ID come from? Literally, there's no simple way to answer that question. You can't actually say, "Well, it's from Stripe," or it's from whatever. You must read the case statement, or the coalesce statement, that constructs that customer ID. And there's no way to get simpler than that.
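The coalesce statement Tristan describes can be made concrete with a tiny Python analogue of SQL's `COALESCE` (first non-null wins). The three payment-system field names below are hypothetical stand-ins for the real ones:

```python
# Stitching a pseudo customer ID across three payment systems: the
# "provenance" of the result is literally this expression, nothing simpler.
def coalesce(*values):
    """Return the first non-None value, like SQL's COALESCE."""
    for v in values:
        if v is not None:
            return v
    return None

row = {"stripe_id": None, "braintree_id": "bt_42", "paypal_id": "pp_7"}
customer_id = coalesce(row["stripe_id"], row["braintree_id"], row["paypal_id"])
print(customer_id)  # bt_42
```

Which of the three IDs "won" for any given row depends entirely on which fields happened to be null, which is why the only honest answer to "where did this ID come from?" is to show the expression itself.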

George Fraser:
Totally.

Boris Jabes:
That's a good one.

Tristan Handy:
That's why the answer to this looks like the tools that software engineers use: show me the dependencies of this... Show me the upstream references of this entity and show me the code, because that's all you can show.

Boris Jabes:
Hey, George, wouldn't it be nice, though, if SQL as a language could be analyzed in that way, because [crosstalk 01:19:24].

Tristan Handy:
We'll get there.

Boris Jabes:
Yeah, I think that's a really good-

Arjun Narayan:
Boris, I built that, but I won't bore you with it.

Boris Jabes:
... what? Okay. I mean, that is not boring to me, but we'll get into that some other time, you and me, because I find that very interesting. By the way, Tristan, you were talking about coalescing across three billing systems. Here's where it gets messier: people don't necessarily have strong identifiers at all. We end up in this situation where we see people going, "Well, can I just do some fuzzy thing?" And I'm like, "I don't know if I want to give you that power." Actually, I'm trying to tell them no, this is bad for you. But at the same time, you're just trying to help people get something done, and they're like, "That just doesn't happen for me."

Tristan Handy:
Yeah, that is so a thing. Sometimes the answer is, "The answer to this is extremely complicated because you need to make choices in your business about what a customer is. You need to decide on a definition of a customer. This is not a technical problem. This is a leadership problem at your company."

Boris Jabes:
Yeah. When you zoom way out, that really does come up a lot, where people just don't even know what the definitions are. I heard a really good line once. TJ, you might've seen this in your work too. In some ways, the definition is correct when everybody in the business is equally unhappy; that's when you've got it.

Tristan Handy:
I think we've got a question from Josh here.

Boris Jabes:
Josh, are you back? I think he just wants Arjun to get more... We might just be encountering Clubhouse here.

Tristan Handy:
[crosstalk 01:21:19] consistent.

Arjun Narayan:
I know that everybody else in all of Clubhouse land is on the West Coast, but I'm on the East Coast and it's fricking 10:23, guys.

Boris Jabes:
No. I think we just got into it and I think it's very reasonable to say this is the end of our inaugural episode with... Now we'll have to do the Sequel to the Sequel. But I guess, I'll say, thanks everybody for stopping in. TJ, thanks for your question. And thanks everybody else before that, for your questions. And we'll do this again sometime. This is hopefully the first of many.

Arjun Narayan:
This has been fun.

Tristan Handy:
Awesome. Thanks everybody. See you.

George Fraser:
Thanks everyone.

Boris Jabes:
Thanks everybody.