
4 methods for exporting CSV files from Databricks | Census

Michel Zurkirchen · October 06, 2021

Michel Zurkirchen is an Amsterdam-based digital analyst and a regular contributor to the Census blog. After starting his career in digital marketing, Michel started learning Python and has been an enthusiastic learner and practitioner of data analytics ever since.

What you'll learn in this article: How to export a CSV from Databricks using the following four methods:

  1. Databricks Notebook
  2. Databricks command-line interface
  3. Dump Tables via JSpark
  4. External client tools

If you’re reading this article (hi 👋), chances are you have some reason to want to export dataframes to CSV files. You might just be here because someone on your team needs a quick snapshot of how a recent ads campaign went, or because you want to aggregate some event data about customer behavior.

Regardless of what use cases you're fueling or why you want to export your dataframes as CSV files, we’re here to help (assuming you’re a Databricks user; if not, check out our other tutorials).

As you already know, Databricks offers a lot of awesome functionality that makes it a powerful tool for most teams, including the ability to export dataframes to CSV files. Unfortunately, this feature is kinda difficult to get to. Rather than continue to try and puzzle it out, make your life easier and check out these four ways of getting the job done.

Method #1 for exporting CSV files from Databricks: Databricks Notebook

Databricks Notebook is Databricks's version of an IPython Notebook and comes with the same functionalities, such as manipulating and exporting data. Once you're done manipulating your data and want to download it, you can go about it in two different ways:

  1. Downloading your full dataset (if you have less than 1 million rows) using two lines of Python.
  2. Exporting your dataset to DBFS (if you have more than 1 million rows of data) and then downloading it using two lines of Python and a non-intuitive approach (AKA an admittedly wonky URL). I’ll explain more in a sec.

The first (and easier) method goes like this.


# Loading a table called fl_insurance_sample into the variable df
df = spark.table('fl_insurance_sample')

# Displaying a preview of the data
display(df.select("*"))

Underneath the preview, you'll see a download button, with an arrow to the right. If you click the arrow, you'll get a small dropdown menu with the option to download the full dataset.

If you have less than 1 million rows, you’re all set! 🎉 Pat yourself on the back and go get another coffee.

If you have more than 1 million rows, you’re going to need the code below instead.


# Loading a table called fl_insurance_sample into the variable df
df = spark.table('fl_insurance_sample')

# Storing the data in one CSV on the DBFS
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/fl_insurance_sample.csv")

If your dataset is large enough, Spark will split it across multiple files (it writes one file per partition). Using .coalesce(1) forces Databricks to write all your data into a single file (note: this is completely optional). .coalesce(1) will save you the hassle of combining your data later, though it can potentially lead to an unwieldy file size.
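
If you want to see exactly what Spark wrote (and the name of the part file it produced), you can list the output path from the notebook. Here's a minimal sketch, assuming you're running inside a Databricks notebook (where dbutils is available) and that you used the same path as above.


# Listing the contents of the output path.
# Spark treats the .save() path as a folder and writes one part-*.csv file per partition,
# plus marker files such as _SUCCESS.
for f in dbutils.fs.ls("dbfs:/FileStore/df/fl_insurance_sample.csv"):
    print(f.name, f.size)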

Using .option("header", "true") is also optional, but can save you some headache. It tells Databricks to include the headers of your table in the CSV export, without which, who knows what’s what?

Other options that you may find useful are:

  • .option("delimiter", "your_delimiter"): Define a custom delimiter if, for example, existing commas in your dataset are causing problems. If this is the case, try a different character, such as "|".
  • .option("compression", "gzip"): Compress files using GZIP to reduce the file size.
  • .option("nullValue", "replacement_value"): Replace null values with the value you define.
  • .option("escape", "escape_char"): Set the character used to escape quotes inside values.
  • .option("encoding", "utf-8"): Set the character encoding; utf-8 is the default.

You can chain these into a single line as such.


df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "\t").option("compression", "gzip").save("dbfs:/FileStore/df/fl_insurance_sample.csv")

Once you've exported the CSV file to the DBFS, you can download it by navigating to a URL of the following form: https://<instance-name>/files/df/fl_insurance_sample.csv. Keep in mind that Spark treats the path you pass to .save() as a folder, so if that URL doesn't resolve, point it at the part-*.csv file inside the folder instead (the listing sketch above will show you its exact name).

If your URL contains ?o= followed by a set of numbers, use https://<instance-name>/files/df/fl_insurance_sample.csv?o=###### instead, replacing # with the numbers in your URL.
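
If you build these URLs often, a small helper keeps the bookkeeping straight. This is just a sketch; the instance name and workspace ID below are placeholders for your own values.


# Building the download URL for a file saved under dbfs:/FileStore/
instance_name = "<instance-name>"                  # placeholder: your workspace's hostname
workspace_id = ""                                  # placeholder: the digits after ?o=, if your URL has them
path_in_filestore = "df/fl_insurance_sample.csv"   # path relative to dbfs:/FileStore/

url = f"https://{instance_name}/files/{path_in_filestore}"
if workspace_id:
    url += f"?o={workspace_id}"
print(url)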

Yes, that's the official way. Yes, it's weird that this is the default way to do it. No, it's not intuitive.

Regardless, it works. But in case you don’t want to deal with it, here’s another method.

Method #2 for exporting CSV files from Databricks: Databricks command-line interface

Databricks lets you do a great number of things through the command-line interface (CLI), including exporting a CSV.

Note: This method is suited for situations in which you already have a CSV file in the DBFS and you need to transfer it elsewhere (either your local machine or another location). The CLI requires Python.

You can also combine it with the previous option if you don't like to mess around with the URL in the previous step (can’t blame you). You'll first need to install the CLI with the following command.


pip install databricks-cli

After installing the CLI, you'll need to authenticate. To do so, create a personal access token in your Databricks user settings, then execute the command below.


databricks configure --token

This will prompt you for your workspace URL, which has the format https://<instance-name>.cloud.databricks.com, as well as your personal access token.
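
Behind the scenes, the CLI saves these values to a profile file (typically ~/.databrickscfg), which looks roughly like the snippet below. The host and token here are placeholders for your own values.


[DEFAULT]
host = https://<instance-name>.cloud.databricks.com
token = <your-personal-access-token>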

You're now all set to copy an existing CSV file out of DBFS with the following command.


databricks fs cp dbfs:/your_folder/your_file.csv destination/your_folder/your_file.csv

If you have a lot of CSV files in a folder and would rather export the entire folder than individual files, add the -r (recursive) flag and point the command at the folder instead.


databricks fs cp -r dbfs:/your_folder destination/your_folder

And there you go! You’ll now have at least one CSV file that you can copy to your local machine or move to another destination as needed.
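
If you'd rather script these copies than type them by hand, you can wrap the same CLI commands from Python. A minimal sketch, assuming the CLI is installed and configured as above and that the paths are placeholders for your own.


import subprocess

def download_from_dbfs(dbfs_path, local_path, recursive=False):
    """Copy a file or folder from DBFS to a local path via the databricks CLI."""
    cmd = ["databricks", "fs", "cp"]
    if recursive:
        cmd.append("-r")
    subprocess.run(cmd + [dbfs_path, local_path], check=True)  # raises if the copy fails

# Placeholder paths - swap in your own
download_from_dbfs("dbfs:/your_folder/your_file.csv", "your_folder/your_file.csv")
download_from_dbfs("dbfs:/your_folder", "your_folder", recursive=True)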

Method #3 for exporting CSV files from Databricks: Dump Tables via JSpark

This method is similar to #2, so check it out if using the command line is your jam. Here, we’ll use JSpark through the command line, though it’s based on Java instead of Python. The drawback is that JSpark will only allow you to export the CSV file to your local machine. Luckily, the JSpark GitHub is well documented, so it's fairly easy to install - that is if you're the sort of person who is comfortable with installing via GitHub.

Once you've installed JSpark you can run a SQL query and save the result to a CSV file with one line of code.


java -Dconfig.file=mycluster.conf -jar jspark.jar -q "SELECT * FROM fl_insurance_sample" -f csv -o fl_insurance_sample.csv

Here’s a breakdown of the flags in the command above:

  • -Dconfig.file --> an optional custom config file.
  • -q --> the SQL query to run.
  • -f --> the output format. Other options are json, xml, html and simple.
  • -o --> the output target, either stdout or a filename.
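
If you need to dump several tables this way, you could loop over the same command from a short Python script. A sketch, assuming jspark.jar and mycluster.conf sit in your working directory and the table names are placeholders.


import subprocess

tables = ["fl_insurance_sample", "another_table"]  # placeholder table names

for table in tables:
    # Runs the same JSpark command as above, once per table
    subprocess.run(
        ["java", "-Dconfig.file=mycluster.conf", "-jar", "jspark.jar",
         "-q", f"SELECT * FROM {table}", "-f", "csv", "-o", f"{table}.csv"],
        check=True,
    )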

As I said, this is really similar to the previous method, so you can take whatever route is easiest for you based on your day-to-day tooling.

Method #4 for exporting CSV files from Databricks: External client tools

The final method is to use an external client tool that supports either JDBC or ODBC. One convenient example of such a tool is Visual Studio Code, which has a Databricks extension. This extension comes with a DBFS browser, through which you can download your (CSV) files. If VS Code is not your IDE of choice, you can also go with this standalone DBFS Explorer. Both of these options will require some setup, after which they will provide you with an easy-to-navigate file browser. Go with your heart here and choose the tool that you find easiest to use. ❤️

Final thoughts and tips for exporting your CSV files

I hope you found this article useful and you've successfully exported your data from Databricks to a CSV file. If, however, you ran into issues or found that you're just not that happy with the methods I’ve suggested, there’s a better way: Census.

Census lets you easily sync data from Databricks to just about any other destination you can imagine, all without having to fuss with the destination’s API (or continuously Google articles like this every time you need to get data out of Databricks).

If you find the above methods a little tedious (as I generally do), go ahead and give Census a spin and see if it’s more your style.
