Michel Zurkirchen is an Amsterdam-based digital analyst and a regular contributor to the Census blog. After starting his career in digital marketing, Michel started learning Python and has been an enthusiastic learner and practitioner of data analytics ever since.
What you'll learn in this article: How to export a CSV from Databricks using the four following methods:
- Databricks Notebook
- Databricks command-line interface
- Dump Tables via JSpark
- External client tools
If you’re reading this article (hi 👋), chances are you have some reason to want to export CSV files for data analysis. You might just be here because someone on your team needs a quick snapshot of how a recent ads campaign went for data analytics, or because you want to aggregate some event data about customer behavior.
As you already know, Databricks offers a lot of awesome functionality that makes it a powerful tool for most teams, including the ability to export data to a CSV file. Unfortunately, this feature is kinda difficult to get to. Rather than continue to try and puzzle it out, make your life easier and check out these four ways of getting the job done.
Method #1 for exporting CSV files from Databricks: Databricks Notebook
Databricks Notebook is Databricks's version of an IPython Notebook and comes with the same functionalities, such as manipulating and exporting data. Once you're done manipulating your data and want to download it, you can go about it in two different ways:
- Downloading your full dataset (if you have less than 1 million rows) using two lines of Python.
- Exporting your dataset to DBFS (if you have more than 1 million rows of data) and then downloading it using two lines of Python and a non-intuitive approach (AKA an admittedly wonky URL). I’ll explain more in a sec.
The first (and easier) method goes like this.
# Loading a table called fl_insurance_sample into the variable dfdf = spark.table('fl_insurance_sample') # Displaying a preview of the data display(df.select("*"))
Underneath the preview, you'll see a download button, with an arrow to the right. If you click the arrow, you'll get a small dropdown menu with the option to download the full dataset.
If you have less than 1 million rows, you’re all set! 🎉 Pat yourself on the back and go get another coffee.
If you have more than 1 million rows, you’re going to need the code below instead.
# Loading a table called fl_insurance_sample into the variable dfdf = spark.table('fl_insurance_sample') # Storing the data in one CSV on the DBFSdf.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/fl_insurance_sample.csv")
If your dataset is large enough, Databricks will want to split it across multiple files. Using
.coalesce(1) forces Databricks to write all your data into one file (Note: This is completely optional).
.coalesce(1) will save you the hassle of combining your data later, though it can potentially lead to unwieldy file size.
.option("header", "true") is also optional, but can save you some headache. It tells Databricks to include the headers of your table in the CSV export, without which, who knows what’s what?
Other options that you may find useful are:
.option("delimiter", "your_delimiter"): Define a custom delimiter if, for example, existing commas in your dataset are causing problems. If this is the case, try a different character, such as "|".
.option("compression", "gzip"): Compress files using GZIP to reduce the file size.
.option("nullValue", "replacement_value"): Replace null values with the value you define.
.option("escape", "escape_char"): Escape specific characters.
.option("encoding", "utf-8"): By default set to utf-8.
You can chain these into a single line as such.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "\t").option("compression", "gzip").save("dbfs:/FileStore/df/fl_insurance_sample.csv")
Once you've exported the CSV file to the DBFS, you can navigate to it by altering the following URL: https://<instance-name>/files/df/fl_insurance_sample.csv.
If your URL contains ?o= followed by a set of numbers, use https://<instance-name>/files/df/fl_insurance_sample.csv?o=###### instead, replacing # with the numbers in your URL.
Yes, that's the official way. Yes, it's weird that this is the default way to do it. No, it's not intuitive.
Regardless, it works. But in case you don’t want to deal with it, here’s another method.
Method #2 for exporting CSV files from Databricks: Databricks command-line interface
Databricks lets you do a great number of things through the command-line interface (CLI), including exporting a CSV.
Note: This method is suited for situations in which you already have a CSV file in the DBFS and you need to transfer it elsewhere (either your local machine or another location). The CLI requires Python.
You can also combine it with the previous option if you don't like to mess around with the URL in the previous step (can’t blame you). You'll first need to install the CLI with the following command.
pip install databricks-cli
After installing the CLI, you'll need to authenticate. To do so, grab your personal access token. Execute the command below once you've created one.
databricks configure --token
This will prompt you for your workspace URL, which has the format https://<instance-name>.cloud.databricks.com, as well as your personal access token.
You're now all set to export an existing CSV file from DBFS, which you can use in the following command.
databricks fs cp dbfs:/your_folder/your_file.csv destination/your_folder/your_file.csv
If you have a lot of CSV files placed in a folder and you'd prefer to export the entire folder, rather than individual files, you can do so with the
-r flag and select your folder instead of an individual file.
databricks fs cp -r dbfs:/your_folder destination/your_folder
And there you go! You’ll now have at least one CSV file that you can copy to your local machine or move to another destination as needed.
Method #3 for exporting CSV files from Databricks: Dump Tables via JSpark
This method is similar to #2, so check it out if using the command line is your jam. Here, we’ll use JSpark through the command line, though it’s based on Java instead of Python. The drawback is that JSpark will only allow you to export the CSV file to your local machine. Luckily, the JSpark GitHub is well documented, so it's fairly easy to install - that is if you're the sort of person who is comfortable with installing via GitHub.
Once you've installed JSpark you can run a SQL query and save the result to a CSV file with one line of code.
java -Dconfig.file=mycluster.conf -jar jspark.jar -q "SELECT * FROM fl_insurance_sample" -f csv -o fl_insurance_sample.csv
Here’s a breakdown of what’s happened in the code above:
-Dconfig.file--> optional custom config file.
-f--> output format. Other options are json, xml, html and simple.
-o--> output string, either stdout or a filename.
As I said, this is really similar to the previous method, so you can take whatever route is easiest for you based on your day-to-day tooling.
Method #4 for exporting CSV files from Databricks: External client tools
The final method is to use an external client tool that supports either JDBC or ODBC. One convenient example of such a tool is Visual Studio Code, which has a Databricks extension. This extension comes with a DBFS browser, through which you can download your (CSV) files. If VS Code is not your IDE of choice you can also go with this standalone DBFS Explorer. Both of these options will require some setup, after which they will provide you with an easy-to-navigate file browser. Go with your heart here and choose the tool that you find easiest to use. ❤️
Final thoughts and tips for exporting your CSV files
I hope you found this article useful and you've successfully exported your data from Databricks to a CSV file. If, however, you ran into issues or found that you're just not that happy with the methods that I’ve suggested there’s a better way: Census.
Census lets you easily sync data from Databricks to just about any other destination you can imagine, all without having to fuss with the destination’s API (or continuously Google articles like this every time you need to get data out of Databricks).
If you find the above methods a little tedious (as I generally do), go ahead and give Census a spin and see if it’s more your style.