
How to use SQL for geographic place recommendations

Parker Rogers · October 08, 2021

Parker is a data community advocate at Census with a background in data analytics. He's interested in finding the best and most efficient ways to make use of data, and in helping other data folks in the community grow their careers.

In this article, you'll learn how to use SQL to send highly personalized email drip campaigns using geographic place recommendations and the haversine function. Specifically, we'll break down:

  • What is the haversine formula?
  • How to find users who Yelp within five miles of a business
  • How to find users who have Yelped more than five times about businesses like ours
  • How to find users who leave at least a four-star average rating
  • How to operationalize our resulting data at scale

Let’s say you have some lead data containing geographic coordinates. What can you do with it? How can it help your business? With a little SQL skill and the implementation of the haversine formula, you can query through your data for the following digital marketing use cases:

  • Send highly personalized email drip campaigns based on user location
  • Improve product analytics tools by adding custom location data to each user
  • Create geo-targeted digital ads that are more likely to convert

We'll write a query to help us with the first geotargeting use case: sending highly personalized email drip campaigns based on user location.

This cheat sheet will help you understand how to use SQL and the haversine formula for geographic place recommendations (AKA geotargeting). At the end, we'll also help you operationalize this data by syncing it to frontline tools like your CRM, product analytics platform, and more.

Short on time (or just prefer to learn via video vs write-ups)? Check out the TL;DR episode for this article.

Now let’s get started!

What is the haversine formula?

The haversine formula calculates the great-circle distance, in kilometers, between two locations from their latitude and longitude coordinates. For example, it can help us calculate the distance between two major landmarks (like the Eiffel Tower and the Statue of Liberty, as seen in the example below).
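For two points (lat₁, lon₁) and (lat₂, lon₂), the formula works out to d = 2r · asin(√(sin²(Δlat/2) + cos(lat₁) · cos(lat₂) · sin²(Δlon/2))), where r is the Earth's radius (about 6,371 km). You rarely need to write that out by hand, though. The queries in this article assume a Snowflake warehouse, which ships with a built-in HAVERSINE(lat1, lon1, lat2, lon2) function that returns the distance in kilometers, so a quick sanity check of the landmark example looks like this (coordinates approximate):

-- Eiffel Tower: (48.8584, 2.2945); Statue of Liberty: (40.6892, -74.0445)
-- Snowflake's built-in HAVERSINE returns the great-circle distance in kilometers
SELECT haversine(48.8584, 2.2945, 40.6892, -74.0445) AS distance_km;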

According to the haversine formula, the two landmarks are 5,837 KM apart. Simple enough, right?

Now let’s use our example Yelp datasets and the haversine formula to upgrade our outreach campaigns and send granular, highly personalized email drip campaigns based on user location to improve conversions. Let's take a look at how geotargeting works with SQL.

How to build geographic place recommendations with SQL

If you're a SQL expert, or if you just want to see how the haversine formula works with our example Yelp dataset, feel free to skip ahead to the haversine formula query (no judgment).

If you're in it for the full SQL learning experience, let's dive in.

Understand the database tables’ meanings & relationships

Before we query anything, we need to understand our data. Here’s a summary of the three Yelp tables we'll use (the fields listed are the ones that appear in the queries throughout this article):

  • content.yelp.users: one row per Yelp user, keyed by user_id
  • content.yelp.reviews: one row per review, keyed by review_id, with user_id and business_id foreign keys and a stars rating
  • content.yelp.businesses: one row per business, keyed by business_id, with name, state, city, latitude, longitude, and categories fields

All three tables have unique identifiers that allow us to join between them as needed. The businesses table also has geographic fields (latitude and longitude), which we will use in the haversine formula.
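If you'd like to sanity-check those keys before writing the real query, a quick count of distinct identifiers in each table (a throwaway check, not part of the final query) could look like this:

-- Count distinct keys in each table to confirm the identifiers we'll join on
SELECT
    (SELECT COUNT(DISTINCT user_id) FROM content.yelp.users) AS unique_users,
    (SELECT COUNT(DISTINCT review_id) FROM content.yelp.reviews) AS unique_reviews,
    (SELECT COUNT(DISTINCT business_id) FROM content.yelp.businesses) AS unique_businesses;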

Conceptualize the data you’ll need

We want to write a query that will allow us to send highly personalized email drip campaigns with location targeting. First, we need to identify what we know about our target audience. We want reviews from folks who live close to our business (not just tourists), actively use Yelp, and generally like restaurants like our own. So, we can break down the characteristics of these users as follows:

  1. Find users who Yelp within five miles of our business using the haversine formula
  2. Find users who have Yelped more than five times about businesses like ours
  3. Find users who give at least four-star reviews on average

Now it's time to get our hands dirty with some SQL.

Step #1: Find users who Yelp within five miles of our business

First, we need to create a CTE (the WITH clause below) for our business, which we'll name my_bus so we can reference it later in the haversine formula. To do this, we'll use the following code:


WITH my_bus AS (
    SELECT
        business_id,
        name,
        state,
        city,
        latitude,
        longitude
    FROM
        content.yelp.businesses
    WHERE
        business_id = 'YJ8ljUhLsz6CtT_2ORNFmg')

Note: If you've been to Las Vegas, this information probably looks pretty familiar. 😉 We're using the famous Gordon Ramsay Pub & Grill in Las Vegas, Nevada.

The SQL above generates the following table, including the specific location of our business.

| BUSINESS_ID | NAME | STATE | CITY | LONGITUDE | LATITUDE |
| --- | --- | --- | --- | --- | --- |
| YJ8ljUhLsz6CtT_2ORNFmg | Gordon Ramsay Pub & Grill | NV | Las Vegas | -115.175757 | 36.117417 |

Next, we will use the JOIN clause to connect the three Yelp tables using the unique identifiers of each table.


SELECT
    *
FROM
    content.yelp.users AS user1
    JOIN content.yelp.reviews AS rev1
        ON user1.user_id = rev1.user_id
    JOIN content.yelp.businesses AS bus1
        ON rev1.business_id = bus1.business_id
ORDER BY
    user1.user_id DESC
LIMIT 1000

The output of this query shows unique reviews (rev1) as well as the corresponding user (user1) and the corresponding business (bus1).

We'll then add these ID fields to the SELECT statement:


SELECT   
    rev1.review_id,   
    rev1.user_id,   
    user1.user_id,   
    rev1.business_id,   
    bus1.business_id

Here’s what the query output looks like with the ID fields in the SELECT statement. Notice that the respective user and business IDs match for each row.

Next, we join our temporary table (my_bus) using the haversine formula. This can look tricky at first glance, so let’s break it up into two parts:

Part one: Create a JOIN clause containing the haversine formula:

  • Pass the latitude and longitude coordinates of each reviewed business (bus1)
  • Pass the latitude and longitude coordinates of our business (my_bus)
  • Convert the haversine calculation from kilometers to miles

Here's the code for this part.


JOIN
    my_bus
    -- distance from each business (bus1) to our business (my_bus), in miles
    ON (haversine(bus1.latitude, bus1.longitude, my_bus.latitude, my_bus.longitude)*0.621371)

This JOIN clause goes through each individual row of data and calculates the geographic distance between the reviewed business (bus1) and our business (my_bus). The multiplication at the end of the JOIN converts the haversine value from kilometers to miles (1 km ≈ 0.621371 miles).

You can now view the results of the haversine formula by adding it as a field in the SELECT statement.


SELECT
    (haversine(bus1.latitude, bus1.longitude, my_bus.latitude, my_bus.longitude)*0.621371),
    *

Part two: Include operators and conditions to help us find users who Yelp within five miles of our business.

  • Use the “<” operator to only join businesses that are within five miles of our business
  • Use an AND condition to exclude our own business (my_bus) from the results, since we are looking for users who haven’t yelped about our business

To do this, we'll use the following code:


JOIN
    my_bus
    ON (haversine(bus1.latitude, bus1.longitude, my_bus.latitude, my_bus.longitude)*0.621371) < 5
    AND bus1.business_id <> my_bus.business_id

Now we can put it all together. So far, our full SQL query looks like this:


WITH my_bus AS (
    SELECT
        business_id,
        name,
        state,
        city,
        latitude,
        longitude
    FROM
        content.yelp.businesses
    WHERE
        business_id = 'YJ8ljUhLsz6CtT_2ORNFmg')
SELECT
    (haversine(bus1.latitude, bus1.longitude, my_bus.latitude, my_bus.longitude)*0.621371),
    *
FROM
    content.yelp.users AS user1
    JOIN content.yelp.reviews AS rev1
        ON user1.user_id = rev1.user_id
    JOIN content.yelp.businesses AS bus1
        ON rev1.business_id = bus1.business_id
    JOIN my_bus
        ON (haversine(bus1.latitude, bus1.longitude, my_bus.latitude, my_bus.longitude)*0.621371) < 5
        AND bus1.business_id <> my_bus.business_id
ORDER BY
    user1.user_id DESC
LIMIT 1000

Congrats! You now know how to use the haversine formula for geotargeting! 🎉 We've officially used it to complete part one of our use case: sending highly personalized drip campaigns to our users based on their foot traffic (location).

✅  Find users who Yelp within five miles of our business

⬜  Find users who have Yelped more than five times about businesses like ours

⬜  Find users who give at least four-star reviews on average

Next, we'll blitz through steps two and three of our use case.

Step #2 & #3: Find users who have Yelped more than five times about businesses like ours w/ at least a four-star average

Next, we'll find users who have a history of visiting local businesses like ours so we can further tailor our marketing campaigns. For the sake of code efficiency, we'll also find folks who leave four-star reviews on average in the same sweep.

To start, we'll use a WHERE clause so our query only includes businesses in the same category as ours: sit-down restaurants.

Here's the code:


WHERE   
    lower(bus1.categories) LIKE ANY ('%restaurant%', '%food%')   
    AND NOT lower(bus1.categories) LIKE ANY( '%convenience%', '%shopping%', '%grocery%', '%fast%' )

We'll then use a GROUP BY statement so our query shows unique users rather than unique reviews.


GROUP BY   
	user1.user_id

Note: Don’t run your SQL code yet. As written, the GROUP BY will break the query, because every field in the SELECT statement must either appear in the GROUP BY or be wrapped in an aggregate function. First, change the fields in your SELECT statement to the following:


SELECT
    user1.user_id AS user_id,
    AVG(haversine(bus1.latitude, bus1.longitude, my_bus.latitude, my_bus.longitude)*0.621371) AS avg_distance_from_my_bus,
    COUNT(rev1.review_id) AS total_reviews_near_my_restaurant,
    AVG(rev1.stars) AS avg_stars_in_restaurant_category

These new fields will provide the following information:

  • user_id: The unique ID of a user
  • avg_distance_from_my_bus: The average distance between our business and the Yelp businesses the user reviewed
  • total_reviews_near_my_restaurant: The total number of Yelp reviews the user has left for businesses similar to ours
  • avg_stars_in_restaurant_category: The average star rating the user gives across those reviews

Finally, we'll add a HAVING clause. This clause checks the box on both step #2 and step #3, narrowing down to folks who:

  1. Have reviewed more than five businesses near our business
  2. Have an average star rating of at least four

Here's the code:


HAVING   
    total_reviews_near_my_restaurant > 5   
    AND avg_stars_in_restaurant_category >= 4

Phew! That was a lot of SQL. You've completed the three tasks necessary to send highly personalized email drip campaigns based on user location:

✅  Find users who yelp within 5 miles of our business

✅  Find users who have yelped 5+ times about businesses like ours

✅  Find users who give at least four-star reviews on average

Here’s what our full query looks like:


WITH my_bus AS (
    SELECT
        business_id,
        name,
        state,
        city,
        latitude,
        longitude
    FROM
        content.yelp.businesses
    WHERE
        business_id = 'YJ8ljUhLsz6CtT_2ORNFmg')
SELECT
    user1.user_id AS user_id,
    AVG(haversine(bus1.latitude, bus1.longitude, my_bus.latitude, my_bus.longitude)*0.621371) AS avg_distance_from_my_bus,
    COUNT(rev1.review_id) AS total_reviews_near_my_restaurant,
    AVG(rev1.stars) AS avg_stars_in_restaurant_category
FROM
    content.yelp.users AS user1
    JOIN content.yelp.reviews AS rev1
        ON user1.user_id = rev1.user_id
    JOIN content.yelp.businesses AS bus1
        ON rev1.business_id = bus1.business_id
    JOIN my_bus
        ON (haversine(bus1.latitude, bus1.longitude, my_bus.latitude, my_bus.longitude)*0.621371) < 5
        AND bus1.business_id <> my_bus.business_id
WHERE
    lower(bus1.categories) LIKE ANY ('%restaurant%', '%food%')
    AND NOT lower(bus1.categories) LIKE ANY ('%convenience%', '%shopping%', '%grocery%', '%fast%')
GROUP BY
    user1.user_id
HAVING
    total_reviews_near_my_restaurant > 5
    AND avg_stars_in_restaurant_category >= 4
ORDER BY
    user1.user_id DESC
LIMIT 1000

So you’ve successfully completed the query. Now what? Well, you could export the data as a CSV and then import it into your CRM, or shuttle the file around from your command line. (If your eyes are glazing over, don't worry. There's a better way.)
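If you're curious what that manual route looks like, here's a rough sketch on Snowflake. The table and stage names (vip_yelpers, my_stage) are hypothetical, and it assumes you've first saved the final query's results as a table:

-- Hypothetical manual export on Snowflake: unload the saved results to a
-- named stage as a single CSV with a header row.
COPY INTO @my_stage/vip_yelpers.csv
FROM (SELECT * FROM vip_yelpers)
FILE_FORMAT = (TYPE = CSV)
HEADER = TRUE
SINGLE = TRUE;

-- Then, from SnowSQL on your machine, download the file:
GET @my_stage/vip_yelpers.csv file:///tmp/;

Every time your audience changes, you get to repeat that dance by hand, which is exactly why the next section exists.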

The better way to operationalize your user data: Reverse ETL

You just did a lot of SQL work to get a usable, granular set of customer data; don't waste all that work now. Instead, use reverse ETL to easily move your data from your warehouse out to your frontline tools to quickly and reliably fuel your email campaigns (and every other outreach effort you have on your plate).

At Census, we’ve created a tool that removes the headaches of syncing your data warehouse to your CRM (and beyond). And the best part is you can try it for free.
