More and more companies are using PostgreSQL, one of the world's most popular relational database management systems (RDBMs). Personally, I am a big fan for two reasons. PostgreSQL allows you to easily distribute queries across multiple nodes running in parallel and to create custom procedures, and it also supports advanced SQL and numerous window functions, aggregates, table samplings, and other cost-based query optimizations.
This post will focus on PostgreSQL joins and show you a way to connect two tables that are not related by a foreign key relationship.
As you may know, joins are used to combine data from two different tables. The way you combine them depends on the type of join you use. In order to join two or more tables, they have to be related by a key; if not, it’s not easy to join them. Theoretically, you wouldn’t run into this case, because the whole point of doing a join is to extract data that has a specific relationship between tables. If there is no relationship, why would you need to join?
Here are two of the most common cases that require joining unrelated tables:
- You might need to create your own connection between two tables that store related data but are missing the foreign key. This is a fairly rare scenario that usually results from poor database management; for the data analyst, it’s always useful to have some joining tricks in your toolbox in case it ever comes up.
- The data is going from your database to any number of visualization tools through different channels (data sources). These data channels often have similar or related data, but since they have different sources (e.g., Salesforce, Google Analytics, etc.), you might need to create a relationship between them. In my experience, this is the more common of the two scenarios.
As a data analyst, I often work with different visualization tools that are connected to a database (RedShift in this case) through data stores and sources that pipe data from the database to different dashboards (such as Tableau, Chart.io, Looker, and other applications). We often have to customize a chart to represent data from unrelated tables or even from multiple sources. Once I ran into a case where I had to join Table A and Table B that were not related; that is, they did not have related columns. Below I’ll describe my method of joining these tables by creating an ID column for each table, then doing a classic join on that ID.
My goal in joining these two unrelated tables was to get data in a very specific format for a chart within a dashboard that was connected to a data source. In this case, I couldn’t apply UNION or UNION ALL (which would be my first solution), because each of my SELECT statements had a different number of columns with different data types.
Therefore, my solution was to create a unique ID for each table inside of a subquery, and then do a self-join on that ID, as depicted in the following steps.
1. In your first subquery, create an ID column based on the available data:
WITH actions AS ( SELECT ca.created_at::date as actdate, COUNT(*) AS act, CASE when ca.state IN ('CA', 'WA', 'DC', 'NY', 'IL') THEN ca.state ELSE 'other' END AS state, CASE WHEN ca.current_medium IN ('source1', 'source2', 'source3') THEN ca.current_medium ELSE 'other' END AS source, lower('id-' || ca.created_at::date || '-' || CASE when ca.state IN ('CA', 'WA', 'DC', 'NY', 'IL') THEN ca.state ELSE 'other' END || '-' || CASE WHEN ca.current_medium IN ('source1', 'source2', 'source3') THEN ca.current_medium ELSE 'other' END ) AS id FROM events e JOIN create_action ca ON ca.action_id = e.id WHERE e.type > 0 AND ca.created_at::date > getdate()::date - 60 AND ca.created_at::date <= getdate()::date - 1 GROUP BY actdate, state, source ORDER BY actdate, state, source ),
2. Then create an ID column for the second subquery:
action_view as ( SELECT created_at::date AS viewdate, COUNT(av.userid) AS act_v, CASE when av.state IN ('CA', 'WA', 'DC', 'NY', 'IL') THEN av.state ELSE 'other' END AS state, CASE WHEN av.current_medium IN ('source1', 'source2', 'source3') THEN av.current_medium ELSE 'other' END AS source, lower('id-' || av.created_at::date || '-' || CASE when av.state IN ('CA', 'WA', 'DC', 'NY', 'IL') THEN av.state ELSE 'other' END || '-' || CASE WHEN av.current_medium IN ('source1', 'source2', 'source3') THEN av.current_medium ELSE 'other' END ) AS idp FROM act_view av WHERE created_at::date > getdate()::date - 60 AND created_at::date <= getdate()::date - 1 GROUP BY viewdate, state, source ORDER BY viewdate, state, source )
3. Finally, execute a join to connect two subqueries on that newly created ID:
SELECT av.sapdate AS date, lower(av.state) AS state, lower(av.source) AS source, av.act_v AS act_view, a.act::int AS act, FROM action_view av LEFT OUTER JOIN actions a ON av.idp = a.id ORDER BY date;
This example is just one of the workaround solutions for joining tables that are not related by a foreign key relationship, without applying UNION or CROSS joins. The described approach can help you connect as many tables as you need to get the right data. Additionally, there might be a scenario in which you create a VIEW using this method and then just query it wherever you need to pull the data.
Olga Berezovsky has been a Senior Data Analyst at Primitive Logic since 2016. She has extensive experience in the big data industry — specifically in data acquisition, transformation, and analysis — and deep expertise in building quantitative and qualitative user profile analysis that reveals user insights and behavior.