Reducing loading times tenfold with Postgres aggregations
Time to market is an important part of a developer's life. In the agile world, a team must strike a balance between speed and code quality. That does not necessarily mean the code will be bad or untested, but it sometimes limits exploration; and limited exploration means developers do things the way they are used to.
In my experience, SQL is taught in universities and courses in an outdated way, where many newer features are never explored and people simply are not aware of them. In 2013, Postgres released version 9.3, which introduced JSON aggregations, yet not many people know about this functionality.
In this post, you will see how it can be used to fetch data in 1-to-N relationships with a single trip to the database.
Enter the model
Let’s imagine the following model:
Populated with the following data:
Table person

| id | name |
|---|---|
| fbab22a3-f7bc-422c-942f-d184ad189e29 | Pablo |

Table animal

| id | owner_id | type | name |
|---|---|---|---|
| 5f2461a9-6c62-453e-9182-38cd0fa76e01 | fbab22a3-f7bc-422c-942f-d184ad189e29 | cat | Bills |

Table vaccine

| id | name |
|---|---|
| 0ee28253-5b94-4096-926c-d33ceb22d24d | Feline herpesvirus |
| 6cf5487a-ba06-42e7-9e80-226590752a77 | Rabies |

Table animal_vaccine

| vaccine_id | animal_id | applied_on |
|---|---|---|
| 0ee28253-5b94-4096-926c-d33ceb22d24d | 5f2461a9-6c62-453e-9182-38cd0fa76e01 | 2021-02-10 |
| 6cf5487a-ba06-42e7-9e80-226590752a77 | 5f2461a9-6c62-453e-9182-38cd0fa76e01 | 2022-02-10 |
Here you can find the SQL script for creating the database and tables with the data listed above.
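If you prefer not to open the script, here is a minimal sketch of an equivalent schema inferred from the sample data above; the column types and constraints are assumptions, and the linked script may differ:

```sql
-- Minimal schema sketch inferred from the sample data; the linked script
-- may differ in types and constraints.
CREATE TABLE person (
    id   uuid PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE animal (
    id       uuid PRIMARY KEY,
    owner_id uuid NOT NULL REFERENCES person (id),
    type     text NOT NULL,
    name     text NOT NULL
);

CREATE TABLE vaccine (
    id   uuid PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE animal_vaccine (
    vaccine_id uuid NOT NULL REFERENCES vaccine (id),
    animal_id  uuid NOT NULL REFERENCES animal (id),
    applied_on timestamp NOT NULL,
    PRIMARY KEY (vaccine_id, animal_id)
);
```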
Now, imagine that another service needs to obtain all the people named Pablo, together with their pets and the vaccines applied to them. Below you can find different approaches to fetching the data needed to build that response.
Approach 1: Many database trips
- A query will be executed to fetch all the vaccines from the vaccine table.
- A query will be executed to fetch all the people named Pablo.
- For each person returned, a query will be executed to fetch all the animals that the person owns.
- For each animal returned, a query will be executed to fetch all the vaccines applied to it.
This approach is pretty much the one you don't want. The code is simple to follow, and with some connection pooling you don't pay the overhead of creating a database connection for each query; however, the round trip is still there, and network latency applies. It also does not scale: fetching, for example, 15 people plays out as shown below.
A query to fetch vaccines:

```sql
SELECT id, name FROM vaccine
```

A query to fetch people:

```sql
SELECT id, name FROM person WHERE name = 'Pablo'
```

15 queries (one per person) to fetch the animals each person owns:

```sql
SELECT id, type, name FROM animal WHERE owner_id = $1
```

30 queries to fetch vaccine applications, assuming each person has 2 animals:

```sql
SELECT vaccine_id, applied_on FROM animal_vaccine WHERE animal_id = $1
```
That is a total of 47 trips to the database to fetch this information. We can do better than this.
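To make those round trips concrete, here is a rough Go sketch of this approach using database/sql; the struct and function names are illustrative assumptions, not code from the original service:

```go
package example

import (
	"database/sql"
	"time"
)

// Illustrative types for this sketch; the field names are assumptions.
type Vaccine struct {
	VaccineID string
	AppliedOn time.Time
}

type Animal struct {
	ID, Type, Name string
	Vaccines       []Vaccine
}

type Person struct {
	ID, Name string
	Animals  []Animal
}

// fetchPeopleNPlusOne issues one query for the people and then one more
// per person; the per-animal vaccine queries (elided at the end) follow
// the exact same pattern, which is how the 47 trips above add up.
func fetchPeopleNPlusOne(db *sql.DB) ([]Person, error) {
	rows, err := db.Query(`SELECT id, name FROM person WHERE name = 'Pablo'`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var people []Person
	for rows.Next() {
		var p Person
		if err := rows.Scan(&p.ID, &p.Name); err != nil {
			return nil, err
		}
		people = append(people, p)
	}

	for i := range people {
		// One extra round trip for every single person returned above.
		arows, err := db.Query(
			`SELECT id, type, name FROM animal WHERE owner_id = $1`, people[i].ID)
		if err != nil {
			return nil, err
		}
		for arows.Next() {
			var a Animal
			if err := arows.Scan(&a.ID, &a.Type, &a.Name); err != nil {
				arows.Close()
				return nil, err
			}
			people[i].Animals = append(people[i].Animals, a)
		}
		arows.Close()
		// ...and one more query per animal against animal_vaccine (elided).
	}
	return people, nil
}
```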
Approach 1.5: Fewer database trips
A query will be executed to fetch all the vaccines from the vaccine table:

```sql
SELECT id, name FROM vaccine
```

A query will be executed to fetch all the people named Pablo:

```sql
SELECT id, name FROM person WHERE name = 'Pablo'
```

A query will be executed to fetch all the animals whose owner_id is in the previously returned list:

```sql
SELECT owner_id, id, type, name FROM animal WHERE owner_id = ANY($1)
```

A query will be executed to fetch all the vaccines for all the animals returned:

```sql
SELECT animal_id, vaccine_id, applied_on FROM animal_vaccine WHERE animal_id = ANY($1)
```
This approach is a middle ground, and its main benefit is that the number of queries does not depend on the amount of data: returning 15 people still executes four queries.
On the other hand, the code complexity increases, as you need a more complex mapping. Note that in the last two queries the owner_id and animal_id columns also need to be retrieved: the list of animals is iterated to attach each animal to its corresponding owner, and the same happens with the vaccines, as sketched below.
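To make that mapping concrete, here is a minimal Go sketch of the stitching step, reusing the structs from the previous sketch; the row type is an assumption for illustration:

```go
// AnimalRow mirrors the third query's columns: owner_id is retrieved only
// so the animal can be attached to the right person.
type AnimalRow struct {
	OwnerID string
	Animal  Animal
}

// attachAnimals buckets the flat animal rows by owner_id, then assigns
// each bucket to its person; the vaccines are attached the same way,
// keyed by animal_id.
func attachAnimals(people []Person, rows []AnimalRow) {
	byOwner := make(map[string][]Animal, len(people))
	for _, r := range rows {
		byOwner[r.OwnerID] = append(byOwner[r.OwnerID], r.Animal)
	}
	for i := range people {
		people[i].Animals = byOwner[people[i].ID]
	}
}
```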
Approach 2: How an ORM does it
This is a single-query approach, but with one particularity: repeated data.
```sql
SELECT
  p.id,
  p.name,
  a.id AS animal_id,
  a.type,
  a.name,
  v.id AS vaccine_id,
  v.name,
  av.applied_on
FROM person p
LEFT JOIN animal a ON p.id = a.owner_id
LEFT JOIN animal_vaccine av ON av.animal_id = a.id
LEFT JOIN vaccine v ON av.vaccine_id = v.id
WHERE p.name = 'Pablo'
```
Executing the query above will print the following result:
| id | name | animal_id | type | name | vaccine_id | name | applied_on |
|---|---|---|---|---|---|---|---|
| fbab22a3-f7bc-422c-942f-d184ad189e29 | Pablo | 5f2461a9-6c62-453e-9182-38cd0fa76e01 | cat | Bills | 0ee28253-5b94-4096-926c-d33ceb22d24d | Feline herpesvirus | 2021-02-10 |
| fbab22a3-f7bc-422c-942f-d184ad189e29 | Pablo | 5f2461a9-6c62-453e-9182-38cd0fa76e01 | cat | Bills | 6cf5487a-ba06-42e7-9e80-226590752a77 | Rabies | 2022-02-10 |
Generally, the ORM does this mapping magic behind the scenes and the developer just gets the data aggregated correctly; but a developer mapping this data manually needs to consider that:
- A person can appear more than once.
- An animal can appear more than once.
Also, pay attention to the LEFT JOIN here. A person may not have an animal, and an animal may not have any vaccine; using INNER JOIN would filter that data out of the result set.
This approach shows a 10x improvement in comparison to approach #1, but keep in mind that when returning a lot of columns you get a lot of repeated data, which translates into more network traffic.
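When the mapping is done by hand, one way to deal with that repetition is to deduplicate by primary key while walking the rows. Below is a minimal Go sketch under assumed row and struct shapes; it uses pointer slices so vaccines can keep being appended to animals that are already attached, and pointer fields for the columns the LEFT JOINs can leave NULL:

```go
package example

import "time"

// joinedRow mirrors one row of the joined result set; field names are
// assumptions for this sketch. Animal and vaccine columns are pointers
// because the LEFT JOINs can yield NULLs.
type joinedRow struct {
	PersonID, PersonName             string
	AnimalID, AnimalType, AnimalName *string
	VaccineID, VaccineName           *string
	AppliedOn                        *time.Time
}

type vaccine struct {
	ID, Name  string
	AppliedOn time.Time
}

type animal struct {
	ID, Type, Name string
	Vaccines       []vaccine
}

type person struct {
	ID, Name string
	Animals  []*animal
}

// mapJoinedRows deduplicates people and animals by id while collecting
// every vaccine row.
func mapJoinedRows(rows []joinedRow) []*person {
	people := map[string]*person{}
	animals := map[string]*animal{}
	var out []*person

	for _, r := range rows {
		p := people[r.PersonID]
		if p == nil {
			p = &person{ID: r.PersonID, Name: r.PersonName}
			people[r.PersonID] = p
			out = append(out, p)
		}
		if r.AnimalID == nil {
			continue // person without animals, kept thanks to the LEFT JOIN
		}
		a := animals[*r.AnimalID]
		if a == nil {
			a = &animal{ID: *r.AnimalID, Type: *r.AnimalType, Name: *r.AnimalName}
			animals[*r.AnimalID] = a
			p.Animals = append(p.Animals, a)
		}
		if r.VaccineID != nil { // animal without vaccines, also a LEFT JOIN case
			a.Vaccines = append(a.Vaccines, vaccine{
				ID: *r.VaccineID, Name: *r.VaccineName, AppliedOn: *r.AppliedOn,
			})
		}
	}
	return out
}
```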
Approach 3: JSON aggregates
This approach takes the previous one (keeping the joins) but uses JSON arrays for each of the data points that have a 1-to-N relationship with person, in order to avoid repeated rows:
- A JSON array for animals (a person can have more than one).
- A JSON array for animal vaccines (an animal can have one or more vaccines).
- A JSON array for vaccines (a person can have N animals and each animal can have N vaccines).
```sql
SELECT
  p.id,
  p.name,
  json_agg(DISTINCT to_jsonb(a.*)) AS animals,
  json_agg(DISTINCT to_jsonb(av.*)) AS animal_vaccine,
  json_agg(DISTINCT to_jsonb(v.*)) AS vaccine
FROM person p
LEFT JOIN animal a ON p.id = a.owner_id
LEFT JOIN animal_vaccine av ON av.animal_id = a.id
LEFT JOIN vaccine v ON av.vaccine_id = v.id
WHERE p.name = 'Pablo'
GROUP BY p.id, p.name
```
Breaking it down:

```sql
json_agg(DISTINCT to_jsonb(a.*)) AS animals
```

This returns a JSON array [...], in which each row is rendered as a JSON object (using the column names as JSON properties), with DISTINCT applied to discard repeated results. As we saw in approach #2, an animal appears once for each vaccine it has received; the DISTINCT filters out that repeated data.
The GROUP BY p.id, p.name makes sure the person is returned only once as well.
| id | name | animals | animal_vaccine | vaccine |
|---|---|---|---|---|
| fbab22a3-f7bc-422c-942f-d184ad189e29 | Pablo | [{"id": "5f2461a9-6c62-453e-9182-38cd0fa76e01", "name": "Bills", "type": "cat", "owner_id": "fbab22a3-f7bc-422c-942f-d184ad189e29"}] | [{"animal_id": "5f2461a9-6c62-453e-9182-38cd0fa76e01", "applied_on": "2021-02-10T00:00:00", "vaccine_id": "0ee28253-5b94-4096-926c-d33ceb22d24d"}, {"animal_id": "5f2461a9-6c62-453e-9182-38cd0fa76e01", "applied_on": "2022-02-10T00:00:00", "vaccine_id": "6cf5487a-ba06-42e7-9e80-226590752a77"}] | [{"id": "0ee28253-5b94-4096-926c-d33ceb22d24d", "name": "Feline herpesvirus"}, {"id": "6cf5487a-ba06-42e7-9e80-226590752a77", "name": "Rabies"}] |
This surely looks messy at first, but compare it with approach #1.5. That approach relied on receiving lists of animals and vaccines, but they were single lists covering all the people returned; here, the lists of animals and vaccines belong to the person in the row being iterated.
For each person:
- Deserialize animals.
- Deserialize vaccines.
- Deserialize animal vaccines.
In Go, these arrays can be scanned as []byte, so each one can be used as input for deserializing into a struct with the corresponding fields, as sketched below.
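Here is a minimal sketch of that deserialization with database/sql and encoding/json; the struct shapes are assumptions that follow the column names visible in the result above. Note that applied_on is kept as a string, because a timestamp without a timezone offset ("2021-02-10T00:00:00") would not parse into a time.Time via the default RFC 3339 parsing:

```go
package example

import (
	"database/sql"
	"encoding/json"
)

// Structs mirroring the json objects produced by to_jsonb(...).
type animalDoc struct {
	ID      string `json:"id"`
	OwnerID string `json:"owner_id"`
	Type    string `json:"type"`
	Name    string `json:"name"`
}

type animalVaccineDoc struct {
	AnimalID  string `json:"animal_id"`
	VaccineID string `json:"vaccine_id"`
	AppliedOn string `json:"applied_on"` // e.g. "2021-02-10T00:00:00"
}

type vaccineDoc struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// scanAggregatedRow reads one row of the aggregate query: the three json
// columns arrive as []byte and are each unmarshalled in a single step.
func scanAggregatedRow(rows *sql.Rows) (id, name string,
	animals []animalDoc, avs []animalVaccineDoc, vaccines []vaccineDoc, err error) {

	var rawAnimals, rawAVs, rawVaccines []byte
	if err = rows.Scan(&id, &name, &rawAnimals, &rawAVs, &rawVaccines); err != nil {
		return
	}
	if err = json.Unmarshal(rawAnimals, &animals); err != nil {
		return
	}
	if err = json.Unmarshal(rawAVs, &avs); err != nil {
		return
	}
	err = json.Unmarshal(rawVaccines, &vaccines)
	return
}
```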
Migrating from approach #1 to approach #3 showed improvements between 10x and 15x.
Bonus approach: Doing the mapping in Postgres - but are you sure you want to do this?
In the approaches above, different ways to fetch data were described, but the mapping occurred somewhere in the code that executes these queries. Is it possible to do the required mapping in Postgres itself?
```sql
SELECT
  jsonb_build_object(
    'id', p.id,
    'name', p.name,
    'animals', (
      SELECT json_agg(jsonb_build_object(
        'id', a.id,
        'type', a.type,
        'name', a.name,
        'vaccines', (
          SELECT json_agg(jsonb_build_object(
            'id', v.id,
            'name', v.name,
            'applied_on', av.applied_on
          ))
          FROM animal_vaccine av
          INNER JOIN vaccine v ON av.vaccine_id = v.id
          WHERE av.animal_id = a.id
        )
      ))
      FROM animal a
      WHERE a.owner_id = p.id
    )
  )
FROM person p
WHERE p.name = 'Pablo'
```
And the answer is yes. The query above uses the function jsonb_build_object to create a JSON object with named properties (unlike to_jsonb, which just takes the * operator and uses the column names), and json_agg handles the 1-to-N relationships.
For each nested object, a subquery is executed to fetch the relevant data, which gets rid of the LEFT JOINs and also of the GROUP BY (and, with it, the sorting it can incur).
This returns:
```json
{
  "id": "fbab22a3-f7bc-422c-942f-d184ad189e29",
  "name": "Pablo",
  "animals": [
    {
      "id": "5f2461a9-6c62-453e-9182-38cd0fa76e01",
      "name": "Bills",
      "type": "cat",
      "vaccines": [
        {
          "id": "0ee28253-5b94-4096-926c-d33ceb22d24d",
          "name": "Feline herpesvirus",
          "applied_on": "2021-02-10T00:00:00"
        },
        {
          "id": "6cf5487a-ba06-42e7-9e80-226590752a77",
          "name": "Rabies",
          "applied_on": "2022-02-10T00:00:00"
        }
      ]
    }
  ]
}
```
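On the application side there is nothing left to map. A minimal Go sketch (assuming, for brevity, a single matching person): the one jsonb column is scanned as raw bytes and can be handed straight to the API response:

```go
package example

import (
	"database/sql"
	"encoding/json"
)

// fetchPersonDocument runs the jsonb_build_object query above and returns
// the already-shaped document, with no intermediate structs to maintain.
func fetchPersonDocument(db *sql.DB, query string) (json.RawMessage, error) {
	var doc []byte
	if err := db.QueryRow(query).Scan(&doc); err != nil {
		return nil, err
	}
	// The bytes can be written directly to the HTTP response.
	return json.RawMessage(doc), nil
}
```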
Postgres, and some other databases, have evolved a lot over time and are already very capable of dealing with JSON types. Under certain circumstances, a query like the one above can be very useful.
However, keep in mind that doing this can couple the API model tightly to the database queries: a change in how you want the response to look may change how you have to write the query.
Bottom line
- Postgres is very powerful; it is always useful to keep yourself up to date, at least by knowing what features exist, even if you don't know the specifics.
- Keep things simple and improve as the system requires it.
- Don't optimize prematurely just for exploration's sake, but don't be afraid to explore and think outside the box when needed.