Written by Ramón Saquete
I have already written about images and WPO, and about how two of the main factors affecting overall download time are front-end images and back-end database access. Let's now talk about database optimisation.
Database access optimisation is vital for page download time, because it is usually the main factor in how long the browser has to wait before receiving the HTML. This waiting time is very important, because the remaining page resources (images, scripts and style sheets) don't start downloading until the browser has read the HTML that references them.
This is probably the most complex type of optimisation of all. First, because it depends on two factors that vary over time: on the one hand, what kind of queries are going to be made and how, and on the other, the workload the server or servers will have to handle. Second, because of the sheer amount of expertise you need: rewriting queries and the code that executes them, creating indices, materialised views, horizontal and vertical partitions, replicas and support tables, choosing adequate data types, optimising the schema without disregarding the logic of the business model, adjusting DBMS configuration parameters, and being familiar with external cache systems and knowing how to use them. Moreover, each of these points – on which many books have been written – has to be tackled differently depending on which DBMS we're dealing with (MySQL, SQL Server, Oracle…) and even on its specific version.
If we talk about non-relational or NoSQL DBMSs, there is a great variety of data models: key/value, graph-oriented, column-oriented, multidimensional, object-oriented… However, the data model that did the most to popularise the NoSQL term is the document-oriented database, with key/value pairs where the value is usually a hierarchical structure. Many developers dive into the unknown with these newer DBMSs (MongoDB, Google BigTable, Apache Cassandra…) thinking of the speed they will gain, without stopping to consider that this speed is achieved by not implementing ACID properties or JOIN operations. ACID is what keeps us from ending up with corrupt and useless data, and choosing a NoSQL DBMS assumes JOIN operations will never be needed.
What happens in these cases is that the client's requirements change, whether during development or maintenance, and the database schema has to be modified. At that point the NoSQL DBMS stops fitting the business model, and the developer is the one who, ultimately, ends up implementing in code what the relational DBMS would have done. I'm not saying that with a relational DBMS you will never need code to adjust the logical model of the database to the conceptual business model; what I'm saying is that it will require less effort than with a NoSQL DBMS.
Where NoSQL DBMSs really stand out (especially key/value ones) is as cache systems, whether memory-only like memcached or hybrid like Redis, which makes a nice complement to any relational DBMS that has to serve many reads.
Now that I’ve given you a small sketch of how the whole picture looks, let’s briefly see a few interesting tricks for dealing with common situations we’re going to encounter in web applications.
I am going to use the standard nomenclature in my explanations. Keep in mind that when I say "relation", I am referring to what many incorrectly call a "table" in relational DBMSs.
Tree-structure quick queries
Tree data structures are very common in web environments. We can see them, for example, in breadcrumbs, navigation menus with nested submenus, and comment threads. Usually, when we implement one in a relational database, we create a reflexive foreign key on a parent column that points to the relation's primary key. Example:
The problem is that, in order to get all of a node's ancestors or all of its descendants, we have to make several queries, or several JOIN operations of the relation with itself, one for each level of the tree, which is highly inefficient.
There are many possible solutions to this problem, but the one I like most is to store every ancestor–descendant pair in a new relation, also reflexive, whose composite primary key is formed by the ancestor node and the descendant node. Both columns, in turn, are foreign keys to the original relation's primary key. An additional column stores the distance between the two nodes. Here's an example showing the same information as in the previous table:
This way, with a single JOIN against the original relation we can obtain all the ancestors or descendants of a node and show them on a website.
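The technique above is often called a closure table. Here is a minimal sketch of it using Python's sqlite3 module (the article's examples are plain SQL; the schema and names `Node`/`NodePath` are illustrative, not from the original post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Node (id INTEGER PRIMARY KEY, name TEXT);
-- One row per (ancestor, descendant) pair, including each node
-- paired with itself at distance 0.
CREATE TABLE NodePath (
    ancestor   INTEGER NOT NULL REFERENCES Node(id),
    descendant INTEGER NOT NULL REFERENCES Node(id),
    distance   INTEGER NOT NULL,
    PRIMARY KEY (ancestor, descendant)
);
""")

def add_node(node_id, name, parent_id=None):
    """Insert a node and copy all of its parent's ancestor paths."""
    conn.execute("INSERT INTO Node VALUES (?, ?)", (node_id, name))
    conn.execute("INSERT INTO NodePath VALUES (?, ?, 0)", (node_id, node_id))
    if parent_id is not None:
        # Every ancestor of the parent is also an ancestor of the new node.
        conn.execute("""
            INSERT INTO NodePath (ancestor, descendant, distance)
            SELECT ancestor, ?, distance + 1
            FROM NodePath WHERE descendant = ?
        """, (node_id, parent_id))

add_node(1, "root")
add_node(2, "child", parent_id=1)
add_node(3, "grandchild", parent_id=2)

# All ancestors of node 3 with a single JOIN, ordered from the root down,
# exactly what a breadcrumb needs.
ancestors = conn.execute("""
    SELECT n.name FROM Node n
    JOIN NodePath p ON p.ancestor = n.id
    WHERE p.descendant = 3 AND p.distance > 0
    ORDER BY p.distance DESC
""").fetchall()
print([row[0] for row in ancestors])  # → ['root', 'child']
```

The trade-off is extra writes and storage: inserting a node copies all of its parent's ancestor rows, in exchange for reads that never depend on the depth of the tree.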
Efficient pagination
Pagination can be an arduous task when we have to display page numbers, because that requires counting the rows in the relation, which may mean a full read, depending on the DBMS. It is always more efficient to display only next and previous page links, retrieving the page's rows plus one additional row: the next-page link is shown only if that additional row comes back from the database. Alternatively, we can maintain the row count in a separate relation.
Another thing that affects pagination is that, when we access an internal page, the rows belonging to the previous pages are read from disk, with all the columns appearing in the query, only to be discarded. A good way to avoid this is to JOIN with a subquery that retrieves only the identifiers we need before reading the rest of the information. Example:
SELECT item1, item2, item3 FROM Foo INNER JOIN (SELECT id FROM Foo ORDER BY item1 LIMIT 50,10) AS FooPaginated USING(id);
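The "fetch one extra row" trick for the next-page link can be sketched like this with Python's sqlite3 (table and column names are illustrative): we ask for `page_size + 1` rows and use the extra row only to decide whether a next link is needed, never counting the whole relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (id INTEGER PRIMARY KEY, item1 TEXT)")
conn.executemany("INSERT INTO Foo (item1) VALUES (?)",
                 [(f"row {i}",) for i in range(25)])

def fetch_page(page, page_size=10):
    offset = page * page_size
    # Ask for one row more than we will display.
    rows = conn.execute(
        "SELECT id, item1 FROM Foo ORDER BY item1 LIMIT ? OFFSET ?",
        (page_size + 1, offset)).fetchall()
    # The extra row, if present, means there is a next page.
    has_next = len(rows) > page_size
    return rows[:page_size], has_next

rows, has_next = fetch_page(0)   # 10 rows, has_next is True
rows, has_next = fetch_page(2)   # last 5 of 25 rows, has_next is False
```

This avoids the `COUNT(*)` that page-number pagination needs, at the cost of only one extra row per request.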
Displaying random rows
It is pretty common to display random results from a relation to give the content a fresh feeling. The problem is that this is usually implemented in rather inefficient ways, namely by ordering by random values, which can cause every row of the table to be read. In MySQL, if we have auto-increment identifiers, the best thing to do is to cross them with a random identifier. The following example shows how to retrieve a random row from a Foo relation:
MySQL: SELECT name FROM Foo AS f1 JOIN (SELECT (RAND() * (SELECT MAX(id) FROM Foo)) AS id) AS f2 WHERE f1.id >= f2.id ORDER BY f1.id ASC LIMIT 1
Other DBMSs have proprietary solutions:
SQL Server: SELECT * FROM Foo TABLESAMPLE (1 ROWS);
Oracle: SELECT * FROM (SELECT * FROM Foo SAMPLE (1) ORDER BY dbms_random.value) WHERE ROWNUM=1
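The same id-crossing idea can be sketched in Python with sqlite3, which has no TABLESAMPLE: pick a random value up to MAX(id) in application code and take the first row at or above it, instead of `ORDER BY RANDOM()`, which sorts the whole table (names are illustrative):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO Foo (name) VALUES (?)",
                 [(f"name {i}",) for i in range(100)])

max_id = conn.execute("SELECT MAX(id) FROM Foo").fetchone()[0]
target = random.randint(1, max_id)
# One seek on the primary-key index instead of a full scan plus sort.
row = conn.execute(
    "SELECT name FROM Foo WHERE id >= ? ORDER BY id LIMIT 1",
    (target,)).fetchone()
```

Note the known caveat of this technique: if the identifier sequence has gaps (deleted rows), the rows right after a gap are more likely to be picked, so it trades perfect uniformity for speed.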
Caching frequent queries
Activating the DBMS cache can sometimes worsen performance, depending on the volume of data that is usually accessed. If the volume of data is large and frequently updated, we will always be writing to the cache instead of reading from it. In this scenario it is preferable to use an external cache system, in which we can control that the newest and most frequently requested data really gets cached. If the cache system doesn't provide this functionality, we can implement it ourselves. A good way to do this, while avoiding the extra cost of storing how many times each query is requested, is to generate a random number between 1 and 100 and cache the query only if the number is 10 or lower. This way the query is stored in the cache with a 10% probability on each request, so the more often a query is requested, the more likely it is to end up cached.
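The probabilistic admission policy just described can be sketched in a few lines of Python (the dict stands in for memcached or Redis, and `run_query` is a hypothetical placeholder for the real database call):

```python
import random

cache = {}  # stand-in for an external cache such as memcached or Redis

def run_query(sql):
    # Placeholder for the real, expensive database call.
    return f"result of {sql}"

def cached_query(sql, admit_probability=0.1):
    if sql in cache:
        return cache[sql]
    result = run_query(sql)
    # Admit the result with 10% probability per request, so frequent
    # queries get cached without tracking per-query hit counts.
    if random.random() < admit_probability:
        cache[sql] = result
    return result

# A query requested a hundred times is almost certain to get cached
# along the way; a one-off query will almost certainly not be.
for _ in range(100):
    cached_query("SELECT * FROM Foo")
```

In a real deployment the cache entry would also carry a TTL so that updates to the underlying data eventually become visible.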
Here are a few small recommendations to finish this post:
- Replace OR with IN when you want to compare a column against more than one value.
- Minimise the cost of JOINs: the natural join, or JOIN, is the costliest operation for relational databases, because it requires a Cartesian product followed by a selection of values. Some techniques we can use to minimise its effect:
- Reorder the joins so that the relations with fewer rows are joined first, reducing the number of intersections.
- Create subqueries that filter or limit the number of rows of large relations before carrying out later JOINs.
- Sometimes it's best to split one query into several: a first query retrieves a few identifiers, which are then passed with an IN to the next query, instead of carrying out a JOIN.
- Switch a JOIN to EXISTS when no data from the joined relation is going to be displayed.
- Keep the N+1 problem in mind: it usually appears when we have a list which, to be displayed the way we want, needs an additional query for each item (the +1 in N+1). In this case it is usually better to add one or two JOINs to the query that retrieves the list of items; this way we obtain the list just as we need it, with no additional query per item.
- Always specify the column names in the SELECT; otherwise the DBMS will read every column from disk. The asterisk should only be used with COUNT, in which case the DBMS knows it doesn't have to read all the columns.
- Create indices: indices allow much faster non-sequential access to the data, but they are costly to maintain, so they are not advisable unless there are more reads than writes. We must analyse the query execution plan (each DBMS has its own way of showing it) to know where indices are needed. Usually we create indices on foreign keys and on columns that will be used in ORDER BY or WHERE clauses. In composite indices, the columns must be placed in the same order in which they are going to be used in the queries.
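The N+1 advice above can be made concrete with sqlite3 (the Post/Author schema is illustrative): the slow version issues one query per list item, the fast version gets the same result with a single JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Post (id INTEGER PRIMARY KEY, title TEXT,
                   author_id INTEGER REFERENCES Author(id));
INSERT INTO Author VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO Post VALUES (1, 'First', 1), (2, 'Second', 2), (3, 'Third', 1);
""")

# N+1 version: 1 query for the list, then 1 query per row. Avoid this.
posts = conn.execute(
    "SELECT id, title, author_id FROM Post ORDER BY id").fetchall()
slow = [(title,
         conn.execute("SELECT name FROM Author WHERE id = ?",
                      (author_id,)).fetchone()[0])
        for _, title, author_id in posts]

# JOIN version: the same result in a single query.
fast = conn.execute("""
    SELECT p.title, a.name FROM Post p
    JOIN Author a ON a.id = p.author_id
    ORDER BY p.id
""").fetchall()
```

With 3 posts the difference is invisible; with a list of hundreds of items the N+1 version turns one round trip into hundreds.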
That is all, I hope you enjoyed this post.