SQL: Understand and INSERT statement’s performance consider

Abstract

INSERT statement is the one of most common statements used in SQL. But no more developers know how it works and its performance. Almost we use a framework instead of pure code and the framework does it. We just take care of the business logic of the project. You have a client-server project where clients send their data to save on the server. However, on a nice day, your project has a problem. Almost all requests become slow, sometimes down. You take a long time to figure out that your server is writing bulk data into a database. It's the main reason that the CPU's RDS becomes high, processing becomes slow, and your server can not receive response from RDS and down. Although, it worked perfectly before. So, what happened with your code? Yes, You have a problem in writing actions into the database. And now, this is time you review your knowledge of SQL’s INSERT statement. Let’s jump into it.

As we know, an INSERT statement will insert one or more rows of data into a table. We can specify which column and what data column will be inserted into your table. This statement is considered a SQL data manipulation command. 

In a basic INSERT statement, you must specify the name of the target table and the data values you want to insert into that table. When necessary, you should qualify that table name with the name of the server, database, or schema.

INSERT INTO Users  VALUES (1, 'ThanhLD', 'thanhld@vitalify.asia');


INSERT INTO Users  VALUES 

(1, 'ThanhLD', 'thanhld@vitalify.asia'),

(2, 'ThanhLD2', 'thanhld2@vitalify.asia'),

(3, 'ThanhLD3', 'thanhld3@vitalify.asia'),

(4, 'ThanhLD', 'thanhld4@vitalify.asia');

With support from a framework or DB management system, the developer just writes an INSERT statement and executes it. But one INSERT statement  will processed in 6 phases in the background:

1.Open connection(3)

2.Sending query to server(2)

3.Parsing query (2)

4.Inserting row(1 × size of row)

5.Inserting indexes(1 × number of indexes)

6.Close connection (1)

The number after phase is the time required for inserting a row is determined by the following factors. The time of (1),(2),(3),(6) alway be fixed. But the time of (4), (5) will be increased by the number of your records. So, when INSERT statement performance becomes slow, we need to focus on (4), (5). And I will explain (4) in Section 2 below.

2. Why’s your INSERT become slow?


The INSERT statement’s performance is dependent on many factors such as: which RDS(MySql, Postgres, MariaDB..), which database’s engine(MyISAM is faster than InnoDB), how many database relationships or indexes, and how many inserted records. If all the things above were fixed after the project was running except the number of records will increase in the future. And the day you want to insert 10,000 records and performance is so slow. Yes, this is the time you need to review your INSERT statement. 

Almost all developers work with some framework. So, we just use built-in functions by framework. We call functions and pass the parameters and the framework executes it. How many ways to insert data do you know? And how's the performance of these ways? In this Section, I will explain about it.

2.1 Single INSERT

This is a single INSERT statement:


INSERT INTO Users  VALUES (1, 'ThanhLD', 'thanhld@vitalify.asia');

If you have 10,000 records, you will have 10,000 statements like this. You need to execute 10,000 times to insert. This way will be in a loop of 6 phases as I mentioned above. And I guess you were imagining how performance works when you have bulk data. In this way, if you do not handle pool connections well, it will be a nightmare with your RDS.

2.2 Batch/Bulk INSERT

The Batch/Bulk INSERT is based on the single INSERT. But the different thing is we combine many records in one INSERT statement. It means we can reduce the time of (1),(2),(3),(6). The most consumer time is (4),(5). With one record, it’s nothing. With bulk data, this is an obvious difference. The INSERT statement will be like this:


INSERT INTO Users  VALUES 

(1, 'ThanhLD', 'thanhld@vitalify.asia'),

(2, 'ThanhLD2', 'thanhld2@vitalify.asia'),

(3, 'ThanhLD3', 'thanhld3@vitalify.asia'),

(4, 'ThanhLD', 'thanhld4@vitalify.asia');

If you are inserting many rows from the same client at the same time, use INSERT statements with multiple values lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate Single Insert statements. If you are adding data to a nonempty table, you can tune the bulk_insert_buffer_size variable to make data insertion even faster.

2.3 Load data

Load data in a file, I think no more normal developers worked with it. However, data engineers are very familiar. Load data in a file is data that was saved in a file extension in CSV, txt…We use the statement provided by SQL to fill the table with the best performance. The file’s format may match with table schema or not. We also can customize which data or column we want to insert.


LOAD DATA INFILE '/users.csv' INTO TABLE Users 

FIELDS TERMINATED BY ',' 

LINES TERMINATED BY '\r\n'

IGNORE 1 LINES;


When loading a table from a text file, use Load Data. This is usually 20 times faster than using INSERT statements. And of course, you need to create a data file before executing this statement. The implementation time is more but less than the execution time.

3. Measurement


To measure the INSERT statement’s performance, I will expose the time consumption in the chart below. You can see which is the best.

The chart above has obvious differences between Single Insert vs. Batch/Bulk Insert and Load Data. The number of records increased, and the time consuming was increased. The result is similar between Batch/Bulk Insert and Load Data. So, we jump to the next chart to see it.

In case 1, 1.000, 10.000, 100.000, we also see the difference. The time it takes is the same. But, the difference becomes obvious when the number of records is 1.000.000. The Load Data is sped up to 2 times to the Batch/Bulk Insert.

4. Conclusions


We can see that, in case 1 record, no more difference in performance. But when the number of records increases, the power of Load Data Insert is the best way. If you have only 1 record to insert, the Single Insert is the best choice by easing. When the number of records is so big, the Batch/Bulk Insert or Load Data is the way you need to consider although the implementation may be more complex.

If you want to run by yourself. This GitHub repository may be useful(the repository is also available in Section 5)