# Optimizing code, using sorting as an example

A handful of fundamental algorithms underlie almost every line of code written in high-level languages. It is good to keep Donald Knuth's classic multivolume work "The Art of Computer Programming" at hand, where many of these basic algorithms are analyzed in detail. But reading and absorbing all of it takes considerable effort and time, and that effort needs some motivation.

Many assume that such nuances mattered 50 years ago, while today you can use ready-made packages and functions and not dive into the details. That is not the case. Nor has anyone canceled the importance of understanding how data is laid out in memory and processed by the CPU.

Below we look at these nuances using sorting functions as an example. Sorting and searching are the most frequent operations in any data manipulation, and saving a few milliseconds per operation can translate into whole compute-hours saved on sizable data.

## Introduction

Let us state a simplified version of the problem, drawn from practical work with time series. We examine a dataset containing a number of use cases, within each of which certain sequences of events occur.

Suppose that for algorithmic processing it is necessary to prepare a test dataset containing three columns:

- `case_id` – unique case / use-case identifier;
- `record` – a log record of an event in a case;
- `start` – registration time.

Libraries used:

```r
library(tidyverse)
library(data.table)
library(rTRNG)
```

The most interesting task is the generation of timestamps: events must follow one another in time within each case. First, let's prepare a simple template. For the demonstration we take a small number of cases; in production there can be `10^5` to `10^n` of them, depending on the task.
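A hypothetical way to build such a template (the record names and the count of 100 cases are assumptions for the demo; the article does not show this code):

```r
library(tidyverse)

# Assumed event template; every case gets the same sequence of records
recs <- c("first", "one and a half", "second", "third",
          "fourth", "fifth", "sixth")

# 100 demo cases, one row per (case, record) combination
df <- expand_grid(case_id = 1:100, recs = recs)
```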

```
# A tibble: 10 x 2
   case_id recs
     <int> <chr>
 1       1 first
 2       1 one and a half
 3       1 second
 4       1 third
 5       1 fourth
 6       1 fifth
 7       1 sixth
 8       2 first
 9       2 one and a half
10       2 second
```

Now for the interesting block: generating the timestamps. For simplicity we reduce the problem to distributing fractions in the interval `[0; 1]` within each case; converting them to real unix timestamps is left out as uninteresting. Options with explicit loops are also out of scope. Execution times are given for one arbitrary computer; what matters is that everything runs on the same machine.

## Create one timestamp

### Option 1. Straight

This is the option offered in most cases, and indeed everything about it is simple and clear.
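A sketch of what such a straightforward version might look like (the function name `f1` matches the summary table below; drawing `n()` uniform fractions per group and sorting them is an assumption about the approach):

```r
library(tidyverse)

# Naive version: group by case, draw uniform fractions, sort them in place
f1 <- function(df) {
  df %>%
    group_by(case_id) %>%
    mutate(start = sort(runif(n()))) %>%
    ungroup()
}
```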

We get the following conditional figures. Not bad, probably. But keep in mind that there are only 100 cases.

```
  median `itr/sec` mem_alloc
 15.38ms      63.2   284.9KB
```

Let's think: what can be improved?

### Option 1 + 1/2. Straight, with a fast number generator

There is a good library, `rTRNG`. On large volumes it provides significant acceleration, partly thanks to parallel generation. Let's just check:
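The change against the naive version might be as small as swapping the generator for `rTRNG::runif_trng()`, the package's replacement for `runif()` (a sketch, not the article's exact code):

```r
library(tidyverse)
library(rTRNG)

# Same grouping, but draws come from the parallel-friendly TRNG generator
f1_5 <- function(df) {
  df %>%
    group_by(case_id) %>%
    mutate(start = sort(runif_trng(n()))) %>%
    ungroup()
}
```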

```
  median `itr/sec` mem_alloc
 29.34ms      29.5   284.9KB
```

We got no gain on small volumes. Is that all? Of course not. We know that `tidyverse` is slower than `data.table`, so let's try that.

### Option 2. One-pass, through data.table indexes

Here we apply the first trick: sort the vector of times by indices, and then reassign it.
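One way to sketch this with `data.table` (draw all the times up front, then reassign the index-sorted values within each case; `f2` matching the summary table is an assumption):

```r
library(data.table)

# One vector of draws, then per-case reassignment via an ordering index
f2 <- function(dt) {
  dt[, start := runif(.N)]
  dt[, start := start[order(start)], by = case_id]
  dt[]
}
```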

It works out quite well: a 15 to 20 times speedup, with almost three times less memory required.

```
  median `itr/sec` mem_alloc
  1.69ms      554.     109KB
```

Do we stop here? And why would we?

### Option 3. Single-pass through a composite index

In fact, as soon as we fall into a loop, explicitly or via `by`, performance plummets. Let's try to do everything in one pass. The idea is to build a composite index that lets us sort all events in one fell swoop. We use a trick: since all timestamps inside a case lie in the range `[0; 1]`, we can split the index into two parts. The integer part holds the `case_id`, the fractional part holds the time fraction. Sorting such an index once preserves each row's membership in its `case_id` while simultaneously sorting the values inside every case.
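A sketch of the composite-index idea, assuming the rows are already ordered by `case_id` (as in the template above):

```r
library(data.table)

f3 <- function(dt) {
  # integer part = case_id, fractional part = random time fraction
  idx <- dt$case_id + runif(nrow(dt))
  # one global sort groups rows by case AND orders times within each case;
  # %% 1 then strips the integer part, leaving the time fractions
  dt[, start := sort(idx, method = "radix") %% 1]
  dt[]
}
```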

We launch it and get a further 2× gain over the previous option, both in time and in memory.

```
  median `itr/sec` mem_alloc
 826.7us     1013.    54.3KB
```

### Option 3 + 1/2. Single-pass, through the composite index, use `set`

Do we stop here? We could, although there is still room to squeeze. The fact is that at such short execution times the overhead of NSE (non-standard evaluation) becomes quite significant. Using the direct functions yields much better results.
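The same composite index, but assigning the column with `data.table::set()`, which writes by reference and bypasses the NSE machinery of `[.data.table` (a sketch under the same ordering assumption):

```r
library(data.table)

f3_5 <- function(dt) {
  idx <- dt$case_id + runif(nrow(dt))
  # set() assigns the column by reference, with no NSE overhead
  set(dt, j = "start", value = sort(idx, method = "radix") %% 1)
  dt
}
```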

Another 5× speedup, with 4 times less memory consumed.

```
  median `itr/sec` mem_alloc
 161.5us     5519.    16.3KB
```

### Interim summary

Putting it all together

```
  expression      min   median `itr/sec` mem_alloc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 f1(df)       14.3ms  15.38ms      63.2   284.9KB
2 f1_5(df)    24.43ms  29.34ms      29.5   284.9KB
3 f2(dt)       1.55ms   1.69ms      554.     109KB
4 f3(dt)        722us  826.7us     1013.    54.3KB
5 f3_5(dt)    142.5us  161.5us     5519.    16.3KB
```

By studying the principles behind the algorithms and taking a close look at the problem, the naive initial version was improved 90-fold in performance for free, and memory consumption was reduced 18-fold. These are figures that are hard to reach by horizontally scaling servers.

## Timestamping the start and end of a record

Let's complicate the task a little. Now for each record we must generate both a start time and an end time (think of a doctor's appointment). All records must still follow the original sequence in time, and the end time of each record must be no earlier than its start time.

We will reduce the number of scenarios considered, keeping only the most interesting ones, and we won't even mention options with explicit loops.

### Option 1. Straight

The first most reasonable and obvious option is to create the start marks as in the previous task, and then pick a random ending for each mark so that the rules hold. We iterate line by line; one subtlety is that the fractional start of the next case will, as a rule, be smaller than the start of the last operation of the previous case. But that is solvable too.
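The row-by-row logic might look roughly like this (a hypothetical reconstruction: `lead()` supplies the next start within a case as the upper bound, with 1 as the fallback at the case boundary, and `rowwise()` does the per-line iteration):

```r
library(tidyverse)

f1 <- function(df) {
  df %>%
    group_by(case_id) %>%
    mutate(start = sort(runif(n())),
           # upper bound for each end: the next start in the case, or 1
           upper = lead(start, default = 1)) %>%
    rowwise() %>%
    mutate(fin = runif(1, start, upper)) %>%  # one draw per line
    ungroup() %>%
    select(-upper)
}
```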

Overall, under a second, but obviously VERY far from optimal.

```
  median `itr/sec` mem_alloc
 28.16ms      30.7    2.06MB
```

### Option 2. Single-pass, through composite index and matrices

The idea is as follows. We need two columns whose values are sorted row by row, with the value in the first column never exceeding the value in the second column in any row. But that is a matrix! A matrix in R is a single contiguous chunk of memory accompanied by dimension attributes, and working with it is much faster than with classic data frames.
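A sketch of this approach: draw two composite keys per row, sort all `2n` of them in one pass, and fold the result into an n × 2 matrix so that consecutive pairs become `(start, fin)` for each row (again assuming rows are ordered by `case_id`):

```r
library(data.table)

f2 <- function(dt) {
  n <- nrow(dt)
  # two composite keys per row; after one global sort, consecutive pairs
  # satisfy start <= fin <= next row's start within each case
  keys <- rep(dt$case_id, each = 2) + runif(2 * n)
  m <- matrix(sort(keys, method = "radix") %% 1, ncol = 2, byrow = TRUE)
  dt[, c("start", "fin") := list(m[, 1], m[, 2])]
  dt[]
}
```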

With little effort, time and memory were both reduced almost 30-fold! And the code has become simpler and more straightforward.

```
  median `itr/sec` mem_alloc
  1.04ms      733.   74.38KB
```

### Option 2+1/2. Single-pass, through the composite index, matrices and `set`

Perfectionism in action: another 4× speedup.
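The matrix variant with `set()` instead of `:=` might be sketched as follows (same assumptions as above):

```r
library(data.table)

f2_5 <- function(dt) {
  n <- nrow(dt)
  keys <- rep(dt$case_id, each = 2) + runif(2 * n)
  m <- matrix(sort(keys, method = "radix") %% 1, ncol = 2, byrow = TRUE)
  # direct set() calls avoid the NSE overhead of `:=`
  set(dt, j = "start", value = m[, 1])
  set(dt, j = "fin", value = m[, 2])
  dt
}
```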

```
  median `itr/sec` mem_alloc
 278.1us     2781.   57.55KB
```

### Interim summary

Putting it all together.

```
  median `itr/sec` mem_alloc
 28.16ms      30.7    2.06MB
  1.04ms      733.   74.38KB
 278.1us     2781.   57.55KB
```

Here too the naive initial version was improved 90-fold in performance for free, and memory consumption was reduced 35-fold.

## Conclusion

The examples are quite real; anyone can experiment on their own machine. The absolute figures will naturally differ, but the trend will hold. To convince yourself that a great deal of code in the world is written in the most naive way, it is enough to scratch the surface of the repositories on GitHub. And serious optimization does not always mean huge incomprehensible code and lots of time: the optimized code here is actually more compact and transparent, with intermediate passes eliminated.

So when BigData practitioners request considerable computing resources and a lot of time for a relatively small task, there is always a nagging suspicion that something is wrong under the hood.
