Optimizing the code using the sort function as an example

A number of algorithms can be distinguished that are basic and underlie almost every line of programs written in high-level languages. It is good to have at hand the classic multivolume work of Donald Knuth “The Art of Computer Programming”, where many basic algorithms are analyzed in detail. But reading and assimilating everything is a task that requires a lot of effort and time, which must be somehow motivated.

Many may assume that it was necessary to know the nuances 50 years ago, but now you can use ready-made packages and functions and not dive into details. However, this is not the case. Likewise, no one has canceled the importance of understanding the representation of methods for storing data in memory and processing it in the processor.

Next, let’s look at the nuances using the example of sorting functions. Sorting and searching are used most often in all data manipulation. Saving a few milliseconds on operations can lead to an overall reduction in computing hours on significant data.

Introduction

Let us formulate a simplified formulation of the problem, based on practical problems for working with time series. A set of data is examined containing a number of use cases, within which certain sequences of events occur.

Suppose that for algorithmic processing it is necessary to prepare a test dataset containing three columns:

case_id – unique case / use case identifier;
record – a log record of an event in a case;
start – registration time.

Libraries used

library(tidyverse)
library(data.table)
library(rTRNG)

library(tidyverse)

library(data.table)

library(rTRNG)

The most interesting task is the generation of timestamps. Events should go sequentially in time for each case separately. First, let’s prepare a simple template. In a particular case, we will take a small number of cases for demonstration. There can be 10^5-10^n in production, which is determined by the tasks.

# A tibble: 10 x 2
     case_id recs          
       &lt;int&gt; &lt;chr&gt;         
   1       1 first         
   2       1 one and a half
   3       1 second        
   4       1 third         
   5       1 fourth        
   6       1 fifth         
   7       1 sixth         
   8       2 first         
   9       2 one and a half
  10       2 second

# A tibble: 10 x 2

case_id recs

1 1 first

2 1 one and a half

3 1 second

4 1 third

5 1 fourth

6 1 fifth

7 1 sixth

8 2 first

9 2 one and a half

10 2 second

Now let’s get down to an interesting block – generating timestamps. For simplicity, the problem will be reduced to the distribution of shares in the interval [0; 1] within each case. Let’s leave the translation to the real unixtimestamp outside, it’s not interesting. Explicit looping options are also out of bounds. The execution times are given on a conditional computer, the main thing is that everything is performed on one.

Create one timestamp

Option 1. Straight

This option is offered in most cases. And that, everything is simple and clear.

We get such conditional indicators. Probably not bad. But do not forget that there are only 100 cases.

median `itr/sec` mem_alloc
 15.38ms      63.2   284.9KB

1 2	median `itr/sec` mem_alloc 15.38ms 63.2 284.9KB

Let’s think about what can be improved?

Option 1 + 1/2. Straight line + fast number generator

There is a good rTRNG library. At large volumes, it provides a significant acceleration, including due to parallel generation. Let’s just check:

 median `itr/sec` mem_alloc
 29.34ms      29.5   284.9KB

1 2	median `itr/sec` mem_alloc 29.34ms 29.5 284.9KB

We didn’t get any gain on small volumes. It’s all? Of course not. We know tidyverse is slower than data.table, so let’s try it.

Option 2. One-pass, through data.table indexes

Here we will try to apply the first trick – sort the vector of times by indices, and then reassign it.

It turns out quite well, the acceleration is 15-20 times and the memory is required almost three times less.

median `itr/sec` mem_alloc 
  1.69ms     554.      109KB

1 2	median `itr/sec` mem_alloc 1.69ms 554. 109KB

Are we stopping? Why yes?

Option 3. Single-pass through a composite index

In fact, as soon as we fall into a loop, either explicitly or via by, we plummet in performance. Let’s try to do everything in one pass. The idea is to make a composite index that would allow us to sort all events in one fell swoop. We use a trick. Since we have all the timestamps inside the case in the range [0; 1], then we can split the index into two parts. The integer part will contain the case_id, the fractional part will contain the time fraction. Sorting one such index once will preserve the belonging of the case_id lines, while in one fell swoop we will sort the values inside each case

We launch and get a gain 2 times more against the previous option, both in time and in memory.

  median `itr/sec` mem_alloc
 826.7us    1013.     54.3KB

1 2	median `itr/sec` mem_alloc 826.7us 1013. 54.3KB

Option 3 + 1/2. Single-pass, through the composite index, use `set`

Are we stopping? It is possible to stop, although there is still a field for compression. The fact is that with such short execution times, the overhead on the NSE becomes quite significant. If you use direct functions, you can get much better results.

Acceleration by another 5 times, memory consumption is 4 times less

  median `itr/sec` mem_alloc
 161.5us    5519.     16.3KB

1 2	median `itr/sec` mem_alloc 161.5us 5519. 16.3KB

Interim summarization

Putting it all together

 expression      min   median `itr/sec` mem_alloc
  &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt;     &lt;dbl&gt; &lt;bch:byt&gt;
1 f1(df)       14.3ms  15.38ms      63.2   284.9KB
2 f1_5(df)    24.43ms  29.34ms      29.5   284.9KB
3 f2(dt)       1.55ms   1.69ms     554.      109KB
4 f3(dt)        722us  826.7us    1013.     54.3KB
5 f3_5(dt)    142.5us  161.5us    5519.     16.3KB

expression min median `itr/sec` mem_alloc

<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>

1 f1(df) 14.3ms 15.38ms 63.2 284.9KB

2 f1_5(df) 24.43ms 29.34ms 29.5 284.9KB

3 f2(dt) 1.55ms 1.69ms 554. 109KB

4 f3(dt) 722us 826.7us 1013. 54.3KB

5 f3_5(dt) 142.5us 161.5us 5519. 16.3KB

By studying the principles of the algorithms and a close look at the problem, the primary naive version was improved by 90 times in performance for free, and the memory consumption was reduced by 18 times. These are indicators that are poorly achievable by horizontal scaling of servers.

Timestamping recording start and end

Let’s complicate the task a little. Now for each entry, it is necessary to form the start time and the end time (“doctor’s appointment”). In this case, all records must follow the original sequence in time and the end time for each record must be no earlier than the start time.

We will reduce the number of considered scenarios, leaving only the most interesting ones. And we don’t even mention any options with explicit cycles.

Option 1. Straight

The first most reasonable and obvious option is to create start marks, as in the previous task, and then put a random ending for each mark so that the rules are followed. We iterate line by line, there is a specificity that the fractional start of the next case will, as a rule, be less than the beginning of the last operation of the previous one. But this is also solvable.

In general, less than a second, but obviously this is VERY far from optimal.

  median `itr/sec` mem_alloc 
 28.16ms      30.7    2.06MB

1 2	median `itr/sec` mem_alloc 28.16ms 30.7 2.06MB

Option 2. Single-pass, through composite index and matrices

The idea is as follows. We need to get 2 columns, in which the values will be sorted line by line and the value of the first column does not exceed the value of the second column for each line. But this is a matrix! A matrix is a single chunk in memory, accompanied by sharding attributes. Working with them is much faster than classic data frames.

In an easy way, time and memory were reduced by almost 30 times! And the code has become much simpler and more straightforward.

  median `itr/sec` mem_alloc 
  1.04ms     733.    74.38KB

1 2	median `itr/sec` mem_alloc 1.04ms 733. 74.38KB

Option 2+1/2. Single-pass, through the composite index, matrices and `set`

Perfectionism in action. Sped up 4 times more.

  median `itr/sec` mem_alloc 
 278.1us    2781.    57.55KB

1 2	median `itr/sec` mem_alloc 278.1us 2781. 57.55KB

Interim summarization

Putting it all together.

  median `itr/sec` mem_alloc 
 28.16ms      30.7    2.06MB 
  1.04ms     733.    74.38KB 
 278.1us    2781.    57.55KB

median `itr/sec` mem_alloc

28.16ms 30.7 2.06MB

1.04ms 733. 74.38KB

278.1us 2781. 57.55KB

The primary naive version was also improved for free by 90 times in performance, and memory consumption was reduced by 35 times.

Conclusion

The examples are quite real, everyone can experiment on their own machine. The absolute figures will naturally differ, but the trend will remain unchanged. To make sure that a large amount of code in the world is written according to the most naive options, it is enough to scratch the repositories on the github. And after all, not always significant optimization is a huge incomprehensible code and a lot of time. The optimized code is even more compact and transparent, intermediate passes are eliminated.

Therefore, when BigData representatives request considerable computing resources and the required time for a relatively small task, there is always a vague doubt that something is wrong under the hood.

Blog Post

Optimizing the code using the sort function as an example

Introduction

Create one timestamp

Option 1. Straight

Option 1 + 1/2. Straight line + fast number generator

Option 2. One-pass, through data.table indexes

Option 3. Single-pass through a composite index

Option 3 + 1/2. Single-pass, through the composite index, use `set`

Interim summarization

Timestamping recording start and end

Option 1. Straight

Option 2. Single-pass, through composite index and matrices

Option 2+1/2. Single-pass, through the composite index, matrices and `set`

Interim summarization

Conclusion

Leave a Reply Cancel reply

Service

Azure .NET News – January 2025

.NET 9 Features

How can developers leverage the new OpenAI libraries in their projects

Categories

Latest Projects

Azure .NET News – January 2025

.NET 9 Features

Blog Post

Optimizing the code using the sort function as an example

Introduction

Create one timestamp

Option 1. Straight

Option 1 + 1/2. Straight line + fast number generator

Option 2. One-pass, through data.table indexes

Option 3. Single-pass through a composite index

Option 3 + 1/2. Single-pass, through the composite index, use set

Interim summarization

Timestamping recording start and end

Option 1. Straight

Option 2. Single-pass, through composite index and matrices

Option 2+1/2. Single-pass, through the composite index, matrices and set

Interim summarization

Conclusion

Related Posts

Git and team development

How can developers leverage the new OpenAI libraries in their projects

How to make an API for any Android application

Leave a Reply Cancel reply

Service

Azure .NET News – January 2025

.NET 9 Features

How can developers leverage the new OpenAI libraries in their projects

Categories

Tags

Tags

Latest Projects

Azure .NET News – January 2025

.NET 9 Features

Option 3 + 1/2. Single-pass, through the composite index, use `set`

Option 2+1/2. Single-pass, through the composite index, matrices and `set`