A number of algorithms can be distinguished that are basic and underlie almost every line of programs written in high-level languages. It is good to have at hand the classic multivolume work of Donald Knuth “The Art of Computer Programming”, where many basic algorithms are analyzed in detail. But reading and assimilating everything is a task that requires a lot of effort and time, which must be somehow motivated.

Many may assume that it was necessary to know the nuances 50 years ago, but now you can use ready-made packages and functions and not dive into details. However, this is not the case. Likewise, no one has canceled the importance of understanding the representation of methods for storing data in memory and processing it in the processor.

Next, let’s look at the nuances using the example of sorting functions. Sorting and searching are used most often in all data manipulation. Saving a few milliseconds on operations can lead to an overall reduction in computing hours on significant data.

Introduction

Let us formulate a simplified formulation of the problem, based on practical problems for working with time series. A set of data is examined containing a number of use cases, within which certain sequences of events occur.


Suppose that for algorithmic processing it is necessary to prepare a test dataset containing three columns:

  • case_id – unique case / use case identifier;
  • record – a log record of an event in a case;
  • start – registration time.

Libraries used

The most interesting task is the generation of timestamps. Events should go sequentially in time for each case separately. First, let’s prepare a simple template. In a particular case, we will take a small number of cases for demonstration. There can be 10^5-10^n in production, which is determined by the tasks.

Now let’s get down to an interesting block – generating timestamps. For simplicity, the problem will be reduced to the distribution of shares in the interval [0; 1] within each case. Let’s leave the translation to the real unixtimestamp outside, it’s not interesting. Explicit looping options are also out of bounds. The execution times are given on a conditional computer, the main thing is that everything is performed on one.

Create one timestamp

Option 1. Straight

This option is offered in most cases. And that, everything is simple and clear.

We get such conditional indicators. Probably not bad. But do not forget that there are only 100 cases.

Let’s think about what can be improved?

Option 1 + 1/2. Straight line + fast number generator

There is a good rTRNG library. At large volumes, it provides a significant acceleration, including due to parallel generation. Let’s just check:

We didn’t get any gain on small volumes. It’s all? Of course not. We know tidyverse is slower than data.table, so let’s try it.

Option 2. One-pass, through data.table indexes

Here we will try to apply the first trick – sort the vector of times by indices, and then reassign it.

It turns out quite well, the acceleration is 15-20 times and the memory is required almost three times less.

Are we stopping? Why yes?

Option 3. Single-pass through a composite index

In fact, as soon as we fall into a loop, either explicitly or via by, we plummet in performance. Let’s try to do everything in one pass. The idea is to make a composite index that would allow us to sort all events in one fell swoop. We use a trick. Since we have all the timestamps inside the case in the range [0; 1], then we can split the index into two parts. The integer part will contain the case_id, the fractional part will contain the time fraction. Sorting one such index once will preserve the belonging of the case_id lines, while in one fell swoop we will sort the values ​​inside each case

We launch and get a gain 2 times more against the previous option, both in time and in memory.

Option 3 + 1/2. Single-pass, through the composite index, use set

Are we stopping? It is possible to stop, although there is still a field for compression. The fact is that with such short execution times, the overhead on the NSE becomes quite significant. If you use direct functions, you can get much better results.

Acceleration by another 5 times, memory consumption is 4 times less

Interim summarization

Putting it all together

By studying the principles of the algorithms and a close look at the problem, the primary naive version was improved by 90 times in performance for free, and the memory consumption was reduced by 18 times. These are indicators that are poorly achievable by horizontal scaling of servers.

Timestamping recording start and end

Let’s complicate the task a little. Now for each entry, it is necessary to form the start time and the end time (“doctor’s appointment”). In this case, all records must follow the original sequence in time and the end time for each record must be no earlier than the start time.

We will reduce the number of considered scenarios, leaving only the most interesting ones. And we don’t even mention any options with explicit cycles.

Option 1. Straight

The first most reasonable and obvious option is to create start marks, as in the previous task, and then put a random ending for each mark so that the rules are followed. We iterate line by line, there is a specificity that the fractional start of the next case will, as a rule, be less than the beginning of the last operation of the previous one. But this is also solvable.

In general, less than a second, but obviously this is VERY far from optimal.

Option 2. Single-pass, through composite index and matrices

The idea is as follows. We need to get 2 columns, in which the values will be sorted line by line and the value of the first column does not exceed the value of the second column for each line. But this is a matrix! A matrix is a single chunk in memory, accompanied by sharding attributes. Working with them is much faster than classic data frames.

In an easy way, time and memory were reduced by almost 30 times! And the code has become much simpler and more straightforward.

Option 2+1/2. Single-pass, through the composite index, matrices and set

Perfectionism in action. Sped up 4 times more.

Interim summarization

Putting it all together.

The primary naive version was also improved for free by 90 times in performance, and memory consumption was reduced by 35 times.

Conclusion

The examples are quite real, everyone can experiment on their own machine. The absolute figures will naturally differ, but the trend will remain unchanged. To make sure that a large amount of code in the world is written according to the most naive options, it is enough to scratch the repositories on the github. And after all, not always significant optimization is a huge incomprehensible code and a lot of time. The optimized code is even more compact and transparent, intermediate passes are eliminated.

Therefore, when BigData representatives request considerable computing resources and the required time for a relatively small task, there is always a vague doubt that something is wrong under the hood.