4.9 on Clutch

Maryna Medushevska

How Following Clean Code Practices Can Hurt Performance And What to Do

Product engineering

Oleksandr Mamchich, the head of mobile development at Syndicode, continues sharing his experience in creating high-performance solutions. Read his previous post if you are interested in clean code principles and how they help improve code readability and simplify maintenance.

High-performance use cases, such as real-time processing, artificial intelligence, and augmented reality, require highly optimized solutions to handle the heavy workload and ensure a good user experience.

Mobile development presents a special challenge in this regard since mobile developers are limited not only by peak performance of hardware but also by long-term load capabilities and a limited power source.

While clean code, particularly the SOLID approach, is a popular way of adhering to the above principles, it can lead to overengineering, resulting in unnecessary costs, delays, and complications.

This article will provide several benchmarks and explore how to modify clean code practices to boost performance while keeping code maintainability.

Violating clean code principles to improve performance

Example 1

Consider a scenario where we have a plain area filled with various shapes, such as points, circles, and rectangles. We need to calculate the sum of squares of shapes within a specific rectangular area.

This is a high-performance task as it involves processing a large number of objects and performing complex calculations on them. C++ is a popular language for this kind of task, and the principles of writing clear code apply to it equally.

class Figure
{
public:
    Figure() {}
    virtual ~Figure() {}
    virtual bool isInside(const Rect2f& area) const = 0;
    virtual float square() const = 0;
    float squareIfInside(const Rect2f& area);
private:
};


class Point: public Figure
{
public:
    Point() {}
    Point(const Rect2f& area);
    ~Point() {}
    bool isInside(const Rect2f& area) const override
    {
        return common::Evaluate::isPointInArea(_pos, area);
    }
    float square() const override
    {
        return common::Evaluate::pointSquare(_pos);
    }
private:
    Vec2f _pos;
};


class Circle: public Figure
{
public:
    Circle(): _radius(0) {}
    Circle(const Rect2f& area);
    ~Circle() {}
    bool isInside(const Rect2f& area) const override
    {
        return common::Evaluate::isCircleInArea(_center, _radius, area);
    }
    float square() const override
    {
        return common::Evaluate::circleSquare(_center, _radius);
    }
private:
    Vec2f _center;
    float _radius;
};


class Rectangle: public Figure
{
public:
    Rectangle() {}
    Rectangle(const Rect2f& area);
    ~Rectangle() {}
    bool isInside(const Rect2f& area) const override
    {
        return common::Evaluate::isRectangleInArea(_pos, _size, area);
    }
    float square() const override
    {
        return common::Evaluate::rectangleSquare(_pos, _size);
    }
private:
    Vec2f _pos;
    Vec2f _size;
};

The fragment above defines a hierarchy of classes for different types of figures that can be placed on a 2D plane.

Now let’s create a world, fill it with random objects, and shuffle them to simulate random creation and destruction over a runtime. Then, we will run an evaluation of the square multiple times to get a reliable execution duration.

void World::initialize()
{
    srand(100500);
    int objectsCount = (int)_figures.size();
    for (int i = 0; i < objectsCount; ++i)
    {
        if (_figures[i] != nullptr)
            delete _figures[i];
        _figures[i] = createRandomFigure(_allArea);
    }
    
    // Shuffle to simulate non-sequential objects creation
    for (int i = 0; i < objectsCount; ++i)
    {
        int i0 = rand() % objectsCount;
        int i1 = rand() % objectsCount;
        if (i0 != i1)
        {
            Figure* figure = _figures[i0];
            _figures[i0] = _figures[i1];
            _figures[i1] = figure;
        }
    }
}


double World::benchmark(int aCount, long long& oResult)
{
    using namespace std::chrono;
    auto start = high_resolution_clock::now();
    double square = 0.0;
    for (int c = 0; c < aCount; ++c)
    {
        for (auto & f : _figures)
            square += f->squareIfInside(_viewArea);
    }
    oResult = (long long)(square / aCount);
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    return 0.001 * duration.count() / aCount;
}

The following code fragment runs the benchmark.

void run_shapes_test()
{
    printf("Running Shapes Test\n");
    static const int kObjectsCount = 100000;
    static const int kIterationsCount = 1000;
    double ref = 0.0;
    {
        s1::World world(kObjectsCount);
        world.initialize();
        long long result = 0;
        double ms = world.benchmark(kIterationsCount, result);
        ref = ms;
        printf("V1 T = %10.6f ms/pass | Boost: %.2f | %lld\n", ms, ref / ms, result);
    }
}

And here’s the result of running a benchmark test.

V1 T =   1.583134 ms/pass | Boost: 1.00 | 5864078

We’ve got 1.58 ms per pass through 100,000 figures–not bad.

Note that the square of a point is always zero. From a logical point of view, we could skip all the points. However, the clean code principles won’t let us do that.

But what if we break the OCP (Open Closed Principle) and replace polymorphism with the switch operator? Let’s change our code in the following way: add an enum of figure types and make methods of inherited classes non-overriding.

class Figure
{
public:
    enum class Type
    {
        Point = 0,
        Circle,
        Rectangle,
    };
public:
    Figure(Type aType): _type(aType) {}
    virtual ~Figure() {}
    float squareIfInside(const Rect2f& area);
private:
    Type _type;
};
// Inherited classes remain intact except the 'override' keyword

The square evaluation method implementation now contains a fat switch. Let’s remove it.

template <Figure::Type FT>
static float impl_squareIfInside(const Figure* f, const Rect2f& area)
{
    typedef typename Cast<FT>::FigureType FigureType;
    const FigureType* ptr = (const FigureType*)f;
    return ptr->isInside(area) ? ptr->square() : 0.0f;
}


float Figure::squareIfInside(const Rect2f& area)
{
    switch (_type)
    {
        case Type::Point:
            return impl_squareIfInside<Type::Point>(this, area);
        case Type::Circle:
            return impl_squareIfInside<Type::Circle>(this, area);
        case Type::Rectangle:
            return impl_squareIfInside<Type::Rectangle>(this, area);
    }
}

The benchmark code remains the same. Let’s run!

V1 T =   1.583134 ms/pass | Boost: 1.00 | 5864078
V2 T =   1.132633 ms/pass | Boost: 1.40 | 5864078

Instead of pasting the long assembler code here, let us explain what just happened. The intersection and square evaluation methods have in-place declarations, so they are all inlined, resulting in plain execution code with no virtual or any calls.

What if we take it a step further and break the DIP (Dependency Inversion Principle)? Instead of having an abstract base class with multiple implementations, we will define a concrete `Figure` with enough memory to support any of its inherited classes and a switch statement inside.

Additionally, since we now only have one `Figure` of one size, we can place it into an array to mitigate the memory fragmentation issue. Here is the code.

class Figure
{
public:
    enum class Type
    {
        Point = 0,
        Circle,
        Rectangle,
    };
public:
    Figure();
    Figure(Type aType, const Rect2f area);
    ~Figure() {}


    float squareIfInside(const Rect2f& area)
    {
        switch (_type)
        {
            case Type::Point:
                return common::Evaluate::isPointInArea(_ver[0], area) ? common::Evaluate::pointSquare(_ver[0]) : 0.0f;
            case Type::Circle:
                return common::Evaluate::isCircleInArea(_ver[0], _ver[1].x, area) ? common::Evaluate::circleSquare(_ver[0], _ver[1].x) : 0.0f;
            case Type::Rectangle:
                return common::Evaluate::isRectangleInArea(_ver[0], _ver[1], area) ? common::Evaluate::rectangleSquare(_ver[0], _ver[1]) : 0.0f;
        }
    }


private:
    Type _type;
    Vec2f _ver[2];
};

The above code is much more cache-friendly compared to the previous version and doesn’t have static casts, considerably saving time.

V1 T =   1.583134 ms/pass | Boost: 1.00 | 5864078
V2 T =   1.132633 ms/pass | Boost: 1.40 | 5864078
V3 T =   1.015603 ms/pass | Boost: 1.56 | 5864078

We will now go savage and break the encapsulation. Instead of owning its memory, the `Figure` class will rely on an external allocator to manage the allocation and storage of the figure data while only storing the figure type and its data address in the allocator’s space.

class Figure
{
public:
    enum class Type
    {
        Point = 0,
        Circle,
        Rectangle,
    };
public:
    Figure();
    Figure(Type aType, const Rect2f area, FigureDataAllocator* allocator);
    ~Figure() {}
private:
    Type _type;
    int _id;
};

The initialization code remains the same, but now the `World` class owns all the figures’ data through its ownership of the `FigureDataAllocator.` The `FigureDataAllocator` allocates only the necessary amount of memory for each figure and has internal plain data storage for each figure type.

This means that the `World` can iterate through particular types of figures one by one without needing to use casts, switches, or other memory overheads.

Let’s run the benchmarking operation.

double World::benchmark(int aCount, long long& oResult)
{
    using namespace std::chrono;
    
    auto start = high_resolution_clock::now();
    double square = 0.0;
    for (int c = 0; c < aCount; ++c)
    {
        for (PointData& d : _allocator->getAllPointsData())
        {
            if (common::Evaluate::isPointInArea(d.pos, _viewArea))
                square += common::Evaluate::pointSquare(d.pos);
        }
        for (CircleData& d : _allocator->getAllCircleData())
        {
            if (common::Evaluate::isCircleInArea(d.center, d.radius, _viewArea))
                square += common::Evaluate::circleSquare(d.center, d.radius);
        }
        for (RectangleData& d : _allocator->getAllRectangleData())
        {
            if (common::Evaluate::isRectangleInArea(d.pos, d.size, _viewArea))
                square += common::Evaluate::rectangleSquare(d.pos, d.size);
        }
    }
    oResult = (long long)(square / aCount);
    
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    
    return 0.001 * duration.count() / aCount;
}

The resultant value gives us a 78% boost in the completion time compared to the previous versions.

V1 T =   1.583134 ms/pass | Boost: 1.00 | 5864078
V2 T =   1.132633 ms/pass | Boost: 1.40 | 5864078
V3 T =   1.015603 ms/pass | Boost: 1.56 | 5864078
V4 T =   0.890630 ms/pass | Boost: 1.78 | 5864078

At this point, we can make some minor optimizations, such as removing points from evaluation or multiplying by pi only once instead of for each circle. Here are the results.

V1 T =   1.583134 ms/pass | Boost: 1.00 | 5864078
V2 T =   1.132633 ms/pass | Boost: 1.40 | 5864078
V3 T =   1.015603 ms/pass | Boost: 1.56 | 5864078
V4 T =   0.890630 ms/pass | Boost: 1.78 | 5864078
V5 T =   0.654476 ms/pass | Boost: 2.42 | 5864078

Impressive, isn’t it? Let’s break down the execution time of the first benchmarking—1.58 ms—into those of individual tasks:

0.69 ms, or 43.6% of the time, is purely an overhead for the “clear code”;
0.24 ms, or 15.1% of the time, cannot be optimized due to the “clear code”;
0.65 ms, or 41.1% of the time, is a useful payload required for computing.

Doesn’t it look like a disaster? To put things into perspective, consider that the iPhone 14 outperforms the iPhone 11 by 30% in single-core performance, according to Geekbench single-core benchmark. This means the “clear code” overhead alone accounts for more than three years’ worth of hardware evolution in this particular test.

Meanwhile, the code itself only grew by 30% and contained just one more entity than the original version. Additionally, all the broken principles remain completely hidden and transparent to the end user of the code.

Example 2

Let’s take a look at another code example. We have an `Image` class that’s always four-channel for simplicity. It also comes with both safe and unsafe accessors to its data.

However, the unsafe accessors break the DIP as users of the `Image` class must ensure that X and Y values are within valid ranges.

class Image
{
public:
    Image(int aWidth, int aHeight);
    ~Image();
    
    inline simd::uchar4 getPixel(int x, int y) const
    {
        if (x < 0 || y < 0 || x >= _width || y >= _height)
            return simd::uchar4();
        
        return getPixel_unsafe(x, y);
    }
    
    inline void setPixel(int x, int y, const simd::uchar4& aPixel)
    {
        if (x < 0 || y < 0 || x >= _width || y >= _height)
            return;
        
        setPixel_unsafe(x, y, aPixel);
    }
    
    inline simd::uchar4 getPixel_unsafe(int x, int y) const
    {
        return _data[y * _width + x];
    }
    
    inline void setPixel_unsafe(int x, int y, const simd::uchar4& aPixel)
    {
        _data[y * _width + x] = aPixel;
    }


private:
    std::shared_ptr<std::vector<simd::uchar4>> _holder;
    simd::uchar4* _data;
    int _width;
    int _height;
};

The code below is the benchmark. First, an image of size 4096×3072 is created and filled with random values. Then, the image is traversed using both safe and unsafe accessors. To ensure consistency, all the pixels are XORed.

{
    auto start = high_resolution_clock::now();


    uint32_t check = 0;


    for (int i = 0; i < kIterations; ++i)
    {
        for (int x = 0; x < kWidth; ++x)
        {
            for (int y = 0; y < kHeight; ++y)
            {
                simd::uchar4 p = image.getPixel(x, y);
                uint32_t* pp = (uint32_t*)&p;
                check ^= *pp;
            }
        }
    }


    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    double ms = 0.001 * duration.count() / kIterations;
    ref = ms;


    printf("V1 T: %10.6f ms/pass | B: %8.2f | %lx\n", ms, ref / ms, (unsigned long)check);
}


{
    auto start = high_resolution_clock::now();


    uint32_t check = 0;
    
    for (int i = 0; i < kIterations; ++i)
    {
        for (int x = 0; x < kWidth; ++x)
        {
            for (int y = 0; y < kHeight; ++y)
            {
                simd::uchar4 p = image.getPixel_unsafe(x, y);
                uint32_t* pp = (uint32_t*)&p;
                check ^= *pp;
            }
        }
    }
    
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    double ms = 0.001 * duration.count() / kIterations;
    
    printf("V2 T: %10.6f ms/pass | B: %8.2f | %lx\n", ms, ref / ms, (unsigned long)check);
}

The benchmark results look promising, with an 85% performance boost.

V1 T:  71.186636 ms/pass | B:     1.00 | 6d39f145
V2 T:  38.423909 ms/pass | B:     1.85 | 6d39f145

Now, let’s make a small change by swapping the ‘x’ and ‘y’ parameters in the ‘for’ cycles like this:

{
    auto start = high_resolution_clock::now();


    uint32_t check = 0;
    
    for (int i = 0; i < kIterations; ++i)
    {
        for (int y = 0; y < kHeight; ++y)
        {
            for (int x = 0; x < kWidth; ++x)
            {
                simd::uchar4 p = image.getPixel_unsafe(x, y);
                uint32_t* pp = (uint32_t*)&p;
                check ^= *pp;
            }
        }
    }
    
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    double ms = 0.001 * duration.count() / kIterations;
    
    printf("V3 T: %10.6f ms/pass | B: %8.2f | %lx\n", ms, ref / ms, (unsigned long)check);
}

Someone might think it was a mistake, but look at the benchmarking results

V1 T:  71.186636 ms/pass | B:     1.00 | 6d39f145
V2 T:  38.423909 ms/pass | B:     1.85 | 6d39f145
V3 T:   1.337364 ms/pass | B:    53.23 | 6d39f145

By going through the ‘x’ parameter in the internal cycle and through ‘y’ parameter in the external cycle, we achieve a 53.23 times faster code due to its cache-friendly nature. This method allows for the traversal of a 48 Mb image from its beginning to the end without any jumps.

However, going through the ‘x’ parameter in the external cycle and ‘y’ in the internal cycle creates 12 kB long jumps, occurring at least 4096 times. This leads to the CPU’s branching prediction and speculative execution becoming almost impossible, and thousands of cache miss conditions spawn, followed by delays in cache warming up.

Although the method used in this example is a direct breach of the DIP, it is a feature of almost every modern computer, and knowing this can boost performance by dozens of times.

Summary: clean code is not a dogma

The examples provided in this post do not aim to undermine the importance of clean code concepts, SOLID principles, and other coding practices. In the majority of projects, these rules do assist in the development of successful and durable projects from a technical perspective.

The main objective here was to demonstrate that there are instances where you must step back from conventional clean code approaches to build a great application.

It’s not uncommon to come across dogmatic statements claiming that certain approaches, like OOP and SOLID, are the best and only way to go. However, developers must view clear code conventions as a useful tool that should be applied wisely.

Improper use of these conventions can result in overengineering, performance degradation, power consumption issues, or longer development times.

Innovation

Services

Technologies

Solutions

How Following Clean Code Practices Can Hurt Performance And What to Do

Violating clean code principles to improve performance

Example 1

Example 2

Summary: clean code is not a dogma