Performance advice in C++


Performance advice

Before talking about performance, keep these five rules in mind:

  1. Never assume some operations are faster than others. When it comes to optimizing, never assume anything: benchmark everything, and use a profiler.
  2. Optimization is one of the last steps of a project. Plan for it, but don’t optimize too soon.
  3. Write the code without optimizations first. Code should be correct first.
  4. Have a strategy for what to optimize (set design goals for performance levels). Be specific.
  5. Don’t measure the debug build; measure the release build.



1. Preliminary Definitions

1. A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU — by highest level I mean the largest == slowest. The cache hit rate is crucial for performance, since every cache miss results in fetching data from RAM (or worse …) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.

2. The principle of locality is to place related data close in memory to allow efficient caching.

3. Cache lines

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have different independent caches, including instruction and data caches, where the data cache is usually organized as a hierarchy of more cache levels (L1, L2, etc.).

How do cache lines work?
If the cache line containing the byte or word you’re loading is not already present in the cache (cache miss), your CPU will request the 64 bytes that begin at the cache line boundary.
Modern PC memory modules transfer 64 bits (8 bytes) at a time, in a burst of eight transfers, so one command triggers a read or write of a full cache line from memory.

4. Prefetcher

  • improves performance transparently, without the programmer noticing
  • Sits between the CPU and the cache and listens to the traffic. If it notices a pattern, it starts prefetching data.
  • Even when its guesses are imperfect, the prefetcher can still speed things up (10-30%).

5. False sharing

False sharing is an annoying effect that occurs when two processors apparently do not share any resource, but due to the underlying hardware architecture they actually do. For example, consider two threads writing to non-overlapping memory locations: they need no synchronization, so you are happily led to believe you can split your data and implement a lock-free algorithm. Unfortunately, you hold a sparkling “Centrino Duo” processor in which both cores share the L2 cache, and the data you partitioned in memory can be mapped onto the same cache line. The same scenario occurs with L1 caches: because of the coherency protocols, if one thread writes to a cache line, the line referring to the same memory is invalidated on the other processor (cache thrashing).

Consider for example the following case:
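The original snippet is missing here; as a minimal sketch (names and loop counts are mine), imagine two threads, each incrementing its own counter, with both counters sitting in the same cache line:

    #include <thread>

    struct Counters {
        long a = 0;   // written only by thread 1
        long b = 0;   // written only by thread 2, but on the SAME cache line as 'a'
        // Fix: alignas(64) long b;  (or use std::hardware_destructive_interference_size)
    };

    void demo() {
        Counters c;
        std::thread t1([&c] { for (int i = 0; i < 100000000; ++i) ++c.a; });
        std::thread t2([&c] { for (int i = 0; i < 100000000; ++i) ++c.b; });
        t1.join();
        t2.join();
        // No data race, yet each write invalidates the line in the other
        // core's cache, so the threads slow each other down dramatically.
    }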

Very likely those 20 bytes will lie in contiguous memory locations and therefore be mapped to the same cache line, so each time a thread works on its own vector it invalidates the cache line for the other thread, and the hardware has to write the line back to memory and fetch it again even though that is not strictly needed.


2. Optimization techniques

General

1. Prefer static linking to dynamic linking, and prefer PDC (position-dependent code) over PIC (position-independent code)
Many optimizations are not available with dynamic linking, such as:

  • whole program optimization
  • local jumps
  • inlining

2. Prefer 64-bit code and 32-bit data

  • In hardware, all optimization effort goes into the newer 64-bit execution paths; 32-bit code is in support mode now, so it runs slower.
  • 32-bit data is smaller, so it puts less pressure on the cache system.

3. Prefer a[i++] to a[++i]

  • It’s faster because there are fewer data dependencies.
  • With i++, the CPU can increment i simultaneously with fetching the array element -> it uses more hardware, in parallel.
  • With ++i, we must first increment i and only then fetch the element -> a sequence of operations that must occur in order.

4. Prefer regular memory access patterns

Prefetching! The cache subsystem fetches blocks of memory before you use them, so you don’t have to wait for memory to be read into the cache.
It detects whether you’re iterating forwards or backwards.

5. Minimize jumps, avoid data dependencies

Avoid jumps (keep the code path predictable so the cache can be exploited) and avoid dependencies between data (we cannot run instructions in parallel if one depends on another).
Control flow (jumps) and data dependencies (no instructions in parallel) are the biggest enemies of today’s architectures.

6. Use the STL containers.

  • It’s standard – most people already know them.
  • It’s only going to get better as STL vendors focus their efforts on optimization and compiler vendors improve template compilation.
  • It’s already written, debugged, and tested.

7. Consider Using References Instead of Pointers

  • There’s no need to check that the reference is not NULL. References are never NULL
  • References don’t require the * dereference operator. Less typing. Cleaner code.
  • There’s a chance that references could be better optimized by the compilers. Pointers make it extremely difficult for a compiler to know when different variables refer to the same location, which prevents the compiler from generating the fastest possible code.

8. Pass class objects by reference, not by value

  • Passing an object by value requires that the entire object be copied (copy ctor), whereas passing by reference does not invoke a copy constructor, though you pay a “dereference” penalty when the object is used within the function.
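A minimal sketch (Matrix is a hypothetical heavyweight type):

    struct Matrix { double data[64 * 64]; };   // hypothetical 32 KB object

    void by_value(Matrix m);         // copy constructor runs: 32 KB copied per call
    void by_ref(const Matrix& m);    // no copy; only a dereference when 'm' is used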

9. Prefer Initialization over Assignment / Use Constructor Initialization Lists

  • Initializing an object invokes a single constructor (for example, the copy constructor). That’s it. Defining an object and then assigning to it invokes both the default constructor and the assignment operator. Why take two steps when one will do?
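For example (a sketch):

    #include <string>

    class Employee {
        std::string name_;
    public:
        // One step: name_ is constructed directly from 'name'.
        explicit Employee(const std::string& name) : name_(name) {}

        // Two steps: name_ would be default-constructed, then assigned.
        // explicit Employee(const std::string& name) { name_ = name; }
    };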

10. Be Aware of the Cost of Virtual Functions

  • A single virtual function in a class requires that every object of the class (and of derived classes) contain an extra pointer. For simple classes, this can as much as double the size of the object. Creating objects with virtual functions costs more than creating objects without them, because the virtual function table pointer must be initialized. And calling virtual functions takes slightly longer, because of the additional level of indirection.

11. Consider Per-class Allocation

  • One feature of C++ that shows the true power and flexibility of the language is the ability to overload new and delete at the class level.
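A minimal sketch of the idea (illustration only, not thread-safe): a class that recycles its own fixed-size blocks through an intrusive free list.

    #include <cstddef>
    #include <new>

    class Particle {
        double x = 0, y = 0, z = 0;
    public:
        static void* operator new(std::size_t size) {
            if (size == sizeof(Particle) && free_list) {
                void* block = free_list;                  // reuse a freed block
                free_list = *static_cast<void**>(block);
                return block;
            }
            return ::operator new(size);                  // fall back to global new
        }
        static void operator delete(void* ptr, std::size_t size) noexcept {
            if (size == sizeof(Particle)) {
                *static_cast<void**>(ptr) = free_list;    // push onto the free list
                free_list = ptr;
            } else {
                ::operator delete(ptr);
            }
        }
    private:
        static inline void* free_list = nullptr;          // C++17 inline static
    };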

Storage

1. Use static const for all immutables
2. Use the stack for most variables: stack variables have zero addressing cost.
3. Globals: try not to use them. Before doing anything with a global, the compiler must assume it may have been modified by other threads and reload it from memory. They are slow to access, as bad as a function call.
4. Use local caching (local stack objects) when possible, rather than sharing data between threads.

Integrals

1. Prefer 32-bit integers

  • The reason is that the ALU (arithmetic logic unit) in the CPU can do one 64-bit operation at a time, or two 32-bit operations at the same time, using the same hardware and bus.
  • Using 32-bit operations can therefore double throughput.
  • If you think smaller is better: 8- and 16-bit values are converted to 32 bits for arithmetic anyway, so there’s no optimization there.

2. Prefer unsigned to signed: it is faster, except when you convert unsigned to floating point.
3. Apply the law of small numbers: most numbers are small. Take that into account when optimizing code.

Floating point

  • Inside the CPU there is more than one unit for manipulating FP numbers, so work can proceed in parallel.
  • Conversion from int to FP is cheap; from FP to int it is very expensive.

Strength reduction

Speed hierarchy (from cheapest to most expensive):

  1. comparisons
  2. (u)int add, subtract, bit ops, shift
  3. FP add, sub (done by a separate unit, so very fast)
  4. (u)int32 mul, FP mul
  5. FP division, remainder
  6. (u)int division, remainder

Why is integer division slower than FP division?

  • The format of floating-point numbers, mantissa and exponent => the exponential format in which FP numbers are stored is already friendly to multiplication and division.
  • The mantissa can approximate, with a precision trade-off you can’t afford when dividing integers (integer division must be exact).

Example of code
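The code blocks did not survive here; what follows is a plausible reconstruction, assuming the classic “count decimal digits” example. Before:

    unsigned digits10(unsigned long v) {
        unsigned result = 1;
        while (v >= 10) {    // one integer division per digit
            v /= 10;
            ++result;
        }
        return result;
    }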

After:
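Continuing the reconstruction above:

    unsigned digits10(unsigned long v) {
        unsigned result = 1;
        for (;;) {                           // infinite loop: one jump at the bottom
            if (v < 10)    return result;    // early exits for the common cases
            if (v < 100)   return result + 1;
            if (v < 1000)  return result + 2;
            if (v < 10000) return result + 3;
            v /= 10000;                      // one division now handles four digits
            result += 4;
        }
    }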

  • Reduced the number of divisions
  • Early-exit for most usual cases
  • More comparisons and additions, fewer /=
  • Infinite loop => one jump
  • Trading expensive divisions vs cheap comparisons

For optimization, we always start with infinite loops. It’s the fastest code (a single unconditional jump at the end), and inside you can organize the computations at their best.

Loop unrolling

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program’s execution speed at the expense of its binary size, which is an approach known as space–time tradeoff.

The goal of loop unwinding is to increase a program’s speed by reducing or eliminating instructions that control the loop, such as pointer arithmetic and “end of loop” tests on each iteration; reducing branch penalties; as well as hiding latencies including the delay in reading data from memory. To eliminate this computational overhead, loops can be re-written as a repeated sequence of similar independent statements.
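A minimal sketch: summing an array four elements per iteration (assumes n is a multiple of 4; a real version also handles the remainder):

    #include <cstddef>

    long sum_unrolled(const int* a, std::size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // independent accumulators
        for (std::size_t i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;   // one loop test per four elements
    }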

Use static hinting for the compiler – GNU Example
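The original example is missing; a typical use of GCC/Clang’s __builtin_expect looks like this:

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int process(const int* p) {
        if (unlikely(p == nullptr))   // error path moved out of the hot code
            return -1;
        return *p * 2;                // hot path laid out as fall-through
    }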

This is shown only as an example and should be used very, very carefully: if even 1 out of 10 executions takes the “unlikely” path, the hint will actually make the program slower than having no hint at all. It is better to split the hot path from the error-handling path and let the compiler decide what to optimize and what to keep in the cache.

Minimize indirect writes (writes through a pointer) -> something that requires an indirection

  • In hardware, indirect writes are very difficult to optimize, to speculate on, and to keep in a register
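A minimal sketch: accumulate into a local variable and write through the pointer once at the end.

    #include <cstddef>

    // Indirect write on every iteration: the store through 'out' is hard
    // for the compiler to keep in a register.
    void count_positive_slow(const int* a, std::size_t n, int* out) {
        *out = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (a[i] > 0) ++*out;
    }

    // Accumulate in a local, write once: the hot loop runs in registers.
    void count_positive_fast(const int* a, std::size_t n, int* out) {
        int count = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (a[i] > 0) ++count;
        *out = count;
    }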

Keep in mind that a write is a read followed by a write

  • when you transfer data to or from memory, you don’t transfer a word or a byte, but a cache line (64 bytes) -> that’s the unit of transfer.
  • The CPU reads the whole cache line, changes the specific word/byte/etc., and puts it back => a read and a write (the unit of writing is a cache line)
  • if you happen to write a whole 64 bytes of data at a time, and those bytes are cache-aligned => that write is twice as fast => no read has to occur.

When you write memory, even if it’s identical to what was there before, the CPU marks the cache line as dirty.
Whatever cache is dirty is going to be written back to main memory (around 100 cycles). The more dirty cache lines you have, the more writes you need to do.

Instruction-level-parallelism (ILP)

  • pipelining
  • superscalar execution
  • out-of-order execution
  • register renaming
  • speculative execution

If we want better ILP, we need fewer data dependencies.
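A minimal sketch of removing a data dependency: two independent chains instead of one (note that reassociating floating-point operations can change the result slightly):

    #include <cstddef>

    // One dependency chain: every multiply waits for the previous one.
    double prod_serial(const double* a, std::size_t n) {
        double p = 1.0;
        for (std::size_t i = 0; i < n; ++i)
            p *= a[i];
        return p;
    }

    // Two independent chains the CPU can execute in parallel.
    double prod_ilp(const double* a, std::size_t n) {
        double p0 = 1.0, p1 = 1.0;
        std::size_t i = 0;
        for (; i + 1 < n; i += 2) {
            p0 *= a[i];
            p1 *= a[i + 1];
        }
        if (i < n) p0 *= a[i];    // odd element, if any
        return p0 * p1;
    }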


3. Performance advice

  1. Don’t declare (actually, define) object variables before their use/initialization (as in old C code). Doing so runs both the default constructor AND the assignment operator; for complex objects, that can be costly.
  2. Prefer pre-increment to post-increment. This will only matter for iterators and user-defined types with overloaded operators.
  3. Use the smallest primitive types possible. Don’t use a long int to store a value in the range 0..5. This will reduce overall memory usage, improving locality and thus overall performance.
  4. Use heap memory (dynamic allocation) only when necessary. Dynamic allocations and deallocations are expensive.
  5. Minimize the use of temporaries (particularly with string-based processing).
  6. Know your compiler/linker options. Also know which ones cause non-standard behavior.
  7. Know the performance/functionality tradeoffs of the STL containers (e.g., don’t frequently insert into a vector, use a list).
  8. Consider the order of evaluation of compound conditionals. E.g., given if (a && b), if b is more likely to be false, place it first to save the evaluation of a.
  9. Use the right container
    1. Sequence containers
      • Do not use vector for data of unknown size if you are going to keep adding data to it. If you are going to repeatedly call push_back(), either use reserve() first (see the sketch after this list) or use a deque instead.
      • If you are going to be adding/removing data in the middle of the container, list is probably the right choice.
      • If you are going to be adding/removing data from both ends of the container, deque is probably the right choice.
      • If you need to access the nth element of the container, list is probably the wrong choice.
      • If you need to both access the nth element of the container and add/remove elements in the middle, benchmark all three containers.
      • If you have C++0x capability and are using a list but you never move backwards through the list, you may find forward_list more to your liking. It won’t be faster but it will take up less space.
      • Note that this advice becomes more applicable the larger the container. For smaller containers, vector may always be the right choice simply because of the lower constant factors. When in doubt, benchmark.
    2. Associative containers
      • Prefer unordered_containers over the ordered version if you do not care about the order of the elements.
  10. Use exceptions judiciously
    • Don’t use exceptions in your normal code path. Save them for when you actually have an exceptional circumstance.
  11. Love templates
    • Templates will cost you at compile time and space-wise, but the performance gains can be amazing if you have calculations that would otherwise be performed at run-time; sometimes even something so subtle as a downcast.
  12. Avoid dynamic_cast
    • dynamic_cast is sometimes the only choice for doing something, but oftentimes the use of dynamic_cast can be eliminated by improving design.
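As promised above, a minimal sketch of the reserve() advice from 9.1:

    #include <cstddef>
    #include <vector>

    std::vector<int> make_squares(std::size_t n) {
        std::vector<int> v;
        v.reserve(n);   // one allocation up front: no regrowth (and no copying
                        // of already-stored elements) during the loop
        for (std::size_t i = 0; i < n; ++i)
            v.push_back(static_cast<int>(i * i));
        return v;
    }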



4. Performance tricks for cache friendly code

Spatial locality of the Cache – Array traversal

The following example is quite slow, because it will cause a lot of cache misses.
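A sketch of the two traversal orders (the original snippet is missing):

    #include <cstddef>

    constexpr std::size_t ROWS = 1024, COLS = 1024;
    static int matrix[ROWS][COLS];

    // Slow: column-first. Consecutive accesses are COLS * sizeof(int) bytes
    // apart, so almost every access touches a new cache line.
    long sum_column_major() {
        long sum = 0;
        for (std::size_t j = 0; j < COLS; ++j)
            for (std::size_t i = 0; i < ROWS; ++i)
                sum += matrix[i][j];
        return sum;
    }

    // Fast: row-first. Accesses are contiguous; one cache line serves
    // 16 ints, and the prefetcher can stream ahead.
    long sum_row_major() {
        long sum = 0;
        for (std::size_t i = 0; i < ROWS; ++i)
            for (std::size_t j = 0; j < COLS; ++j)
                sum += matrix[i][j];
        return sum;
    }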

Traversing by rows will be a lot faster than traversing by columns.


Execution time will first be bound by computation => how fast the CPU executes the statements inside the for loop.

As the rows and columns grow, execution time becomes bound by data access => how fast the CPU can get the data from memory.

Usage of data

When several pieces of data are used together, we should gather them together in memory in order to benefit from cache locality and cache hits.

Before:
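The snippet is missing; a sketch with hypothetical fields:

    struct Unit {
        float x, y;        // hot: touched every frame
        char  name[64];    // cold: touched only in the UI
        float dx, dy;      // hot: touched every frame
    };
    // While iterating, each 64-byte cache line carries mostly cold bytes.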

After:
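Splitting hot and cold fields into separate structures:

    struct UnitHot  { float x, y, dx, dy; };   // 16 bytes: 4 units per cache line
    struct UnitCold { char name[64]; };        // kept in a separate array
    // The hot loop now streams through densely packed, cache-friendly data.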

Temporal cache coherency

We should initialize and construct objects (especially big ones) right before we use them. By delaying this, we ensure the cost is paid only when needed, and that the data will still be in the cache when we try to use it.

Avoid unpredictable branches

The branch predictor’s job is to identify a pattern and follow it. To speed things up, it guesses the direction a branch will take and speculatively executes that code.
If the guess was right, we get a performance gain; otherwise, the pipeline is flushed, execution rolls back to the branch, and restarts down the other path.
If you want to write well-performing code, try to avoid unpredictable branches or replace them with other operations.
If we need to count or otherwise process a list of items conditionally, we should sort the data first. This gives a great gain in performance because the branch becomes easy to predict (a long streak of false values followed by a streak of true values).
Another example is to change the logic, as in the following sequence:

1. Original loop
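All four code blocks are missing here; the snippets below are a plausible reconstruction (summing the rows of a matrix whose row is marked as included; later steps reuse these declarations):

    #include <cstddef>
    #include <numeric>

    constexpr std::size_t N = 1024;
    static float data[N][N];
    static bool  include_row[N];

    // 1. Column-first access, and the branch is evaluated N*N times.
    float sum_v1() {
        float s = 0;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                if (include_row[i])
                    s += data[i][j];
        return s;
    }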

2. Loop interchange (row vs column first)
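Swapping the loops makes the memory accesses sequential:

    // 2. Row-first access: contiguous reads, prefetcher-friendly.
    float sum_v2() {
        float s = 0;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                if (include_row[i])
                    s += data[i][j];
        return s;
    }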

3. Move the if instruction
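The condition depends only on i, so it can leave the inner loop:

    // 3. The branch is now evaluated N times instead of N*N.
    float sum_v3() {
        float s = 0;
        for (std::size_t i = 0; i < N; ++i)
            if (include_row[i])
                for (std::size_t j = 0; j < N; ++j)
                    s += data[i][j];
        return s;
    }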

4. Collapse the inner loop into one expression
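The branch-free inner loop collapses into a single expression:

    // 4. One expression per row replaces the inner loop.
    float sum_v4() {
        float s = 0;
        for (std::size_t i = 0; i < N; ++i)
            if (include_row[i])
                s += std::accumulate(data[i], data[i] + N, 0.0f);
        return s;
    }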

Memory Alignment / padding

Natural memory alignment generally refers to the alignment of individual variables. Natural memory alignment usually pertains to how the CPU’s load/store instructions are architected and implemented.
Assuming the following size:

  • char: 1 byte
  • short: 2 bytes
  • int: 4 bytes
  • double: 8 bytes

total size of the members = 15 bytes

Before
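The struct itself is missing; here is a layout consistent with the numbers below:

    struct Foo {
        char   a;   // 1 byte, then 7 bytes of padding (double needs 8-byte alignment)
        double b;   // 8 bytes
        int    c;   // 4 bytes
        short  d;   // 2 bytes, then 2 bytes of tail padding
    };              // sizeof(Foo) == 24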

Foo size = 24 bytes

Each member must start at an address that is a multiple of its alignment, so the compiler applies padding in order to align the next data entry.
It is important to note that the last member is padded with the number of bytes required so that the total size of the structure is a multiple of the largest alignment of any structure member.

This will be the alignment of our structure in memory (X = byte used, O = padding)
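    X O O O O O O O    <- char a + 7 bytes padding
    X X X X X X X X    <- double b
    X X X X X X O O    <- int c, short d + 2 bytes padding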

After
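Reordering the members from largest to smallest alignment:

    struct Foo {
        double b;   // 8 bytes
        int    c;   // 4 bytes
        short  d;   // 2 bytes
        char   a;   // 1 byte, then 1 byte of tail padding
    };              // sizeof(Foo) == 16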

Foo size = 16 bytes
This will be the alignment of our structure in memory (X = byte used, O = padding)
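    X X X X X X X X    <- double b
    X X X X X X X O    <- int c, short d, char a + 1 byte padding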

Denormals (handling floats with small numbers)

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can’t handle them directly and must trap and resolve them using microcode.

Denormalized numbers (now often called subnormal numbers) fill the underflow gap around zero in floating-point arithmetic. Any non-zero number with magnitude smaller than the smallest normal number is ‘subnormal’.

A small visualisation of this interesting phenomenon:

  • Column 1: a float, divided by 2 for every iteration
  • Column 2: the binary representation of this float
  • Column 3: the time taken to sum this float 1e7 times

You can clearly see the exponent field drop to its lowest value when denormalization sets in. At that point, a simple addition becomes 20 times slower.
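A sketch of the experiment described above (timings assume the FPU is not set to flush-to-zero, i.e., no -ffast-math):

    #include <chrono>
    #include <cstdio>

    int main() {
        volatile float x = 1.0f;                  // halved each step
        for (int step = 0; step < 160; ++step) {
            auto t0 = std::chrono::steady_clock::now();
            volatile float sum = 0.0f;
            for (int i = 0; i < 10000000; ++i)    // sum the float 1e7 times
                sum = sum + x;
            auto t1 = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
            std::printf("%-14g %lld ms\n", (double)x, (long long)ms.count());
            x = x / 2;                            // eventually goes subnormal (~1.2e-38)
        }
    }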


5. Compiler optimization

Compiler C++ Language Settings

The following table lists all of the Microsoft Visual C++ 6.0 “C++” optimizations for reference. Alternate methods are given when an optimization can be specified directly in the code.

Name Option Description
No Vtable __declspec(novtable) Stops the compiler from generating code to initialize the vfptr in the constructor. Apply to pure interface classes for code size reduction.
No Throw __declspec(nothrow) Stops the compiler from tracking unwindable objects. Apply to functions that don’t throw exceptions for code size reduction. Same as using the C++ throw() specification.
Disable RTTI /GR- Turn off run time type information.
Exception handling /GX Turn on exception handling.
Inline expansion /Ob1 Allow functions marked inline to be inline. Alternate: inline, __forceinline, #pragma inline_depth/inline_recursion
Inline any /Ob2 Inline functions deemed appropriate by compiler. Alternate: #pragma auto_inline/inline_depth/inline_recursion
Ctor displacement /vd0 Disable constructor displacement. Choose this option only if no class constructors or destructors call virtual functions. Use /vd1 (default) to enable. Alternate: #pragma vtordisp
Best case ptrs /vmb Use best case “pointer to class member” representation. Use this option if you always define a class before you declare a pointer to a member of the class. The compiler will issue an error if it encounters a pointer declaration before the class is defined. Alternate: #pragma pointers_to_members
Gen. purpose ptrs /vmg Use general purpose “pointer to class member” representation (the opposite of /vmb). Required if you need to declare a pointer to a member of a class before defining the class. Requires one of the following inheritance models: /vmm, /vms, /vmv. Alternate: #pragma pointers_to_members

The Ultimate Compiler Settings

The ultimate options for fast programs.

Name Option Description
Disable Vtable Init __declspec (novtable) Stops compiler from generating code to initialize the vfptr in the constructor. Apply to pure interface classes.
No Throw __declspec (nothrow) Stops compiler from tracking unwindable objects. Apply to functions that don’t throw exceptions. Recommend using the standard C++ exception specification throw() instead.
Pentium Pro /G6 Optimize for PentiumPro and above (program might not run on Pentium)
Windows /GA Optimize for Windows
Fastcall /Gr Fastest calling convention
String pooling RO /GF Merge duplicate strings into one read-only memory location
Disable RTTI /GR- Turn off run time type information.
Stack probes off /Gs Turn off stack checking
Exception handling /GX- Turns off exception handling (assumes the program isn’t using exception handling)
Func-level linking /Gy Only include functions that are referenced
Assume no aliasing /Oa Assume no aliasing occurs within functions
Inline any or inline expansion /Ob2 or /Ob1 Inline any function deemed appropriate by compiler or turn inlining on. Alternates: inline, __forceinline
Global opts /Og Full loop, common subexpression and register optimizations
Intrinsic functions /Oi Replaces specific functions with inline versions (memcpy, strcpy, etc.)
Fast code /Ot Favor code speed over size
Omit frame pointer /Oy Omit frame pointer
Ctor displacement /vd0 Disable constructor displacement.
Best case ptrs /vmb Use best case “pointer to class member” representation

Unsafe Optimizations Ahead – FYI

Don’t change your optimization settings recklessly. Although most settings would never cause your program to crash, there are some settings that should be used only when you know your code is conforming to Microsoft’s recommendations for that particular setting. The following table lists all potentially risky optimizations.

Name Option Notes
Pentium /G5 Code won’t run on 486 or below (use /GB instead)
Pentium Pro /G6 Code won’t run on Pentium or below (use /GB instead)
String pooling /Gf If a string is modified, it will be modified for any variable that points to it
String pooling RO /GF If a string is modified, memory exception occurs
Stack probes off /Gs A stack overflow will crash the program without an overflow error
Exception handling /GX- If exception handling is not enabled, an exception may crash the program
Assume no aliasing /Oa If there is aliasing in the program, the optimization can cause corrupted data
Inline expansion /Ob1 Inlines can cause unexpected code bloat and cache misses
Inline any /Ob2 Inlines can cause unexpected code bloat and cache misses
Intrinsic functions /Oi Intrinsic functions increase code size
Float consistency /Op Resulting floating-point code will be larger and slower
Ctor displacement /vd0 A virtual function may be passed an incorrect “this” pointer if it is invoked from within a constructor or destructor.
Struct packing /Zpn Can cause compatibility problems if packing is modified

6. Other advice

  • be conscious whether you’re bound by data or computation
  • prefer data contiguous in memory
  • if you can’t have contiguous data, prefer constant strides over randomness
  • keep data close together in space
  • keep accesses to same data close together in time
  • avoid dependencies between successive computations
  • avoid dependencies between two iterations of a loop
  • avoid ‘hard to predict’ branches
  • be aware of cache lines & alignment
  • minimize number of cache lines accessed by multiple threads
  • don’t be surprised by hardware weirdness (cache associativity, denormals)

7. Tools for profiling

  1. Use a profiler to see which parts of the application run slowly, and then apply the optimization techniques (valgrind --tool=callgrind).
  2. Familiarize yourself with the Linux perf tools and flame graphs (http://www.brendangregg.com/flamegraphs.html)
  3. Use tools like Intel VTune.
  4. Inspect the generated assembly: even if you don’t understand everything it does, fewer instructions usually means faster code (https://godbolt.org/).
