Various implementations of GF(2^n), written in C++17. Intended as practice (or something similar to a lab session).
Build a "vanilla" GF(2^n) module.Use templates, constexpr & various metaprogramming techniques to pre-calculate exponents and logs in compile-time- the way array< T,n > is constructed ensures we have initialized it in compile-time
- but is there a way to deassemble this and see for ourselves?
make sure that we will not compile if n > 16.Find a way to use a constexpr if to changeRep
depending on the length of genpoly- Find a way to check if genpoly is actually irreducible, and print a warning if it isn't. (i.e. if this "field" isn't actually a field)
- Come up with a way to accurately measure performances
- define accurate
Wrap each in namespaceWrite code to measure performance- Compare & suggest a plausible explanation for all observed phonomena.
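A minimal sketch of what these goals could look like in C++17. All names here (`GF2`, `poly_mod`, `make_exp`, `exp8`, etc.) are my own assumptions, not the repo's code; the final `static_assert`s double as a way to "see for ourselves" that the table is built at compile time, since they could not compile otherwise:

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

constexpr unsigned bit_length(unsigned v) {
    unsigned len = 0;
    while (v) { ++len; v >>= 1; }
    return len;
}

// remainder of a / b in GF(2)[x], i.e. carry-less long division
constexpr unsigned poly_mod(unsigned a, unsigned b) {
    for (unsigned db = bit_length(b); bit_length(a) >= db;)
        a ^= b << (bit_length(a) - db);
    return a;
}

// a reducible polynomial of degree n has a factor of degree <= n/2,
// so trial division over all smaller polynomials suffices
template <unsigned n>
constexpr bool irreducible(unsigned genpoly) {
    for (unsigned d = 2; d < (1u << (n / 2 + 1)); ++d)
        if (poly_mod(genpoly, d) == 0) return false;
    return true;
}

template <unsigned n, unsigned genpoly>
struct GF2 {
    static_assert(n >= 1 && n <= 16, "table-based GF(2^n) is only supported up to n = 16");
    // choose the representation from the field size; for a pure type
    // choice, std::conditional_t does what a `constexpr if` would
    using Rep = std::conditional_t<(n <= 8), std::uint8_t, std::uint16_t>;
    static constexpr unsigned order = 1u << n;
};

// exp table for the generator 0x02: exp[i] = x^i mod genpoly
template <unsigned n, unsigned genpoly>
constexpr std::array<typename GF2<n, genpoly>::Rep, (1u << n)> make_exp() {
    std::array<typename GF2<n, genpoly>::Rep, (1u << n)> t{};
    unsigned v = 1;
    for (unsigned i = 0; i < (1u << n); ++i) {
        t[i] = static_cast<typename GF2<n, genpoly>::Rep>(v);
        v <<= 1;                          // multiply by x, i.e. by 0x02
        if (v & (1u << n)) v ^= genpoly;  // reduce: genpoly has bit n set
    }
    return t;
}

constexpr auto exp8 = make_exp<8, 0x11d>();
// if these compile, the table really was evaluated at compile time:
// one way to "see for ourselves" without disassembling anything
static_assert(exp8[0] == 1 && exp8[1] == 2, "exp table is a constant expression");
static_assert(irreducible<8>(0x11d), "0x11d is irreducible over GF(2)");

int main() {}
```

The notes above ask for a warning rather than a hard error on a reducible `genpoly`; evaluating `irreducible<n>(genpoly)` at startup and printing to stderr would do that in place of the final `static_assert`.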
- Everything runs on a GCP e2-medium instance: 2 vCPUs, 4 GB memory, Intel Broadwell CPU.
- In all measurements, the generating polynomial is 0x11d, and the generator (where needed) is 0x02.
- In `4_refactored.cpp`, only the exp and log tables (relative to the generator element 0x02) are calculated at compile time & cached.
- In `5_refactored_allcaches.cpp`, the multiplication and division tables are calculated at compile time & cached (a sketch contrasting the two lookup strategies follows this list).
- The file starting with `4_1_` measures the performance of `4_refactored.cpp`.
- The file starting with `5_1_` measures the performance of `5_refactored_allcaches.cpp`.
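To make the difference concrete, here is a guess at the two multiplication strategies (my reconstruction, not the repo's code; tables are built at runtime here just to keep the sketch short):

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// tables for GF(2^8) with genpoly 0x11d and generator 0x02
static std::array<std::uint8_t, 256> exp_tab{}, log_tab{};
static std::uint8_t mul_tab[256][256];

static void build_tables() {
    unsigned v = 1;
    for (unsigned i = 0; i < 255; ++i) {
        exp_tab[i] = static_cast<std::uint8_t>(v);
        log_tab[v] = static_cast<std::uint8_t>(i);
        v <<= 1;
        if (v & 0x100) v ^= 0x11d;  // multiply by 0x02, reduce mod genpoly
    }
    for (unsigned a = 0; a < 256; ++a)
        for (unsigned b = 0; b < 256; ++b)
            mul_tab[a][b] = (a && b)
                ? exp_tab[(log_tab[a] + log_tab[b]) % 255]
                : 0;
}

// 4_refactored.cpp-style: two loads, an add, a modulo, one more load
static std::uint8_t mul_via_logs(std::uint8_t a, std::uint8_t b) {
    return (a && b) ? exp_tab[(log_tab[a] + log_tab[b]) % 255] : 0;
}

// 5_refactored_allcaches.cpp-style: a single load from a 64 KiB table
static std::uint8_t mul_via_table(std::uint8_t a, std::uint8_t b) {
    return mul_tab[a][b];
}

int main() {
    build_tables();
    // sanity check: the two strategies agree
    std::printf("%d\n", mul_via_logs(0x57, 0x83) == mul_via_table(0x57, 0x83));
}
```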
- the results are as follows:
  - For each entry, timespans are measured in microseconds.
  - The addition results are printed because the calculation is sometimes omitted entirely under the `-O1` or `-O2` flags if we choose not to print it (a minimal illustration follows this list).
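The elision looks like this (hypothetical code, not from the repo): under `-O2` a loop whose result is never observed can be removed outright, so the "measured" time collapses to almost nothing; printing the sum keeps the work alive.

```cpp
#include <cstdio>

int main() {
    long sum = 0;
    for (int i = 0; i < 20'000'000; ++i)
        sum += i ^ (i + 1);  // stand-in for the additions being timed
    // comment out the next line and -O2 may delete the whole loop above,
    // since no observable behaviour depends on `sum`
    std::printf("%ld\n", sum);
}
```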
- I fail to see how 20,000,000 GF(2^n) multiplications+additions can be faster than 20,000,000 `int` multiplications+additions (with the `5_` file and the `-O2` flag on; a schematic of the two loops follows this list).
  - `int` multiplications are handled by one instruction, additions by another.
  - GF(2^n) multiplications take at least one memory access, additions at least one more.
  - Why & how is a memory access faster than an instruction?
  - Have I missed something? Is something being pipelined in one case and not the other?
  - Or have I messed up (possibly by using `std::vector<pair<T, T>>`)?
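To pin down what is being compared, the two loops have roughly this shape (my reconstruction of the benchmark, not the repo's code; `mul_tab` stands in for the cached table and is left zeroed here):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

static std::uint8_t mul_tab[256][256];  // stand-in for the cached table

int main() {
    std::vector<std::pair<std::uint8_t, std::uint8_t>> data(20'000'000);
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = {std::uint8_t(i), std::uint8_t(i >> 8)};

    auto t0 = std::chrono::steady_clock::now();
    unsigned sum_int = 0;
    for (auto& [a, b] : data) sum_int += a * b;         // one imul + one add
    auto t1 = std::chrono::steady_clock::now();
    unsigned sum_gf = 0;
    for (auto& [a, b] : data) sum_gf ^= mul_tab[a][b];  // one load + one xor
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    // print the sums so neither loop can be elided
    std::printf("int: %lld us (sum %u), gf: %lld us (sum %u)\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count(), sum_int,
                (long long)std::chrono::duration_cast<us>(t2 - t1).count(), sum_gf);
}
```

One plausibility note: on Broadwell, `imul` has roughly 3-cycle latency at one per cycle, while an L1-resident load has roughly 4-cycle latency but two can issue per cycle, so a cache-resident table lookup is not obviously slower than a multiply; throughput, not instruction count, is what these loops exercise.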
- Relevant: `4_3_betterbenchmark.cpp` and `5_3_betterbenchmark.cpp` (and the `4_2` and `5_2` files).
  - `DoNotOptimize` combats calculation elision.
    - This function forces the provided value into a register (or memory).
    - To the optimizer this counts as an observable use, apparently; hence any alteration to the value has to actually take place between two calls.
    - See this answer at stackoverflow (a sketch of the commonly cited implementation follows this list).
  - The convention is the same as above.
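The version usually quoted in those stackoverflow answers (essentially how the Google Benchmark library defines `DoNotOptimize` for const references on GCC/Clang; shown here as a sketch) is an empty `asm` statement the optimizer cannot see through:

```cpp
// the empty asm takes `value` as an input the compiler cannot analyse, so
// the value must actually exist in a register or memory at this point; the
// "memory" clobber additionally forces pending stores to complete
template <class T>
inline void DoNotOptimize(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

int main() {
    unsigned x = 1;
    for (int i = 0; i < 1000; ++i) {
        x *= 3;            // cannot be elided or folded away...
        DoNotOptimize(x);  // ...because every intermediate value is "observed"
    }
}
```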
- Also, I tried to manually "unroll the loop", because `5_2` with `-O2` ran much slower than expected.
  - See the code for how I did this (the sketch after this list shows the general shape).
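Schematically, manual unrolling repeats the loop body several times per iteration, trimming the per-element loop overhead (counter update + compare + branch). A guess at the shape, not the repo's actual code:

```cpp
#include <cstddef>
#include <cstdint>

// before: one lookup-accumulate per iteration
std::uint8_t sum_rolled(const std::uint8_t* a, const std::uint8_t* b,
                        std::size_t n, const std::uint8_t mul[256][256]) {
    std::uint8_t s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s ^= mul[a[i]][b[i]];
    return s;
}

// after: four copies of the body per iteration (n assumed divisible by 4)
std::uint8_t sum_unrolled4(const std::uint8_t* a, const std::uint8_t* b,
                           std::size_t n, const std::uint8_t mul[256][256]) {
    std::uint8_t s = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        s ^= mul[a[i]][b[i]];
        s ^= mul[a[i + 1]][b[i + 1]];
        s ^= mul[a[i + 2]][b[i + 2]];
        s ^= mul[a[i + 3]][b[i + 3]];
    }
    return s;
}

int main() {
    static std::uint8_t mul[256][256] = {};  // stand-in table
    std::uint8_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    return sum_rolled(a, b, 8, mul) ^ sum_unrolled4(a, b, 8, mul);
}
```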
- the results are as follows:
- Dramatic change when going from `-O0` to `-O1`.
  - Currently looking at this to see what's responsible for the 4x change in speed.
  - I've tried all the flags listed there, but they don't seem to have a big effect, even when we apply all of them.
  - `g++ -S -fverbose-asm -std=c++17 5_3_betterbenchmark.cpp -o 53`. Then `nano 53`, Ctrl+Q, find "optim".
  - See `53_assemblynotes.md` for an analysis of the assembly code: the assembly alone isn't enough to explain the 4x difference in execution speed.
- I would like to enable only inlining and run the code again, but activating only a few flags (instead of the whole `-O1` set) does not seem to work well: I get a link error. (The GCC manual also notes that most individual `-f` optimization flags have no effect unless an `-O` level is enabled.)
- Initial guess: the two measurements affect each other in some way.
  - I switched the order of execution; this has little to no effect on anything: see `6_3_betterbenchmark.cpp`.
  - The two measurements are pretty much independent!
- Another guess: Instruction-level parallelism?
- Another guess: Better exploitation of the memory hierarchy?
- None of these make logical sense (see the stackoverflow question I wrote below).
- Actually, I might need some input:
- Stackoverflow!
- Consensus from the comments is that out-of-order dynamic dispatch should be involved.
- See the `6_3_*.cpp`s, compile them with the `-O1` flag, and run them: their performance saturates at `6_3_3.cpp`, which unrolls the loop by a factor of 8 (a sketch of why that might be appears below).
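That saturation pattern fits the out-of-order explanation. In a single-accumulator loop, every `^=` depends on the previous one, so the XOR chain serializes even though the table loads themselves are independent. One way a factor-8 unroll could break that chain (hypothetical; I don't know whether `6_3_3.cpp` is written exactly like this) is to give each unrolled slot its own accumulator and fold them at the end; once enough independent chains exist to keep the load ports busy, adding more stops helping, which is where performance would plateau:

```cpp
#include <cstddef>
#include <cstdint>

// unroll by 8 with 8 independent accumulators: the eight XOR chains do not
// depend on each other, so the out-of-order core can keep several table
// loads in flight at once; past this width the loads (not the dependency
// chain) become the bottleneck, and performance saturates
std::uint8_t sum_unrolled8(const std::uint8_t* a, const std::uint8_t* b,
                           std::size_t n, const std::uint8_t mul[256][256]) {
    std::uint8_t s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (std::size_t i = 0; i < n; i += 8) {  // n assumed divisible by 8
        s0 ^= mul[a[i]][b[i]];
        s1 ^= mul[a[i + 1]][b[i + 1]];
        s2 ^= mul[a[i + 2]][b[i + 2]];
        s3 ^= mul[a[i + 3]][b[i + 3]];
        s4 ^= mul[a[i + 4]][b[i + 4]];
        s5 ^= mul[a[i + 5]][b[i + 5]];
        s6 ^= mul[a[i + 6]][b[i + 6]];
        s7 ^= mul[a[i + 7]][b[i + 7]];
    }
    return s0 ^ s1 ^ s2 ^ s3 ^ s4 ^ s5 ^ s6 ^ s7;  // fold the partials
}

int main() {
    static std::uint8_t mul[256][256] = {};  // stand-in table
    std::uint8_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    return sum_unrolled8(a, b, 8, mul);
}
```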