Various implementations of GF(2^n), written in C++17. Intended as practice (or something similar to a lab session).
Build a "vanilla" GF(2^n) module.Use templates, constexpr & various metaprogramming techniques to pre-calculate exponents and logs in compile-time- the way array< T,n > is constructed ensures we have initialized it in compile-time
- but is there a way to deassemble this and see for ourselves?
make sure that we will not compile if n > 16.Find a way to use a constexpr if to changeRep
depending on the length of genpoly- Find a way to check if genpoly is actually irreducible, and print a warning if it isn't. (i.e. if this "field" isn't actually a field)
- Come up with a way to accurately measure performances
- define accurate
Wrap each in namespaceWrite code to measure performance- Compare & suggest a plausible explanation for all observed phonomena.
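A minimal sketch of what these goals could look like in C++17. All names here (`GF2`, `poly_mod`, `make_exp`, `exp8`, etc.) are my own assumptions, not the repo's code; the final `static_assert`s double as a way to "see for ourselves" that the table is built at compile time, since they could not compile otherwise:

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

constexpr unsigned bit_length(unsigned v) {
    unsigned len = 0;
    while (v) { ++len; v >>= 1; }
    return len;
}

// remainder of a / b in GF(2)[x], i.e. carry-less long division
constexpr unsigned poly_mod(unsigned a, unsigned b) {
    for (unsigned db = bit_length(b); bit_length(a) >= db;)
        a ^= b << (bit_length(a) - db);
    return a;
}

// a reducible polynomial of degree n has a factor of degree <= n/2,
// so trial division over all smaller polynomials suffices
template <unsigned n>
constexpr bool irreducible(unsigned genpoly) {
    for (unsigned d = 2; d < (1u << (n / 2 + 1)); ++d)
        if (poly_mod(genpoly, d) == 0) return false;
    return true;
}

template <unsigned n, unsigned genpoly>
struct GF2 {
    static_assert(n >= 1 && n <= 16, "table-based GF(2^n) is only supported up to n = 16");
    // choose the representation from the field size; for a pure type
    // choice, std::conditional_t does what a `constexpr if` would
    using Rep = std::conditional_t<(n <= 8), std::uint8_t, std::uint16_t>;
    static constexpr unsigned order = 1u << n;
};

// exp table for the generator 0x02: exp[i] = x^i mod genpoly
template <unsigned n, unsigned genpoly>
constexpr std::array<typename GF2<n, genpoly>::Rep, (1u << n)> make_exp() {
    std::array<typename GF2<n, genpoly>::Rep, (1u << n)> t{};
    unsigned v = 1;
    for (unsigned i = 0; i < (1u << n); ++i) {
        t[i] = static_cast<typename GF2<n, genpoly>::Rep>(v);
        v <<= 1;                          // multiply by x, i.e. by 0x02
        if (v & (1u << n)) v ^= genpoly;  // reduce: genpoly has bit n set
    }
    return t;
}

constexpr auto exp8 = make_exp<8, 0x11d>();
// if these compile, the table really was evaluated at compile time:
// one way to "see for ourselves" without disassembling anything
static_assert(exp8[0] == 1 && exp8[1] == 2, "exp table is a constant expression");
static_assert(irreducible<8>(0x11d), "0x11d is irreducible over GF(2)");

int main() {}
```

The notes above ask for a warning rather than a hard error on a reducible `genpoly`; evaluating `irreducible<n>(genpoly)` at startup and printing to stderr would do that in place of the final `static_assert`.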
- Everything runs on a GCP e2-medium instance: 2 vCPUs, 4 GB memory, Intel Broadwell CPU.
- In all measurements, the generating polynomial is 0x11d, and the generator (where needed) is 0x02.
- In `4_refactored.cpp`, only the exp and log tables (relative to the generator element 0x02) are calculated at compile time & cached.
- In `5_refactored_allcaches.cpp`, the multiplication and division tables are calculated at compile time & cached (a sketch contrasting the two lookup strategies follows this list).
- The file starting with `4_1_` measures the performance of `4_refactored.cpp`.
- The file starting with `5_1_` measures the performance of `5_refactored_allcaches.cpp`.
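To make the difference concrete, here is a guess at the two multiplication strategies (my reconstruction, not the repo's code; tables are built at runtime here just to keep the sketch short):

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// tables for GF(2^8) with genpoly 0x11d and generator 0x02
static std::array<std::uint8_t, 256> exp_tab{}, log_tab{};
static std::uint8_t mul_tab[256][256];

static void build_tables() {
    unsigned v = 1;
    for (unsigned i = 0; i < 255; ++i) {
        exp_tab[i] = static_cast<std::uint8_t>(v);
        log_tab[v] = static_cast<std::uint8_t>(i);
        v <<= 1;
        if (v & 0x100) v ^= 0x11d;  // multiply by 0x02, reduce mod genpoly
    }
    for (unsigned a = 0; a < 256; ++a)
        for (unsigned b = 0; b < 256; ++b)
            mul_tab[a][b] = (a && b)
                ? exp_tab[(log_tab[a] + log_tab[b]) % 255]
                : 0;
}

// 4_refactored.cpp-style: two loads, an add, a modulo, one more load
static std::uint8_t mul_via_logs(std::uint8_t a, std::uint8_t b) {
    return (a && b) ? exp_tab[(log_tab[a] + log_tab[b]) % 255] : 0;
}

// 5_refactored_allcaches.cpp-style: a single load from a 64 KiB table
static std::uint8_t mul_via_table(std::uint8_t a, std::uint8_t b) {
    return mul_tab[a][b];
}

int main() {
    build_tables();
    // sanity check: the two strategies agree
    std::printf("%d\n", mul_via_logs(0x57, 0x83) == mul_via_table(0x57, 0x83));
}
```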
- the results are as follows:
  - For each entry, timespans are measured in microseconds.
  - The addition results are printed because the calculation is sometimes omitted entirely under the `-O1` or `-O2` flags if we choose not to print it (a minimal illustration follows this list).
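The elision looks like this (hypothetical code, not from the repo): under `-O2` a loop whose result is never observed can be removed outright, so the "measured" time collapses to almost nothing; printing the sum keeps the work alive.

```cpp
#include <cstdio>

int main() {
    long sum = 0;
    for (int i = 0; i < 20'000'000; ++i)
        sum += i ^ (i + 1);  // stand-in for the additions being timed
    // comment out the next line and -O2 may delete the whole loop above,
    // since no observable behaviour depends on `sum`
    std::printf("%ld\n", sum);
}
```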
- I fail to see how 20,000,000 GF(2^n) multiplications+additions can be faster than 20,000,000 `int` multiplications+additions (with the `5_` file and the `-O2` flag on; a schematic of the two loops follows this list).
  - `int` multiplications are handled by one instruction, additions by another.
  - GF(2^n) multiplications take at least one memory access, additions at least one more.
  - Why & how is a memory access faster than an instruction?
  - Have I missed something? Is something being pipelined in one case and not the other?
  - Or have I messed up (possibly by using `std::vector<pair<T, T>>`)?
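To pin down what is being compared, the two loops have roughly this shape (my reconstruction of the benchmark, not the repo's code; `mul_tab` stands in for the cached table and is left zeroed here):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

static std::uint8_t mul_tab[256][256];  // stand-in for the cached table

int main() {
    std::vector<std::pair<std::uint8_t, std::uint8_t>> data(20'000'000);
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = {std::uint8_t(i), std::uint8_t(i >> 8)};

    auto t0 = std::chrono::steady_clock::now();
    unsigned sum_int = 0;
    for (auto& [a, b] : data) sum_int += a * b;         // one imul + one add
    auto t1 = std::chrono::steady_clock::now();
    unsigned sum_gf = 0;
    for (auto& [a, b] : data) sum_gf ^= mul_tab[a][b];  // one load + one xor
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    // print the sums so neither loop can be elided
    std::printf("int: %lld us (sum %u), gf: %lld us (sum %u)\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count(), sum_int,
                (long long)std::chrono::duration_cast<us>(t2 - t1).count(), sum_gf);
}
```

One plausibility note: on Broadwell, `imul` has roughly 3-cycle latency at one per cycle, while an L1-resident load has roughly 4-cycle latency but two can issue per cycle, so a cache-resident table lookup is not obviously slower than a multiply; throughput, not instruction count, is what these loops exercise.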
- Relevant: `4_3_betterbenchmark.cpp` and `5_3_betterbenchmark.cpp` (and the `4_2` and `5_2` files).
  - `DoNotOptimize` combats calculation elision.
    - This function forces the provided value into a register (or memory).
    - To the optimizer this counts as an observable use, apparently; hence any alteration to the value has to actually take place between two calls.
    - See this answer at stackoverflow (a sketch of the commonly cited implementation follows this list).
  - The convention is the same as above.
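The version usually quoted in those stackoverflow answers (essentially how the Google Benchmark library defines `DoNotOptimize` for const references on GCC/Clang; shown here as a sketch) is an empty `asm` statement the optimizer cannot see through:

```cpp
// the empty asm takes `value` as an input the compiler cannot analyse, so
// the value must actually exist in a register or memory at this point; the
// "memory" clobber additionally forces pending stores to complete
template <class T>
inline void DoNotOptimize(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

int main() {
    unsigned x = 1;
    for (int i = 0; i < 1000; ++i) {
        x *= 3;            // cannot be elided or folded away...
        DoNotOptimize(x);  // ...because every intermediate value is "observed"
    }
}
```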
- Also, I tried to manually "unroll the loop", because `5_2` with `-O2` ran much slower than expected.
  - See the code for how I did this (the sketch after this list shows the general shape).
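Schematically, manual unrolling repeats the loop body several times per iteration, trimming the per-element loop overhead (counter update + compare + branch). A guess at the shape, not the repo's actual code:

```cpp
#include <cstddef>
#include <cstdint>

// before: one lookup-accumulate per iteration
std::uint8_t sum_rolled(const std::uint8_t* a, const std::uint8_t* b,
                        std::size_t n, const std::uint8_t mul[256][256]) {
    std::uint8_t s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s ^= mul[a[i]][b[i]];
    return s;
}

// after: four copies of the body per iteration (n assumed divisible by 4)
std::uint8_t sum_unrolled4(const std::uint8_t* a, const std::uint8_t* b,
                           std::size_t n, const std::uint8_t mul[256][256]) {
    std::uint8_t s = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        s ^= mul[a[i]][b[i]];
        s ^= mul[a[i + 1]][b[i + 1]];
        s ^= mul[a[i + 2]][b[i + 2]];
        s ^= mul[a[i + 3]][b[i + 3]];
    }
    return s;
}

int main() {
    static std::uint8_t mul[256][256] = {};  // stand-in table
    std::uint8_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    return sum_rolled(a, b, 8, mul) ^ sum_unrolled4(a, b, 8, mul);
}
```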
- the results are as follows:
- Dramatic change when going from `-O0` to `-O1`.
  - Currently looking at this to see what's responsible for the 4x change in speed.
  - I've tried all the flags listed there, but they don't seem to have a big effect, even when we apply all of them.
  - `g++ -S -fverbose-asm -std=c++17 5_3_betterbenchmark.cpp -o 53`. Then `nano 53`, Ctrl+Q, find "optim".
  - See `53_assemblynotes.md` for an analysis of the assembly code: the assembly alone isn't enough to explain the 4x difference in execution speed.
- I would like to enable only inlining and run the code again, but activating only a few flags (instead of the whole `-O1` set) does not seem to work well: I get a link error. (The GCC manual also notes that most individual `-f` optimization flags have no effect unless an `-O` level is enabled.)
- Initial guess: the two measurements affect each other in some way.
  - I switched the order of execution; this has little to no effect on anything: see `6_3_betterbenchmark.cpp`.
  - The two measurements are pretty much independent!
- Another guess: Instruction-level parallelism?
- Another guess: Better exploitation of the memory hierarchy?
- None of these make logical sense (see the stackoverflow question I wrote below).
- Actually, I might need some input:
- Stackoverflow!
- Consensus from the comments is that out-of-order dynamic dispatch should be involved.
- See the `6_3_*.cpp`s, compile them with the `-O1` flag, and run them: their performance saturates at `6_3_3.cpp`, which unrolls the loop by a factor of 8 (a sketch of why that might be appears below).
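That saturation pattern fits the out-of-order explanation. In a single-accumulator loop, every `^=` depends on the previous one, so the XOR chain serializes even though the table loads themselves are independent. One way a factor-8 unroll could break that chain (hypothetical; I don't know whether `6_3_3.cpp` is written exactly like this) is to give each unrolled slot its own accumulator and fold them at the end; once enough independent chains exist to keep the load ports busy, adding more stops helping, which is where performance would plateau:

```cpp
#include <cstddef>
#include <cstdint>

// unroll by 8 with 8 independent accumulators: the eight XOR chains do not
// depend on each other, so the out-of-order core can keep several table
// loads in flight at once; past this width the loads (not the dependency
// chain) become the bottleneck, and performance saturates
std::uint8_t sum_unrolled8(const std::uint8_t* a, const std::uint8_t* b,
                           std::size_t n, const std::uint8_t mul[256][256]) {
    std::uint8_t s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (std::size_t i = 0; i < n; i += 8) {  // n assumed divisible by 8
        s0 ^= mul[a[i]][b[i]];
        s1 ^= mul[a[i + 1]][b[i + 1]];
        s2 ^= mul[a[i + 2]][b[i + 2]];
        s3 ^= mul[a[i + 3]][b[i + 3]];
        s4 ^= mul[a[i + 4]][b[i + 4]];
        s5 ^= mul[a[i + 5]][b[i + 5]];
        s6 ^= mul[a[i + 6]][b[i + 6]];
        s7 ^= mul[a[i + 7]][b[i + 7]];
    }
    return s0 ^ s1 ^ s2 ^ s3 ^ s4 ^ s5 ^ s6 ^ s7;  // fold the partials
}

int main() {
    static std::uint8_t mul[256][256] = {};  // stand-in table
    std::uint8_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    return sum_unrolled8(a, b, 8, mul);
}
```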