Triplet Challenge's goal is to encourage people to implement a small task in the programming language of their choice, aiming for speed and low memory consumption. It was born from reading this post about the tech challenge of an interview. The point of this experiment is to see how even a simple challenge can be implemented in many different ways, with a wide variety of results. It would be interesting to see how much we can squeeze out of the CPU to get the program to complete its task as fast as possible while maintaining a low memory footprint.
If you have any doubts or suggestions about the challenge, please do not hesitate to open an issue to discuss it.
First, let's define what we consider a triplet for this challenge: a triplet is a tuple of 3 consecutive words in the text. (A rough sketch of one way to count them is shown right after the rules below.)
- The program takes a UTF-8 file in English as the only parameter, e.g. `./triplet_challenge pg2009.txt`.
- The output of the program is the top 3 triplets of words used, along with the number of times they are repeated:

  ```
  this is fine - 1337
  call me maybe - 567
  bring them back - 133
  ```
- Since we assume English, ignore any non-alphanumeric character except `'`. This means `this - is?= fine` should be interpreted as `this is fine`.
- Triplets are case-insensitive. Hence, `CaLL mE mAyBE` and `call me maybe` should be considered the same.
- You can use any programming language, as long as it can be run on Linux. Why Linux? Because the Continuous Integration system we are all going to use runs on a Linux machine, so we all play in the same context. Take into account that languages that require a virtual machine (e.g. Java, C#, Python, JavaScript, Ruby) will most likely incur a non-negligible startup time and RAM overhead. Feel free to tweak the parameters of the virtual machine to optimize that.
- You can't use any 3rd-party library for the task, not even a simple one. That means no `npm`, `pip`, `conan`, or the like. The reasoning for this is that the task is small enough to be done with what the language offers, and this way others can easily understand what's happening under the hood in the vanilla version of the language, without any extra knowledge of frameworks or libraries.
- The file we will use to benchmark our program is the free book *The Origin of Species by Means of Natural Selection* by Charles Darwin. The file is already included in the repo, so there is no need to download it separately. Its size is 1270822 bytes, containing 212199 words and 157858 triplets. The output of `triplet_challenge` using this file in my solution is:

  ```
  of the same - 320
  the same species - 130
  conditions of life - 125
  ```
- Do not make any assumptions regarding the content or size of the file.
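To make the rules concrete, here is a minimal, unoptimized C++ sketch of one possible way to count triplets. It is not the reference solution, and it assumes one particular reading of the normalization rule (ignored characters act as word separators):

```cpp
// Naive sketch: read the file, normalize, split into words, count every
// window of 3 consecutive words in a hash map, print the top 3.
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " <file>\n";
        return 1;
    }

    // Read the whole file into memory. The stream is only closed when main ends.
    std::ifstream file(argv[1]);
    std::stringstream buffer;
    buffer << file.rdbuf();
    std::string text = buffer.str();

    // Lowercase everything and treat any character that is not alphanumeric or
    // an apostrophe as a word separator (one possible reading of the rules).
    for (char& c : text) {
        unsigned char u = static_cast<unsigned char>(c);
        c = (std::isalnum(u) || c == '\'') ? static_cast<char>(std::tolower(u)) : ' ';
    }

    // Split into words.
    std::vector<std::string> words;
    std::istringstream iss(text);
    for (std::string word; iss >> word;) {
        words.push_back(word);
    }

    // Count every window of 3 consecutive words.
    std::unordered_map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i + 2 < words.size(); ++i) {
        counts[words[i] + ' ' + words[i + 1] + ' ' + words[i + 2]]++;
    }

    // Pick the top 3 triplets by count.
    std::vector<std::pair<std::string, std::size_t>> top(counts.begin(), counts.end());
    std::size_t n = std::min<std::size_t>(3, top.size());
    std::partial_sort(top.begin(), top.begin() + n, top.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    for (std::size_t i = 0; i < n; ++i) {
        std::cout << top[i].first << " - " << top[i].second << '\n';
    }
    return 0;
}
```

Entries are free to interpret the normalization details and choose data structures differently; the interesting part of the challenge is making a pipeline like this fast and memory-frugal.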
- Fork this repo.
- Sign in to Travis CI using your GitHub credentials. Enable the `triplet_challenge` project in the Travis CI repositories list so that it starts tracking the project and building new commits. If you don't see your `triplet_challenge` project there yet, try the `Sync account` button on the top left.
- Create a new branch for your work, e.g. `git checkout -b java`. Do not commit to the master branch, so that you can easily fetch updates from the original repo, rebase, etc. This also makes it easier for others to add your remote to their already-existing repo and check out your work.
- Implement your solution in the language of your choice. Use a different directory for unit tests and everything else, since the `src` directory is the one that will be used (by default) to count the lines of code of your solution. Please ensure you license all your code under the GPLv3 license and give credit if you are modifying an existing solution. One of this exercise's goals is that people can learn from others' code and improve it, without taking unfair advantage of it.
- Make sure you close the file `pg2009.txt` only at the very end of your application, since this will be used to benchmark your algorithm.
- Test your solution against the `pg2009.txt` file, write unit tests if needed, etc.
- Once you are happy with your solution, modify the `.travis.yml` file that Travis CI will use to build and benchmark your solution (see the sketch after these steps). Take into account that all our solutions will use the same hardware running Ubuntu 16.04 on a 2-core Intel(R) Xeon(R) CPU @ 2.30GHz with 7.5GB of RAM. You may need a few attempts to get the Travis CI configuration right if you have never used it before. I'd suggest using an Ubuntu 16.04 Docker container to test which packages need to be installed.
- Push your new branch to your GitHub fork.
- Check the build on https://travis-ci.org/[user]/triplet_challenge/builds. Pay attention to the log to see whether it succeeded or not and check out the benchmark results.
- If you want to participate in the public ladder, create a PR to the original `triplet_challenge` repo or open an issue with all the details needed to add a new row with your results.
- Feel free to have a different solution for each category in the ladder. The easiest way is to create a new branch for each category you want to compete in.
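As a purely hypothetical example (not taken from this repo), the build-related portion of `.travis.yml` for a plain C++ solution might look like the sketch below; only the `script` and `addons:apt:packages` parts should ever change, and the package list is just a placeholder for whatever your toolchain needs:

```yaml
# Hypothetical sketch only: adapt the compiler and packages to your own solution.
addons:
  apt:
    packages:
      - g++          # placeholder: whatever your toolchain requires
script:
  - g++ -O3 -std=c++14 -o triplet_challenge src/main.cpp
# The after_success section shipped with the repo must be kept untouched.
```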
The `script` part of Travis CI is the one that compiles your program. This is the part you should modify, along with the `addons:apt:packages` that you need to install on Ubuntu. The `after_success` part should never be modified, since it ensures we all execute the benchmark with the same conditions. This part does the following:
- Report the CPU info, in case it ever changes.
- Count the number of lines of code in the `src` directory using cloc.
- Benchmark the speed using hyperfine. We run the application 3 times before the benchmark to ensure the IO is cached (warm-up runs) and then compute the average of the next 30 runs. In this step we inject the `triplet_challenge_stats` library to hijack the `open` calls, so that we measure the time your solution takes without considering the time it takes to boot the VM, initialize the runtime, etc. A previous implementation stopped the timer in the `close` function, but now the timer stops when the shared library is unloaded, to make sure we all play by the same rules (e.g. Rust and its `std::fs::read_to_string` opens the file, reads its contents and closes it right away).
- Benchmark the memory usage using GNU time. We consider the `Maximum resident set size` as the value to compare against. As a side note, Valgrind Massif has proven not to work well (or even work at all) with a wide variety of languages that use a VM, such as NodeJS, C#, and Python. Even with languages with no VM and a small runtime, such as Rust, it doesn't seem to work properly.
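For the curious, the timing injection relies on the standard `LD_PRELOAD` interposition technique. The snippet below is a rough illustration of the idea, not the actual `triplet_challenge_stats` code that lives in this repo: the first intercepted `open` call starts a timer, and a destructor that runs when the shared library is unloaded stops it.

```cpp
// Rough illustration of LD_PRELOAD interposition, NOT the actual
// triplet_challenge_stats implementation. Build as a shared library, e.g.:
//   g++ -shared -fPIC -o libstats.so stats.cpp -ldl
// and inject it with: LD_PRELOAD=./libstats.so ./triplet_challenge pg2009.txt
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for RTLD_NEXT
#endif
#include <cstdio>
#include <ctime>
#include <dlfcn.h>

static timespec g_start;

extern "C" int open(const char* path, int flags, ...) {
    // The mode argument used with O_CREAT is ignored here for brevity.
    using open_fn = int (*)(const char*, int, ...);
    static open_fn real_open = nullptr;
    if (!real_open) {
        // Look up the real open() and start the timer on the first call.
        real_open = reinterpret_cast<open_fn>(dlsym(RTLD_NEXT, "open"));
        clock_gettime(CLOCK_MONOTONIC, &g_start);
    }
    return real_open(path, flags);
}

// Runs when the shared library is unloaded (at process exit): stop the timer.
__attribute__((destructor)) static void report_elapsed() {
    timespec end;
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ms = (end.tv_sec - g_start.tv_sec) * 1e3 +
                (end.tv_nsec - g_start.tv_nsec) / 1e6;
    std::fprintf(stderr, "elapsed: %.2f ms\n", ms);
}
```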
There are 3 main areas that I think would be interesting to compare: speed, memory, and lines of code (without comments). The ladders below rank the solutions by each of these in turn:
# | User | Language | Algorithm/Process Time | Memory | LOC | Source | Build |
---|---|---|---|---|---|---|---|
1 | pausan | C | 6.65/7.6 ms | 4.91 MiB | 650 | Repo | Build |
2 | caragones | C++ | 21.5/26.8 ms | 7.79 MiB | 377 | Repo | Build |
3 | pausan | Rust | 21.98/23.3 ms | 8.22 MiB | 205 | Repo | Build |
4 | pamarcos | Rust | 23.32/24.5 ms | 9.28 MiB | 141 | Repo | Build |
5 | caragones | C++ | 31.0/33.8 ms | 9.86 MiB | 442 | Repo | Build |
6 | pausan | Go | 37.2 ms | 15.84 MiB | 148 | Repo | Build |
7 | pamarcos | C++ | 102.03/104.2 ms | 24.37 MiB | 207 | Repo | Build |
8 | pamarcos | C++ | 118.43/121.2 ms | 12.81 MiB | 162 | Repo | Build |
9 | pausan | Go | 124.8 ms | 19.37 MiB | 84 | Repo | Build |
10 | pausan | Nim | 173.9 ms | 21.64 MiB | 70 | Repo | Build |
11 | pausan | Dart (native) | 176.4 ms | 56.6.64 MiB | 77 | Repo | Build |
12 | pausan | NodeJS | 295.8 ms | 88.772 MiB | 140 | Repo | Build |
13 | pausan | NodeJS | 363.5 ms | 70.728 MiB | 78 | Repo | Build |
14 | pausan | Python | 469.5 ms | 42.690 MiB | 66 | Repo | Build |
15 | javsanchez | C# | 848.6 ms | - | 100 | Repo | Build |
16 | javsanchez | Python | 1199 ms | 8.425 MiB | 47 | Repo | Build |
# | User | Language | Memory | Algorithm/Process Time | LOC | Source | Build |
---|---|---|---|---|---|---|---|
1 | pausan | C | 4.91 MiB | 6.65/7.6 ms | 650 | Repo | Build |
2 | caragones | C++ | 7.79 MiB | 21.5/26.8 ms | 377 | Repo | Build |
3 | pausan | Rust | 8.22 MiB | 21.98/23.3 ms | 205 | Repo | Build |
4 | javsanchez | Python | 8.425 MiB | 1199 ms | 47 | Repo | Build |
5 | pamarcos | Rust | 9.28 MiB | 23.32/24.5 ms | 141 | Repo | Build |
6 | caragones | C++ | 9.86 MiB | 31.0/33.8 ms | 442 | Repo | Build |
7 | pamarcos | C++ | 12.81 MiB | 118.43/121.2 ms | 162 | Repo | Build |
8 | pausan | Go | 15.84 MiB | 37.2 ms | 148 | Repo | Build |
9 | pausan | Go | 19.37 MiB | 124.8 ms | 84 | Repo | Build |
10 | pausan | Nim | 21.64 MiB | 173.9 ms | 70 | Repo | Build |
11 | pamarcos | C++ | 24.37 MiB | 102.03/104.2 ms | 207 | Repo | Build |
12 | pausan | Python | 42.690 MiB | 469.5 ms | 66 | Repo | Build |
13 | pausan | Dart (native) | 56.6.64 MiB | 176.4 ms | 77 | Repo | Build |
14 | pausan | NodeJS | 70.728 MiB | 363.5 ms | 78 | Repo | Build |
15 | pausan | NodeJS | 88.772 MiB | 295.8 ms | 140 | Repo | Build |
16 | javsanchez | C# | - | 848.6 ms | 100 | Repo | Build |
# | User | Language | LOC | Algorithm/Process Time | Memory | Source | Build |
---|---|---|---|---|---|---|---|
1 | luismedel | C# | 11 | N/A | N/A | Repo | N/A |
2 | javsanchez | Python | 47 | 1199 ms | 8.425 MiB | Repo | Build |
3 | pausan | Python | 66 | 469.5 ms | 42.690 MiB | Repo | Build |
4 | pausan | Nim | 70 | 173.9 ms | 21.64 MiB | Repo | Build |
5 | pausan | Dart (native) | 77 | 176.4 ms | 56.6.64 MiB | Repo | Build |
6 | pausan | NodeJS | 78 | 363.5 ms | 70.728 MiB | Repo | Build |
7 | pausan | Go | 84 | 124.8 ms | 19.37 MiB | Repo | Build |
8 | javsanchez | C# | 100 | 848.6 ms | - | Repo | Build |
9 | pausan | NodeJS | 140 | 295.8 ms | 88.772 MiB | Repo | Build |
10 | pamarcos | Rust | 141 | 23.32/24.5 ms | 9.28 MiB | Repo | Build |
11 | pausan | Go | 148 | 37.2 ms | 15.84 MiB | Repo | Build |
12 | pamarcos | C++ | 162 | 118.43/121.2 ms | 12.81 MiB | Repo | Build |
13 | pausan | Rust | 205 | 21.98/23.3 ms | 8.22 MiB | Repo | Build |
14 | pamarcos | C++ | 207 | 102.03/104.2 ms | 24.37 MiB | Repo | Build |
15 | caragones | C++ | 377 | 21.5/26.8 ms | 7.79 MiB | Repo | Build |
16 | caragones | C++ | 442 | 31.0/33.8 ms | 9.86 MiB | Repo | Build |
17 | pausan | C | 650 | 6.65/7.6 ms | 4.91 MiB | Repo | Build |
GPLv3
Copyright (C) 2020 Pablo Marcos Oltra
triplet_challenge is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
triplet_challenge is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with triplet_challenge. If not, see http://www.gnu.org/licenses/.