This repository has been archived by the owner on Nov 11, 2018. It is now read-only.

Parsing is slow relative to LLVM #6

Closed
pwaller opened this issue Oct 8, 2018 · 18 comments

Comments

@pwaller

pwaller commented Oct 8, 2018

I have a 44kloc input ir/testdata/coreutils/vdir.ll, and github.com/mewmew/l/asm.Parse(input) is taking 8 CPU-seconds (4s wall). opt -verify < ir/testdata/coreutils/vdir.ll takes 250ms (280ms wall).

The relative slowdown is ~30x in CPU time.

Is performance a goal of this project? I'm concerned about what this will look like when I feed it larger inputs.

@pwaller
Author

pwaller commented Oct 8, 2018

Interesting: turning the garbage collector off with GOGC=off reduces CPU time by half, and wall-clock time by 12%.

70% of CPU time is spent in mapassign; as best I can tell, it is simply executing this line:

https://github.com/mewmew/l/blob/2f2fc8d9956b16f583893cb6bf27b02311d285b2/ir/irutil/walk.go#L95
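
For context, a plausible sketch of the walker pattern that would make mapassign this hot, assuming the walker records visited nodes in a map to break cycles (names here are illustrative, not the actual irutil API):

package irutil

// walk visits root and every node reachable from it, recording each node in
// the visited map to avoid infinite recursion on cyclic structures (e.g.
// recursive type definitions).
func walk(root interface{}, visited map[interface{}]bool, visit func(interface{})) {
    if visited[root] {
        return
    }
    // The hot mapassign: one map insert per node, repeated every time the
    // module is re-walked during resolution.
    visited[root] = true
    visit(root)
    // ... recurse into the children of root.
}

If resolution re-walks the module once per identifier, those map inserts dominate the profile.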

@pwaller
Author

pwaller commented Oct 8, 2018

If I turn module resolution off (the bulk of the code in Translate()):

real    0m0.898s
user    0m1.623s
sys     0m0.509s

If I turn module resolution off and set GOGC=off:

real    0m0.790s
user    0m0.512s
sys     0m0.127s

@mewmew
Member

mewmew commented Oct 8, 2018

Hi @pwaller,

Thanks for the feedback! It's great that you've started to profile and locate some of the performance bottlenecks. Yes, map access is a common concern, and it can often be addressed rather easily. A similar case was identified in a hand-written lexer that was part of llir/llvm before we switched to the gocc-generated one. @quarnster submitted llir/llvm#11, which improved performance quite substantially by using binary search instead of a naive search.
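
For reference, the general technique there is a binary search over a sorted keyword table; a minimal sketch (the actual keyword set and lexer integration in llir/llvm#11 differ):

package lexer

import "sort"

// keywords must be kept sorted for binary search to work.
var keywords = []string{"add", "br", "call", "define", "icmp", "ret"}

// isKeyword reports whether s is a keyword, using sort.SearchStrings (binary
// search) instead of a linear scan over the table.
func isKeyword(s string) bool {
    i := sort.SearchStrings(keywords, s)
    return i < len(keywords) && keywords[i] == s
}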

I think there is also other low-hanging fruit for improving parsing performance. The main item is the parser generated by gocc, as it currently contains a huge state space. One proposed solution to this problem is to implement Pager's practical general method for constructing LR(k) parsers. This is tracked by goccmack/gocc#34.

Is performance a goal of this project? I'm concerned about what this will look like when I feed it larger inputs.

The goal of this project is to be roughly on par with the performance of the official LLVM project. Perhaps a ~2x slowdown is OK, but definitely not a >= 10x slowdown.

That being said, the immediate goal is to complete llir/llvm#29 and merge mewmew/l back into llir/llvm. Then once llir/llvm supports read/write of the entire LLVM IR language, we will switch focus to performance optimizations.

Cheers,
Robin

@mewmew
Member

mewmew commented Oct 12, 2018

I'm currently evaluating Textmapper instead of Gocc for generating the lexer and parser; hopefully this will help improve performance.

In inspirer/textmapper#6 (comment) @inspirer mentions the following:

On real-world languages, generated parsers in Go gave me ~100-230MB/sec of lexing throughput and 20-60MB/sec of parsing throughput. It gets slightly better with each Go release, mostly because of improved register allocation within the Go compiler.

The evaluation is tracked in this repo: https://github.com/mewmew/l-tm

@mewmew
Member

mewmew commented Oct 13, 2018

The grammar has now been rewritten in Textmapper (no semantic actions yet, though), and the initial performance results are great!

As a benchmark I tried parsing the LLVM IR assembly produced when compiling Coreutils with Clang. The LLVM IR files are located at https://github.com/decomp/testdata/tree/master/coreutils/testdata

And the benchmark results are reported at https://github.com/mewmew/l-tm/blob/master/BENCH.md

Parsing 1,733,842 lines and 135 MB of LLVM IR assembly, as contained in the 107 source files at decomp/testdata, took ~3 seconds; that is, ~30 ms per file, or ~45 MB/s.

Note, these results are from shift/reduce parsing only; no semantic actions are yet used to produce the AST, resolve identifiers, do type checking, etc. That is yet to be done. But as far as parsing goes, I think Textmapper is a good choice for performance.

@mewmew
Member

mewmew commented Oct 15, 2018

@pwaller as of mewspring/l-tm#1 we are now able to construct an AST for the entire LLVM IR language as defined by LLVM 7.0 (except for modules with SummaryEntry, which I have yet to see in the wild; if you have one, please send it my way!).

The step to translate the AST to the IR representation is yet to be done.

However, this gives us an indication of the performance we can expect.

Official LLVM:

u@x1 ~/D/g/s/g/m/l/c/l-tm> time opt -verify < testdata/vdir.ll > /dev/null
real 0.25
user 0.23
sys 0.01

llir using parser generated by Textmapper:

u@x1 ~/D/g/s/g/m/l/c/l-tm> time l-tm testdata/vdir.ll > /dev/null
real 0.21
user 0.19
sys 0.06

To replicate these results, do as follows:

Download and install Textmapper. The tool is currently written in Java and generates parsers in Go (similar to ANTLR). @inspirer is working on porting the Textmapper tool to Go (see inspirer/textmapper#6).

$ go get -v -u github.com/inspirer/textmapper
$ cd $GOPATH/src/github.com/inspirer/textmapper/tm-tool
$ ant deploy

go get mewmew/l-tm and generate the LLVM IR parser from the grammar using Textmapper.

$ go get -v -u github.com/mewmew/l-tm
$ cd $GOPATH/src/github.com/mewmew/l-tm/asm/ll
$ make
lalr: 0.286s, text: 0.541s, parser: 2163 states, 186KB
$ go get -v -u github.com/mewmew/l-tm/cmd/l-tm

Now, use l-tm to parse LLVM IR assembly files produced by the LLVM 7.0 toolchain.

$ time l-tm coreutils/vdir.ll 
=== [ coreutils/vdir.ll ] =======================

took: 184.894959ms

entity *ast.SourceFilename: source_filename = "llvm-link"
   name: "llvm-link"
entity *ast.TargetDataLayout: target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
   datalayout: "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
...

real	0m0.727s
user	0m0.194s
sys	0m0.190s

And without printouts:

$ time l-tm coreutils/vdir.ll > /dev/null

real	0m0.207s
user	0m0.175s
sys	0m0.075s

@pwaller
Author

pwaller commented Oct 15, 2018

Impressive, great work! I watch with interest. I assume this is without any kind of resolution?

Do you have a plan for attacking the symbol resolution problem?

I have this sense that the current code must be descending into nodes far more often than necessary. Also, because of the use of interfaces it must be doing a lot of work. I'd be interested to hack together some generated code which does this as a side project; I might have some time for that on Friday the 19th, perhaps.

@mewmew
Member

mewmew commented Oct 15, 2018

Impressive, great work! I watch with interest. I assume this is without any kind of resolution?

Indeed, this is before resolution.

Do you have a plan for attacking the symbol resolution problem?

I'm open to suggestions :) How would you attack the problem?

Basically, I learn as I go. I've learned a great deal about LR, LALR, Pager's LR(k), etc. these last few weeks. Textmapper is more or less precisely what I was hoping for: basically ANTLR, but using LR instead of LL and Go instead of Java.

I have this sense that the current code must be descending into nodes far more often than necessary. Also, because of the use of interfaces it must be doing a lot of work. I'd be interested to hack together some generated code which does this as a side project; I might have some time for that on Friday the 19th, perhaps.

Definitely. I'd love to hack on this with you. Friday the 19th, it's a date!

@mewmew
Member

mewmew commented Oct 15, 2018

@pwaller We now have type resolution of type definitions (in the translate branch).

I've added a -types flag to enable/disable type resolution of type definitions.

Note: I'm translating to the IR of llir/l which is basically a merge between llir/llvm/ir and mewmew/l/ir. The performance should be similar regardless of which of these three we translate to.

The preliminary results are as follows:

Type resolution of type definitions enabled

$ time l-tm -types=true testdata/vdir.ll
=== [ testdata/vdir.ll ] =======================

parsing into AST took: 172.208673ms

type resolution of type definitions took: 4.56632ms

total time for file "testdata/vdir.ll": 176.821143ms
real 0.18
user 0.17
sys 0.05

$ time l-tm -types=true testdata/vdir.ll
=== [ testdata/vdir.ll ] =======================

parsing into AST took: 162.403211ms

type resolution of type definitions took: 4.881672ms

total time for file "testdata/vdir.ll": 167.355645ms
real 0.17
user 0.17
sys 0.03

$ time l-tm -types=true testdata/vdir.ll
=== [ testdata/vdir.ll ] =======================

parsing into AST took: 183.710061ms

type resolution of type definitions took: 5.144666ms

total time for file "testdata/vdir.ll": 188.905968ms
real 0.19
user 0.18
sys 0.05

Type resolution of type definitions disabled

$ time l-tm -types=false testdata/vdir.ll
=== [ testdata/vdir.ll ] =======================

parsing into AST took: 187.108077ms

total time for file "testdata/vdir.ll": 187.143403ms
real 0.19
user 0.18
sys 0.05

$ time l-tm -types=false testdata/vdir.ll
=== [ testdata/vdir.ll ] =======================

parsing into AST took: 186.322634ms

total time for file "testdata/vdir.ll": 186.357051ms
real 0.19
user 0.19
sys 0.05

$ time l-tm -types=false testdata/vdir.ll
=== [ testdata/vdir.ll ] =======================

parsing into AST took: 182.948167ms

total time for file "testdata/vdir.ll": 182.98329ms
real 0.19
user 0.19
sys 0.03

Running on all 107 LLVM IR files of Coreutils

Type resolution of type definitions enabled

$ time l-tm -types=true testdata/*.ll > /dev/null
real 3.81
user 4.09
sys 0.15

$ time l-tm -types=true testdata/*.ll > /dev/null
real 3.85
user 4.18
sys 0.12

$ time l-tm -types=true testdata/*.ll > /dev/null
real 3.85
user 4.12
sys 0.17

Type resolution of type definitions disabled

$ time l-tm -types=false testdata/*.ll > /dev/null
real 3.74
user 4.05
sys 0.12

$ time l-tm -types=false testdata/*.ll > /dev/null
real 3.72
user 4.05
sys 0.09

$ time l-tm -types=false testdata/*.ll > /dev/null
real 3.72
user 4.05
sys 0.11

@pwaller
Author

pwaller commented Oct 16, 2018

😮 nice! 💯

Is this now doing more or less equivalent work as when I made the original report, or are there any other major bits missing?

@mewmew
Member

mewmew commented Oct 16, 2018

Is this now doing more or less equivalent work as when I made the original report, or are there any other major bits missing?

Oh no, there is much work left to do. It's a start though :) Roughly, this is the list of things to do to translate the AST to IR:

For entire module

Identifier resolution and translation in module scope.

  • identifier resolution

    • type name resolution (local identifiers used in module scope; e.g. %t)
    • global variable and function name resolution (global identifiers used in module scope; e.g. @foo, @42)
    • function attribute group ID resolution (attribute group IDs used in module scope; e.g. #42)
    • Comdat name resolution (Comdat identifiers used in module scope; e.g. $foo)
    • named metadata resolution (metadata names used in module scope; e.g. !foo)
    • metadata ID resolution (metadata IDs used in module scope; e.g. !42)
  • translation from AST to IR representation

    • translation of type definitions
    • translation of global declarations and definitions
    • translation of constants
      • note to self: the blockaddress constant must be translated after basic blocks have been added to functions, and local variables have been assigned names. However, since blockaddress may be used within any function, this may create a dependency relationship that will be a bit tricky to solve in a simple fashion when translating from AST to IR. Will have to give it some thought.
      • a potential solution:
        1. index global identifiers (global variables and functions)
        2. translate function bodies (handle blockaddress with care, as detailed below)
          • translate blockaddress to &ir.ConstBlockAddress{Func: f, Block: &ir.BasicBlock{LocalName: local(old.Name())}}, the point being that we create a dummy IR basic block, that will record the basic block name used in the blockaddress constant. Then, once function bodies have been translated, run f.AssignIDs() to assign local IDs to unnamed instruction results, basic blocks and function parameters. As far as I can tell, this would solve the dependency relationship mentioned above (see the sketch after this list).
        3. translate global init constants
    • translation of constant expressions
    • translation of function headers
    • translation of named metadata definitions
    • translation of metadata definitions
    • translation of specialized metadata nodes (note: the final IR representation of this data (e.g. DWARF metadata) is not yet known. We would like to find an IR representation that is easy to work with. Any suggestions are warmly appreciated)
    • translation of indirect symbols (alias definitions and IFuncs)
    • translation of source filename, target definitions, module-level inline assembly, Comdat definitions, attribute group definitions, use-list order directives, inline assembly values (note: all of these are small and may be completed very quickly)

For each function

Identifier resolution and translation in function scope.

  • identifier resolution

    • function parameter, result of instruction and basic block resolution (local identifiers used in function scope; e.g. %foo, %42)
  • translation from AST to IR representation

    • translation of basic blocks
    • translation of instructions
    • translation of terminators
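
Going back to the blockaddress item above, a minimal sketch of the dummy basic block idea, using simplified stand-in types (the real llir types, the local helper and AssignIDs differ; the fix-up-by-name pass at the end is my assumption about how the dummy eventually gets resolved):

package translate

// BasicBlock, Func and ConstBlockAddress are simplified stand-ins for the IR
// types discussed above; the real llir types carry more fields.
type BasicBlock struct {
    LocalName string
}

type Func struct {
    Blocks []*BasicBlock
}

type ConstBlockAddress struct {
    Func  *Func
    Block *BasicBlock // dummy at first; only carries the referenced block name
}

// translateBlockAddress records the referenced basic block by name only, via a
// dummy block, so function bodies can be translated before global init
// constants.
func translateBlockAddress(f *Func, blockName string) *ConstBlockAddress {
    return &ConstBlockAddress{Func: f, Block: &BasicBlock{LocalName: blockName}}
}

// fixupBlockAddress runs after function bodies have been translated and local
// IDs assigned: it swaps the dummy for the real basic block of the enclosing
// function, matched by local name.
func fixupBlockAddress(c *ConstBlockAddress) {
    for _, b := range c.Func.Blocks {
        if b.LocalName == c.Block.LocalName {
            c.Block = b
            return
        }
    }
}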

As you can see, there are a lot of steps, and any of them may become the bottleneck if implemented poorly. I'm fairly certain that was the case last time around. Given your profiling, it seems identifier resolution killed the performance by walking the AST too many times (and the AST walker tracked visited nodes to avoid infinite recursion).

So, any ideas on how to implement this better are also warmly welcome.
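
As for avoiding repeated walks during identifier resolution, a minimal sketch of step 1 of the potential solution above (index global identifiers in a single pass), using a hypothetical stand-in type; later passes would then resolve global names by map lookup instead of another walk:

package translate

import "fmt"

// astGlobal is a simplified stand-in for an AST-level global variable
// declaration or function definition.
type astGlobal struct {
    Name string
}

// indexGlobals visits the top-level entities of a module exactly once and
// records each global identifier, so later passes can resolve global names
// with a map lookup instead of re-walking the AST.
func indexGlobals(globals []*astGlobal) (map[string]*astGlobal, error) {
    index := make(map[string]*astGlobal, len(globals))
    for _, g := range globals {
        if _, ok := index[g.Name]; ok {
            return nil, fmt.Errorf("global identifier %q defined twice", g.Name)
        }
        index[g.Name] = g
    }
    return index, nil
}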

@mewmew
Member

mewmew commented Oct 16, 2018

@pwaller would you like to participate in the future development of llir/llvm more closely? I can see from your previous involvement that you are quite persistent, which is required for a project of this size and ambition.

I would be glad to invite you to the llir GitHub organization, and we can bounce some ideas on how to move things forward, and more importantly what the end goals of the library should be. I'd still be very interested to hear your big ideas :)

Also, do note that I don't expect you to work on this more than you want to, and I understand that you have commitments in personal life, work, time away from the screen, etc. We all do :) So, please consider this a hobby project with a serious long-term ambition. I would be very glad to invite you on board, should you choose to join!

Cheerful regards,
Robin

Edit: the big list above is not meant to scare you. I'll go ahead and implement most items in the weeks/months to come (so the invitation above is not to be interpreted as: hey, I've got this huge list, would you like to do it for me?). I have a few projects in the works that would love to have this well implemented, so I'll have to iterate a few times.

@pwaller
Author

pwaller commented Oct 16, 2018

Thanks for the very nicely written message. I am uncertain how much I'll be able to contribute, other than by doing as I am already doing. Let's have a chat on Friday! :)

@mewmew
Member

mewmew commented Oct 16, 2018

Let's have a chat on Friday! :)

Sounds good! When may I catch you on Friday?

@pwaller
Author

pwaller commented Oct 17, 2018

I sent you a message about it on the 11th around 0900 UTC and 15th around 1700 UTC. Might they have gotten stuck in a spam filter?

@mewmew
Member

mewmew commented Oct 17, 2018

I sent you a message about it on the 11th around 0900 UTC and 15th around 1700 UTC. Might they have gotten stuck in a spam filter?

Sent a reply :)

mewmew added a commit to mewspring/l-tm that referenced this issue Oct 18, 2018
ref: mewspring/mewmew-l#6 (comment)

Translate blockaddress to `&ir.ConstBlockAddress{Func: f,
Block: &ir.BasicBlock{LocalName: local(old.Name())}}`, the
point being that we create a dummy IR basic block, that
will record the basic block name used in the blockaddress
constant. Then, once function bodies have been translated,
run `f.AssignIDs()` to assign local IDs to unnamed
instruction results, basic blocks and function parameters.
As far as I can tell, this would solve the dependency
relationship mentioned above.

Updates #6.
@mewmew
Member

mewmew commented Nov 9, 2018

Note: this issue has been superseded by llir/llvm#34.

@mewmew mewmew closed this as completed Nov 9, 2018
@mewmew
Member

mewmew commented Nov 10, 2018

We managed to achieve a ~1.6x slowdown compared to official LLVM. This is before trying to profile and further optimize llir/llvm. The main performance gains come from using Textmapper instead of Gocc, and of course from not relying on the IR walker (and its expensive map access) to handle translation from AST to IR.

Relatedly, the dev branch of llir/llvm has now surpassed the capabilities of llir/l, mewmew/l and the old llir/llvm master branch. Therefore, and to avoid confusion in future development, all repos except llir/llvm have been deprecated (as was the intention all along: to keep them experimental until we had gained enough knowledge to integrate back into llir/llvm). As such, these repos have been marked as archived (and moved to mewspring).
