Performance issue w/ new `Text` instances #9

hvr · 2019-12-08T22:21:10Z

When I tried benchmarking, I noticed a strange and significant performance-penalty effect that is exposed by the Text support recently merged into regex-tdfa. (The String and ByteString instances don't exhibit this performance effect)

It turns out that the regex-tdfa-text package already suffered from this but nobody seems to have reported it yet and it's probably not too apparent in simple regex-matching applications. Since the Text support is fundamentally more or less the same code as the ByteString implementation, I'm suspecting a weird interaction with the fusion-rules in text.

TODO: Provide repro-case

The text was updated successfully, but these errors were encountered:

Xitian9 · 2020-08-15T01:22:57Z

Knowing nothing about what causes this issue, something to note is that regex-tdfa supplies SPECIALIZE pragmas for String, Seq Char, and both strict and lazy ByteString. Notably, it does not provide SPECIALIZE pragmas for strict or lazy Text.

With that it mind, do you still see this issue after using the code in PR #8?

hvr · 2020-08-15T07:56:45Z

Knowing nothing about what causes this issue, something to note is that `regex-tdfa` supplies `SPECIALIZE` pragmas for `String`, both strict and lazy `ByteString`, and `Seq Char`. Notably, it does not provide `SPECIALIZE` pragmas for strict or lazy `Text`. With that it mind, do you still see this issue after using the code in PR #8?

I'm afraid so. I didn't have time yet to investigate more yet, but maybe you can? The code below clearly shows the slowdown; its output for me is ``` == ByteString 92 email len : 0.575371484s () email rnf : 0.589730101s == String 92 email len : 0.527972566s () email rnf : 0.538322764s == Text 92 email len : 0.542759858s () email rnf : 29.904703049s ``` Counting the matches is equally fast ("email len"); the trouble starts when actually generating the match strings ("email rnf"), showing a huge 60-fold regression for the email regexp. I commented out the other test-regexps as the email-regexp is enough to demonstrate the issue very clearly: ```haskell {-# LANGUAGE BangPatterns #-} {-# LANGUAGE FlexibleContexts #-} {-# LANGUAGE ScopedTypeVariables #-} {-# OPTIONS_GHC -Wall #-} {- cabal: build-depends: time, text, deepseq, bytestring, regex-tdfa, regex-base -} module Main where import Control.DeepSeq import Control.Exception import Control.Monad () import qualified Data.ByteString as BS import Data.Text (Text) import qualified Data.Text as T import qualified Data.Text.Encoding as T import qualified Data.Text.IO as T import Data.Time.Clock.POSIX import Text.Regex.TDFA main :: IO () main = do !txt <- T.readFile "input-text.txt" putStrLn "== ByteString" !bs <- evaluate (T.encodeUtf8 txt) doBench (bs :: BS.ByteString) putStrLn "== String" str <- evaluate (force (T.unpack txt)) doBench (str :: [Char]) putStrLn "== Text" doBench (txt :: Text) {-# NOINLINE doBench #-} doBench :: forall text . (Show text, NFData text, RegexLike Regex text) => text -> IO () doBench txt = do timeit "email len" $ print (length (getAllTextMatches (txt =~ reEmail) :: [text])) -- timeit "URI len" $ print (length (getAllTextMatches (txt =~ reURI) :: [text])) -- timeit "IPv4 len" $ print (length (getAllTextMatches (txt =~ reIPv4) :: [text])) timeit "email rnf" $ print (rnf (getAllTextMatches (txt =~ reEmail) :: [text])) -- timeit "URI rnf" $ print (rnf (getAllTextMatches (txt =~ reURI) :: [text])) -- timeit "IPv4 rnf" $ print (rnf (getAllTextMatches (txt =~ reIPv4) :: [text])) where reEmail = "[_a-zA-Z0-9\\.+-]+@[_a-zA-Z0-9\\.-]+\\.[_a-zA-Z0-9\\.-]+" -- reURI = "[_a-zA-Z0-9]+://[^/\\s?#]+[^\\s?#]+(\\?[^\\s#]*)?(#[^\\s]*)?" -- reIPv4 = "((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])" timeit :: String -> IO a -> IO a timeit msg act = do !t0 <- getPOSIXTime !res <- act !t1 <- getPOSIXTime putStrLn (msg ++ " : " ++ show (t1-t0)) pure res ```

Xitian9 · 2020-08-15T10:26:43Z

Hmm, that looks sticky. I have no particular knowledge that makes it likely that I can solve this problem, but if I have time I might have a poke around.

gwern · 2022-04-26T17:14:45Z

Is there any update on this? Or easy way I could help?

Background: I originally used text-tdfa on gwern.net for regexp tests & rewrites back in 2019 or so with the Text support, but it was so slow I moved to the basic Posix regex library; unfortunately, over the past 2 years I've been running into ever more segfaults and 'strange closures' (GHC would even segfault while compiling the code!), which generally seemed to be related to the regular regexes, which have a long trail of weird issues being reported over the years. Moving back to tdfa (with the regex-compat-tdfa compat layer for the rewrites) mostly seems to resolve the constant segfaulting, but also slows down site compile time by like 2x (from ~1.5h to >3h). Before I spend a lot of time optimizing by hand, it'd be nice if text-tdfa fixed its lowhanging fruit.

andreasabel · 2022-04-26T20:57:57Z

@gwern : I am not planning on working on this, but a high-quality PR is definitely welcome.
My main concern is to keep regex-tdfa working stably in an evolving Haskell ecosystem. But improvements are welcome...

Xitian9 · 2022-04-26T23:15:17Z

One thing to note is that there are some pretty big changes in the text package which will be landing soon: haskell/text#365, haskell/text#348. The second in particular might have a large effect on any work here. Maybe disabling those rules will solve the problem? Maybe it will not, but will defeat any solution which is derived in the meantime. I think looking at the effect of those would be a good first thing to do.

gwern · 2022-04-26T23:28:24Z

Have those already been released? I see on Hackage the last text upload date is 2021-12-24 but the UTF8 change was merged 2021-09-08 and the implicit-fusion disabling 2021-06-21, if I'm reading these bug reports correctly, so shouldn't both of them already be in live text, not just dev HEAD?

Xitian9 · 2022-04-27T03:24:17Z

It's complicated slightly by the fact that text is a boot package, and so is bundled with ghc. You can access it now on hackage, but it won't be the default version you get when you compile (without some extra contortions) until ghc-9.4.

hvr added the bug Something isn't working label Dec 8, 2019

andreasabel added the performance label Jul 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issue w/ new `Text` instances #9

Performance issue w/ new `Text` instances #9

hvr commented Dec 8, 2019

Xitian9 commented Aug 15, 2020 •

edited

Loading

hvr commented Aug 15, 2020 via email •

edited by andreasabel

Loading

Xitian9 commented Aug 15, 2020

gwern commented Apr 26, 2022

andreasabel commented Apr 26, 2022

Xitian9 commented Apr 26, 2022

gwern commented Apr 26, 2022

Xitian9 commented Apr 27, 2022

Performance issue w/ new Text instances #9

Performance issue w/ new Text instances #9

Comments

hvr commented Dec 8, 2019

Xitian9 commented Aug 15, 2020 • edited Loading

hvr commented Aug 15, 2020 via email • edited by andreasabel Loading

Xitian9 commented Aug 15, 2020

gwern commented Apr 26, 2022

andreasabel commented Apr 26, 2022

Xitian9 commented Apr 26, 2022

gwern commented Apr 26, 2022

Xitian9 commented Apr 27, 2022

Performance issue w/ new `Text` instances #9

Performance issue w/ new `Text` instances #9

Xitian9 commented Aug 15, 2020 •

edited

Loading

hvr commented Aug 15, 2020 via email •

edited by andreasabel

Loading