Performance issue w/ new `Text` instances #9
Comments
Knowing nothing about what causes this issue, something to note is that `regex-tdfa` supplies `SPECIALIZE` pragmas for `String`, both strict and lazy `ByteString`, and `Seq Char`. Notably, it does not provide `SPECIALIZE` pragmas for strict or lazy `Text`.
With that in mind, do you still see this issue after using the code in PR #8?
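For readers unfamiliar with the pragma, here is a minimal sketch of what a `SPECIALIZE` pragma looks like. `countMatching` is a hypothetical function made up for illustration; it is not part of `regex-tdfa`, and the sketch says nothing about whether missing specializations are the actual cause here.

```haskell
module Main where

import Data.List (foldl')

-- Hypothetical overloaded function, standing in for a class-polymorphic matcher.
-- Without specialization, GHC may pass the Eq dictionary at runtime.
countMatching :: Eq a => a -> [a] -> Int
countMatching x = foldl' (\n y -> if y == x then n + 1 else n) 0
-- Ask GHC to also compile a dedicated copy for Char (assumption: this is
-- the type we care about in this toy example).
{-# SPECIALIZE countMatching :: Char -> [Char] -> Int #-}

main :: IO ()
main = print (countMatching 'a' "banana")
```

The pragma only changes how the function is compiled, not what it computes, so any effect would show up in timings rather than in results.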
I'm afraid so.
I haven't had time to investigate further yet, but maybe you can?
The code below clearly shows the slowdown; its output for me is
```
== ByteString
92
email len : 0.575371484s
()
email rnf : 0.589730101s
== String
92
email len : 0.527972566s
()
email rnf : 0.538322764s
== Text
92
email len : 0.542759858s
()
email rnf : 29.904703049s
```
Counting the matches is equally fast for all three types ("email len"); the trouble starts when actually generating the match strings ("email rnf"), where `Text` shows a huge slowdown of roughly 55-fold for the email regexp (29.9 s versus about 0.54 s for `String`).
I commented out the other test-regexps as the email-regexp is enough to demonstrate the issue very clearly:
```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# OPTIONS_GHC -Wall #-}

{- cabal:
build-depends: time, text, deepseq, bytestring, regex-tdfa, regex-base
-}

module Main where

import Control.DeepSeq
import Control.Exception
import Control.Monad ()
import qualified Data.ByteString as BS
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.Text.IO as T
import Data.Time.Clock.POSIX
import Text.Regex.TDFA

main :: IO ()
main = do
    !txt <- T.readFile "input-text.txt"

    putStrLn "== ByteString"
    !bs <- evaluate (T.encodeUtf8 txt)
    doBench (bs :: BS.ByteString)

    putStrLn "== String"
    str <- evaluate (force (T.unpack txt))
    doBench (str :: [Char])

    putStrLn "== Text"
    doBench (txt :: Text)

{-# NOINLINE doBench #-}
doBench :: forall text . (Show text, NFData text, RegexLike Regex text) => text -> IO ()
doBench txt = do
    timeit "email len" $ print (length (getAllTextMatches (txt =~ reEmail) :: [text]))
    -- timeit "URI len" $ print (length (getAllTextMatches (txt =~ reURI) :: [text]))
    -- timeit "IPv4 len" $ print (length (getAllTextMatches (txt =~ reIPv4) :: [text]))
    timeit "email rnf" $ print (rnf (getAllTextMatches (txt =~ reEmail) :: [text]))
    -- timeit "URI rnf" $ print (rnf (getAllTextMatches (txt =~ reURI) :: [text]))
    -- timeit "IPv4 rnf" $ print (rnf (getAllTextMatches (txt =~ reIPv4) :: [text]))
  where
    reEmail = "[_a-zA-Z0-9\\.+-]+@[_a-zA-Z0-9\\.-]+\\.[_a-zA-Z0-9\\.-]+"
    -- reURI = "[_a-zA-Z0-9]+://[^/\\s?#]+[^\\s?#]+(\\?[^\\s#]*)?(#[^\\s]*)?"
    -- reIPv4 = "((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])"

timeit :: String -> IO a -> IO a
timeit msg act = do
    !t0 <- getPOSIXTime
    !res <- act
    !t1 <- getPOSIXTime
    putStrLn (msg ++ " : " ++ show (t1-t0))
    pure res
```
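The gap between "email len" and "email rnf" is consistent with how laziness works: `length` only walks the list spine, while `rnf` forces every element. The following is a small base-only sketch of that difference, not a diagnosis of `regex-tdfa`. `forceElems` is a hypothetical helper that stands in for `rnf`, and the `error` thunks stand in for match strings that have not been built yet.

```haskell
module Main where

import Control.Exception (SomeException, evaluate, try)

-- Hypothetical stand-in for rnf: force each element to weak head normal form.
forceElems :: [a] -> ()
forceElems = foldr seq ()

main :: IO ()
main = do
  -- Two unevaluated "matches"; evaluating either one raises an error.
  let matches = [error "match not built yet", error "match not built yet"] :: [String]
  -- length inspects only the list spine, so the elements stay unevaluated.
  print (length matches)
  -- Forcing the elements evaluates the first thunk, which raises the error.
  r <- try (evaluate (forceElems matches)) :: IO (Either SomeException ())
  putStrLn (either (const "elements were forced") (const "elements untouched") r)
```

So the "len" timings would not include whatever work goes into constructing each matched substring, which is where the `Text` slowdown appears.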
Hmm, that looks sticky. I have no particular knowledge that makes it likely that I can solve this problem, but if I have time I might have a poke around.
Is there any update on this? Or easy way I could help? Background: I originally used text-tdfa on gwern.net for regexp tests & rewrites back in 2019 or so with the Text support, but it was so slow I moved to the basic Posix regex library; unfortunately, over the past 2 years I've been running into ever more segfaults and 'strange closures' (GHC would even segfault while compiling the code!), which generally seemed to be related to the regular regexes, which have a long trail of weird issues being reported over the years. Moving back to tdfa (with the
@gwern : I am not planning on working on this, but a high-quality PR is definitely welcome.
One thing to note is that there are some pretty big changes in the text package which will be landing soon: haskell/text#365, haskell/text#348. The second in particular might have a large effect on any work here. Maybe disabling those rules will solve the problem? Maybe it will not, but will defeat any solution which is derived in the meantime. I think looking at the effect of those would be a good first thing to do.
Have those already been released? I see on Hackage the last
It's complicated slightly by the fact that
When I tried benchmarking, I noticed a strange and significant performance-penalty effect that is exposed by the `Text` support recently merged into `regex-tdfa`. (The `String` and `ByteString` instances don't exhibit this performance effect.)

It turns out that the `regex-tdfa-text` package already suffered from this but nobody seems to have reported it yet and it's probably not too apparent in simple regex-matching applications. Since the `Text` support is fundamentally more or less the same code as the `ByteString` implementation, I'm suspecting a weird interaction with the fusion-rules in `text`.

TODO: Provide repro-case