Feature request: symlink/symbolic link for faster/smaller compiled site versions #786

Open
gwern opened this issue Jul 22, 2020 · 12 comments · Fixed by #810

Comments

@gwern
Contributor

gwern commented Jul 22, 2020

I would like a symlinkCompiler that creates symbolic links (or hard links) as a drop-in replacement for a standard static-file copying routine such as my let static = route idRoute >> compile copyFileCompiler. This would be a performance optimization when compiling sites with many large static files.

As gwern.net gets larger, particularly with audio/images/videos generated for my deep learning experiments, compiling it spends increasingly more time and disk space creating _site/. Even with an NVMe SSD, the time starts to add up; more problematically, I'm starting to run out of disk space for creating 40GB _site/ folders just to upload a few modified files & then delete them. Almost all of that disk space & IO goes to copying things like PDFs or MP4s from one folder to another. There's no particular reason those copies couldn't just be symbolic or hard links back to the original files; I could then use rsync with --copy-links to have rsync follow the links when it syncs with my gwern.net server.

Looking at the File.hs module which defines copyFileCompiler, it seems to be mostly wrappers around a single call to System.Directory's copyFileWithMetadata. Is there any reason a symbolic link version couldn't be defined by swapping out that for createFileLink like below:

diff --git a/lib/Hakyll/Core/File.hs b/lib/Hakyll/Core/File.hs
index 49af659..6a5775e 100644
--- a/lib/Hakyll/Core/File.hs
+++ b/lib/Hakyll/Core/File.hs
@@ -8,6 +8,8 @@ module Hakyll.Core.File
     , copyFileCompiler
     , TmpFile (..)
     , newTmpFile
+    , SymlinkFile (..)
+    , symlinkFileCompiler
     ) where
 
 
@@ -20,6 +22,7 @@ import           System.Directory              (copyFileWithMetadata)
 import           System.Directory              (copyFile)
 #endif
 import           System.Directory              (doesFileExist,
+                                                createFileLink,
                                                 renameFile)
 import           System.FilePath               ((</>))
 import           System.Random                 (randomIO)
@@ -56,6 +59,19 @@ copyFileCompiler = do
     provider   <- compilerProvider <$> compilerAsk
     makeItem $ CopyFile $ resourceFilePath provider identifier
 
+--------------------------------------------------------------------------------
+-- | This will not copy a file but create a symlink, which can save space & time for static sites with many large static files which would normally be handled by copyFileCompiler. (Note: the user will need to make sure their sync method handles symbolic links correctly!)
+newtype SymlinkFile = SymlinkFile FilePath
+    deriving (Binary, Eq, Ord, Show, Typeable)
+--------------------------------------------------------------------------------
+instance Writable SymlinkFile where
+    write dst (Item _ (SymlinkFile src)) = createFileLink src dst
+--------------------------------------------------------------------------------
+symlinkFileCompiler :: Compiler (Item SymlinkFile)
+symlinkFileCompiler = do
+    identifier <- getUnderlying
+    provider   <- compilerProvider <$> compilerAsk
+    makeItem $ SymlinkFile $ resourceFilePath provider identifier

The one part that puzzles me is that createFileLink src dst creates self-links. I can try something like prepending the absolute path like ("/home/gwern/wiki/"++src) but I don't understand where the correct relative/absolute path prefix comes from since I thought src dst would look like docs/foo.pdf _site/docs/foo.pdf but that's obviously not how it works...

(While a hack, prepending does work: I go from a _site/ of 41GB to <0.2GB. A good 10 minutes faster too.)

@gwern
Contributor Author

gwern commented Oct 31, 2020

Any feedback on this? I'd particularly like this upstreamed because my attempts to define it inside my own hakyll.hs have foundered on type issues with the deriving Binary & Item; they work inside File.hs but not elsewhere, requiring me to keep a forked Hakyll installed. (At this point, I'm low enough on disk space that I wouldn't be able to compile gwern.net without this optimization.)

@Minoru
Collaborator

Minoru commented Nov 11, 2020

my attempts to define it inside my own hakyll.hs have foundered on type issues with the deriving Binary & Item

Please submit this as a PR, I'll merge it.

@gwern
Contributor Author

gwern commented Nov 12, 2020

Minoru pushed a commit that referenced this issue Nov 12, 2020
@Minoru
Collaborator

Minoru commented Nov 12, 2020

Thanks @gwern!

Minoru added a commit that referenced this issue Nov 12, 2020
…s, this can be a major speedup (see #786) (#810)

Co-authored-by: gwern <[email protected]>
@gwern
Contributor Author

gwern commented Mar 10, 2021

So I happened to undo my local patch while doing a reinstall of my Pandoc toolchain to pull in a fix related to <figure> handling, and I think there was a misunderstanding here: my patch above is not correct. It results in symbolic self-links which are totally broken, eg

...
ls: cannot access '_site/Zeo.page': Too many levels of symbolic links
$ ls -l _site/*.page
lrwxrwxrwx 1 gwern gwern 32 Mar  9 21:01 _site/2012-election-predictions.page -> ./2012-election-predictions.page
...

That is what I was referring to in my discussion of hacking src to make it point to a correct filepath like _site/2012-election-predictions.page -> /home/gwern/wiki/2012-election-predictions.page. It needs some relatively small tweak, unknown to me, to make it correct and point to ../.

I thought when you committed you'd fixed that, but trying just now it seems that is not the case?

@Minoru
Collaborator

Minoru commented Mar 10, 2021

My bad! I somehow overlooked your warning about relative links when I suggested merging this.

I thought src dst would look like docs/foo.pdf _site/docs/foo.pdf but that's obviously not how it works...

From my reading of the code, that's exactly how it works. The problem is that relative symlinks are resolved relative to the directory in which they reside, so "./docs/foo.pdf", when resolved from inside "_site/docs/", points to "_site/docs/docs/foo.pdf".
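
To make the resolution rule concrete, here is a minimal sketch of it as a pure function. The name resolveLinkTarget is hypothetical (it is not part of Hakyll or System.Directory), and the function only manipulates path strings; it never touches the disk.

```haskell
import System.FilePath (normalise, takeDirectory, (</>))

-- | Hypothetical helper: where a symlink's target ends up pointing, given the
-- path of the link itself. Relative targets are interpreted from the
-- directory that contains the link; absolute targets are returned unchanged
-- (that is how '</>' treats an absolute right-hand side).
resolveLinkTarget :: FilePath  -- ^ path of the link, e.g. "_site/docs/foo.pdf"
                  -> FilePath  -- ^ target stored in the link
                  -> FilePath
resolveLinkTarget linkPath target =
    normalise (takeDirectory linkPath </> target)
```

On a POSIX system, resolveLinkTarget "_site/docs/foo.pdf" "./docs/foo.pdf" gives "_site/docs/docs/foo.pdf", and resolveLinkTarget "_site/2012-election-predictions.page" "./2012-election-predictions.page" gives the link's own path, which is the self-link reported above.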

One way to fix it would be to use System.Directory.makeAbsolute in symlinkFileCompiler. But I don't like this, because then the _site directory can't be moved to another place without breaking the links.

The other option is to make src relative to dst, but I don't see a function in System.Directory that does this. The only candidate, System.FilePath.makeRelative, explains that it doesn't introduce .. into the paths, because one of the parent directories might be itself a symlink, and going up from it might lead us to a different place altogether.

We can write our own "relativization" function: 1) take destinationDirectory, replace all components with ..; 2) take the item route, drop the filename, replace directory components with ..; 3) concatenate (1), (2), and the route. This still suffers from the same problem that's outlined in the doc for makeRelative, but I think it's on the user if they copy something into a directory which is itself a symlink. (But I think this situation is impossible, because Hakyll executes rules in arbitrary order, and if the directory doesn't exist, it'll be created.)
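
The three steps above could be sketched roughly as follows. This is only an illustration, not a tested Hakyll patch: relativeLinkTarget is a hypothetical name, and the sketch assumes that both paths are relative to the same working directory, contain no ".." components, and that none of the directories involved is itself a symlink.

```haskell
import System.FilePath (joinPath, normalise, splitDirectories, takeDirectory)

-- | Hypothetical helper: the target to store in a symlink located at
-- 'linkPath' so that it points at 'src'. One ".." is emitted for every
-- directory component of the link's location (destination directory plus
-- the directories of the route), then the source path is appended.
relativeLinkTarget :: FilePath  -- ^ link location, e.g. "_site/docs/foo.pdf"
                   -> FilePath  -- ^ source file, e.g. "./docs/foo.pdf"
                   -> FilePath
relativeLinkTarget linkPath src = joinPath (ups ++ [normalise src])
  where
    linkDir = takeDirectory (normalise linkPath)
    -- A link in the working directory itself has directory ".", which
    -- needs no ".." at all.
    ups = [".." | c <- splitDirectories linkDir, c /= "."]
```

With these assumptions, relativeLinkTarget "_site/docs/foo.pdf" "./docs/foo.pdf" yields "../../docs/foo.pdf", and a top-level page such as "_site/Zeo.page" gets the target "../Zeo.page".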

Alternatively, use hard links. But that'll require separate code for *nix and Windows, I believe.
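
For the *nix side, a hard-link variant might look like the sketch below. It assumes the unix package's System.Posix.Files.createLink and is therefore POSIX-only; hardLinkFile is a hypothetical name, not an existing Hakyll function. Hard links also only work when source and destination are on the same filesystem.

```haskell
import Control.Monad (when)
import System.Directory (createDirectoryIfMissing, doesFileExist, removeFile)
import System.FilePath (takeDirectory)
import System.Posix.Files (createLink)

-- | Hypothetical helper: make 'dst' a hard link to 'src' instead of a copy.
-- Because both names refer to the same file, relative-path resolution is
-- not an issue here.
hardLinkFile :: FilePath -> FilePath -> IO ()
hardLinkFile src dst = do
    createDirectoryIfMissing True (takeDirectory dst)
    -- createLink fails if the destination already exists, so remove any
    -- file left over from a previous build first.
    exists <- doesFileExist dst
    when exists (removeFile dst)
    createLink src dst
```

In a Writable instance, this would take the place of createFileLink src dst from the original patch.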

I don't have the energy to work on this myself. If you want to push this to completion, I'm open to further discussions, you can bounce ideas off me if you want. Otherwise I can just revert the current version, re-open this issue, and wait until someone gets motivated to finish this off.

jaspervdj added a commit that referenced this issue Mar 14, 2021
…yte sites, this can be a major speedup (see #786) (#810)"

This reverts commit 8415767.
@Minoru
Collaborator

Minoru commented Mar 14, 2021

Okay, the patch is now reverted. Sorry for the mess I've caused here >_<

Let's wait until someone has energy to brush this up and submit a new one.

@Minoru Minoru reopened this Mar 14, 2021
@gwern
Contributor Author

gwern commented Mar 19, 2021

If it's unclear which function to use, perhaps we can push it onto the user. Right now my hack is to add in a /home/gwern/wiki/ prefix to make the symlink paths absolute (and then it rsyncs fine to the actual server). Perhaps the function can be parameterized to take such a prefix? Defaulting to the current working directory. So then I'd write compile (symlinkFileCompiler Nothing) or to be explicit, compile (symlinkFileCompiler $ Just "/home/gwern/wiki/").
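
The path handling behind such a parameter could be sketched as below. absoluteLinkSource is a hypothetical name used only for illustration; a real symlinkFileCompiler would additionally need the Hakyll plumbing (Item, Writable) from the patch above.

```haskell
import System.Directory (getCurrentDirectory)
import System.FilePath (normalise, (</>))

-- | Hypothetical helper: the absolute path a symlink should point at.
-- 'Nothing' falls back to the current working directory; 'Just prefix'
-- uses the caller-supplied base directory.
absoluteLinkSource :: Maybe FilePath -> FilePath -> IO FilePath
absoluteLinkSource mPrefix src = do
    prefix <- maybe getCurrentDirectory pure mPrefix
    -- '</>' leaves an already-absolute 'src' unchanged; 'normalise'
    -- drops the leading "./" that relative source paths may carry.
    pure (normalise (prefix </> src))
```

For instance, absoluteLinkSource (Just "/home/gwern/wiki/") "./docs/foo.pdf" returns "/home/gwern/wiki/docs/foo.pdf" on a POSIX system.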

@Minoru
Collaborator

Minoru commented Mar 28, 2021

Sorry for the delay in replying; I got buried under some life stuff.

Upon re-reading the thread, I think the easiest way forward is to use hard links, and implement them just for the OS that you, @gwern, are using. If somebody needs it on a different OS, they can submit a patch later. If somebody absolutely needs symbolic links (e.g. because their destination directory is on a different disk), they can re-visit this issue and see what they can come up with. What do you think of that?

In case you're against that, I'll also comment on parameterising symlinkFileCompiler: I think it's better to have a separate function for this, like symlinkFileCompilerWithBasePath or something. Once the path-relativization kinks are worked out, we can provide a shorter symlinkFileCompiler that doesn't need a path.

@gwern
Contributor Author

gwern commented Mar 28, 2021

I have not tried using hardlinks before, but I'm willing to give it a try.

@gwern
Contributor Author

gwern commented May 7, 2024

Any update on this? Was there any hardlink patch I was supposed to test?

@Minoru
Collaborator

Minoru commented May 8, 2024

Not from me; I didn't find the energy to write the hardlinking patch yet.
