This repository has been archived by the owner on Sep 11, 2020. It is now read-only.

RFC: provide insight into the status of gogit #570

Open. Wants to merge 1 commit into base: master.
Conversation

@strib (Contributor) commented Aug 30, 2017

(i.e., it gets invoked by git when you run `git clone
<protocol>://...`, etc. from the command line). On big repos, gogit
and our backing store are slow enough that printed progress
indicators make a huge difference to usability, and make it easier
to debug where slowdowns occur.

This is a quick-and-dirty PR that lets gogit communicate its status
back to some listener. I haven't fixed up the tests yet, or written
any new tests, because I first wanted feedback on whether this is
even a good approach. Please take a look and let me know!

(Note that this is different from the sideband progress mechanism, which
only reports the progress of the remote server. This PR reports
the status of the local client.)

@strib (Contributor, Author) commented Aug 30, 2017

@mcuadros @erizocosmico: I don't want this merged as-is; I'm just looking for feedback on whether this is a reasonable approach and, if not, any suggestions on how to do it better. Thanks!

@mcuadros (Contributor)

Personally, I don't like having all the status logic mixed throughout the whole package.

I'd prefer an optional storage.Storer wrapper where calls to certain methods are intercepted in order to capture metrics.

If some information cannot be retrieved this way, I'd rather add a new interface to the storage package that allows the plumbing to communicate this information.
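As a sketch of that wrapper idea: the snippet below intercepts writes through a pared-down, stand-in storer interface in order to count stored objects. The interface and method names here are simplified stand-ins for illustration only, not go-git's actual storage.Storer API:

```go
package main

import "fmt"

// ObjectStorer is a pared-down stand-in for go-git's storer
// interfaces; the real Storer has many more methods.
type ObjectStorer interface {
	SetEncodedObject(hash string) error
}

// memStorer is a trivial in-memory implementation.
type memStorer struct{ objects []string }

func (m *memStorer) SetEncodedObject(hash string) error {
	m.objects = append(m.objects, hash)
	return nil
}

// countingStorer wraps another ObjectStorer and counts writes:
// the interception pattern suggested above, without touching
// the wrapped implementation.
type countingStorer struct {
	inner   ObjectStorer
	written int
}

func (c *countingStorer) SetEncodedObject(hash string) error {
	c.written++ // capture the metric, then delegate
	return c.inner.SetEncodedObject(hash)
}

func main() {
	cs := &countingStorer{inner: &memStorer{}}
	for _, h := range []string{"a1", "b2", "c3"} {
		cs.SetEncodedObject(h)
	}
	fmt.Println(cs.written) // 3
}
```

The wrapper stays optional: callers who don't care about metrics keep using the inner storer directly.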

@mcuadros (Contributor)

It would be nice to sort out the ideas and write down all the information that needs (or is wanted) to be retrieved, and from which operations.

Related #549

@strib (Contributor, Author) commented Aug 30, 2017

Ah, thanks for the pointer to #549. I can summarize some things there, though I'm also interested in push (more so than fetch/clone, actually).

Having an interface rather than a bare channel seems reasonable. However, the type of info that's useful for status seems almost entirely orthogonal to the Storer interface to me. It's more about the processing/computation that happens before data begins flowing through Storer. If you take a look at this PR, you'll see it's mostly around walking the object structure, and Storer is too low level to be able to track the stats in a meaningful way.

Later today I'll put together a list of what's tracked in this PR and post it here and in #549.

@strib (Contributor, Author) commented Aug 30, 2017

Here is a summary of the useful status information I've found that gogit needs to report. This is from the perspective of using go-git as a library to build a git-remote-helper. That means my project interfaces with the local git repo as a remote, using the file protocol. Thus, actually transferring the data back and forth to the remote is not a bottleneck for us, and I haven't included that in this PR or in the list below (though in a more general interface, it would of course be needed!).

For fetch/clone:

  • Fetch: decoding objects being transferred from the remote, and writing them to the Storer:
    • Number of objects decoded so far / total number of objects expected.
    • The numerator could maybe be figured out from a Storer implementation, through the SetEncodedObject method, though it would need to be aware of what stage it was in.
  • Index: building the index has three phases, with each object being touched in each phase. So for purposes of tracking progress, this needs to be broken down into three separate stages:
    • IndexHash: Writing the hash of each object to the packfile.
      • Number of hashes written so far / total number of hashes expected.
    • IndexCRC: Writing the CRC of each object to the packfile.
      • Number of CRCs written so far / total number of CRCs expected.
    • IndexOffset: Writing the offset of each object to the packfile.
      • Number of offsets written so far / total number of offsets expected.

For push:

  • Count: walking the revlist to count how many objects need to be transferred:
    • Total number of objects counted so far.
  • Read: reading/decoding each of the objects from the Storer.
    • Number of objects read so far / total number of objects expected.
    • The numerator could maybe be figured out from a Storer implementation, through the DeltaObject and EncodedObject methods, though it would need to be aware of what stage it was in.
  • Sort: sorting the objects by size/type.
    • This is typically pretty fast, but maybe not for huge repos. No specific numbers to track here other than time spent in this stage.
  • Delta: running delta compression for each object.
    • Number of object deltas considered so far / total number of delta considerations expected.
  • Send: encoding objects into the packfile stream.
    • Number of objects encoded / number of objects expected.

Lastly, there is a Done phase for both fetches and pushes, which is probably unnecessary as the caller can just end things when the Push or Fetch call returns.
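Taken together, the phases above could be modeled as a simple stage enum. The names below are purely illustrative, not anything go-git defines:

```go
package main

import "fmt"

// Stage enumerates the fetch/clone and push phases described above,
// plus the (possibly unnecessary) Done phase. Hypothetical names.
type Stage int

const (
	StageFetch Stage = iota // decoding objects from the remote
	StageIndexHash          // index phase 1: object hashes
	StageIndexCRC           // index phase 2: object CRCs
	StageIndexOffset        // index phase 3: object offsets
	StageCount              // push: walking the revlist
	StageRead               // push: reading objects from the Storer
	StageSort               // push: sorting objects by size/type
	StageDelta              // push: delta compression
	StageSend               // push: encoding into the packfile stream
	StageDone               // both: operation finished
)

// String makes stages printable in progress messages.
func (s Stage) String() string {
	return [...]string{"fetch", "index-hash", "index-crc",
		"index-offset", "count", "read", "sort", "delta",
		"send", "done"}[s]
}

func main() {
	fmt.Println(StageDelta) // prints "delta"
}
```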

So I think 2 of the 9 phases can be covered by a Storer implementation, as long as it gains additional knowledge about when it is entering a particular phase. (This could be done via an optional Storer wrapper, as you suggest.) 6 of the other phases occur in places (revlist, packfile.Encoder, and packfile.deltaSelector) that have an EncodedObjectStorer, but not a full Storer. The remaining phase is in idxfile.Encoder, which only has a raw io.Writer and would probably need a new object plumbed through for stats.

@mcuadros What do you think is the best way for me to proceed? Here's one proposal:

  • Add a simple new plumbing/storer.StatCollector interface that has methods for:
    • Registering the start of each stage, as indicated by a provided enum, and optionally a target expected count.
    • Incrementing the number of items processed during a given stage.
  • Pass this directly into the idxfile.Encoder constructor.
  • Try to cast it from Storer and EncodedObjectStorer in the remaining places.
  • In this PR, instead of using StatusChan, call the cast object via the API above, if it is non-nil.
    • I would slightly prefer calling it explicitly instead of relying on the Storer object to interpret calls to DeltaObject, EncodedObject and SetEncodedObject in a particular way, for the sake of consistency across all the stats, but if you feel strongly about it I could be convinced.
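A rough Go sketch of the proposed StatCollector; all names and signatures here are hypothetical, describing the proposal rather than any existing go-git API:

```go
package main

import "fmt"

// StatCollector matches the proposal above: register the start of a
// stage (with an optional expected total, 0 meaning unknown) and
// bump a per-stage counter. Hypothetical interface, not go-git API.
type StatCollector interface {
	StageStart(stage string, expectedTotal int)
	StageIncrement(stage string)
}

// logCollector is a toy implementation tracking done/total per stage.
type logCollector struct {
	totals map[string]int
	done   map[string]int
}

func newLogCollector() *logCollector {
	return &logCollector{
		totals: map[string]int{},
		done:   map[string]int{},
	}
}

func (l *logCollector) StageStart(stage string, expectedTotal int) {
	l.totals[stage] = expectedTotal
}

func (l *logCollector) StageIncrement(stage string) {
	l.done[stage]++
}

func main() {
	var c StatCollector = newLogCollector()
	c.StageStart("delta", 2)
	c.StageIncrement("delta")
	c.StageIncrement("delta")
	lc := c.(*logCollector)
	fmt.Printf("%d/%d\n", lc.done["delta"], lc.totals["delta"]) // 2/2
}
```

Code such as idxfile.Encoder could then take a StatCollector directly, while the EncodedObjectStorer call sites type-assert for it.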

Open to other suggestions.

@mcuadros (Contributor) commented Sep 8, 2017

I haven't forgotten you. ;)
