Plotting manuscript progression for methods manuscript #952

Open
agitter opened this issue Apr 29, 2021 · 4 comments
Labels
Technical Technical concerns, enhancements, etc. for the GitHub enthusiasts

Comments

@agitter
Collaborator

agitter commented Apr 29, 2021

For the ACM-BCB submission we could plot the following manuscript statistics over time:

  • Number of authors who added themselves to metadata
  • Number of references
  • Word count

All of these are available in the files variables.json and references.json in the output branch of the repo. Some quick Python experimentation shows how to access these values and the corresponding date:

>>> import json
>>> variables = json.load(open('variables.json'))
>>> variables['pandoc']['date-meta']
'2021-04-28'
>>> variables['manubot']['date']
'April 28, 2021'
>>> len(variables['manubot']['authors'])
49
>>> variables['manubot']['manuscript_stats']['word_count']
131943
>>> references = json.load(open('references.json', encoding='utf-8'))
>>> len(references)
1428
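
The lookups above could be wrapped in one small helper so that each commit yields a single row. This is only a sketch: the function name `extract_stats` and the output field names are made up for illustration, and the example inputs are tiny fabricated stand-ins that mirror the key layout shown above, not real manuscript data.

```python
def extract_stats(variables, references):
    """Collect the statistics of interest from the parsed JSON contents.

    `variables` is the parsed variables.json and `references` is the parsed
    references.json. The returned field names are hypothetical.
    """
    manubot = variables["manubot"]
    return {
        "date": variables["pandoc"]["date-meta"],  # ISO date string
        "author_count": len(manubot["authors"]),
        "word_count": manubot["manuscript_stats"]["word_count"],
        "reference_count": len(references),
    }


# Made-up inputs that only imitate the structure of the real files.
example_variables = {
    "pandoc": {"date-meta": "2021-04-28"},
    "manubot": {
        "authors": [{"name": "A"}, {"name": "B"}],
        "manuscript_stats": {"word_count": 1200},
    },
}
example_references = [{"id": "ref1"}, {"id": "ref2"}, {"id": "ref3"}]

row = extract_stats(example_variables, example_references)
print(row)
```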

I don't know the most efficient way to get these for every commit in the output branch. However,

git log --pretty=format:"%h" > output-commits.txt

dumps a list of all commits to a text file that we could iterate over.

Pseudocode for an algorithm could look like:

  • dump all commits on the output branch
  • for each output branch commit:
      • check out the commit in a subprocess
      • load the JSON files as shown above and store the commit sha1, manuscript date, author count, word count, and reference count in a row in a dataframe
  • plot the data from the dataframe

Doing this in a Python script would be messy because of the subprocess calls needed to issue git commands, but it's possible. The GitPython package might be cleaner, but I don't know it well enough to do it that way. For example,

subprocess.run(["git", "checkout", "3839cc2e"])

will check out a specific commit from the output branch.
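
One possible shape for the loop in the pseudocode above, as a rough sketch rather than a finished script. It swaps `git checkout` for `git show <commit>:<path>`, which prints a file as it existed at a commit without touching the working tree. The function names, the row field names, and the choice to silently skip commits that lack the files or keys are all assumptions made for illustration.

```python
import json
import subprocess


def read_file_at_commit(commit, path):
    """Return the text of `path` as stored in `commit`, via `git show`."""
    result = subprocess.run(
        ["git", "show", f"{commit}:{path}"],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if git fails
    )
    return result.stdout


def collect_rows(commits, read_file=read_file_at_commit):
    """Build one dict per commit with the statistics to plot.

    `read_file` is injectable so the loop can be tried without a repository.
    """
    rows = []
    for commit in commits:
        try:
            variables = json.loads(read_file(commit, "variables.json"))
            references = json.loads(read_file(commit, "references.json"))
            manubot = variables["manubot"]
            rows.append({
                "commit": commit,
                "date": variables["pandoc"]["date-meta"],
                "author_count": len(manubot["authors"]),
                "word_count": manubot["manuscript_stats"]["word_count"],
                "reference_count": len(references),
            })
        except (subprocess.CalledProcessError, KeyError, json.JSONDecodeError):
            # Assumed policy: skip commits where a file or key is missing.
            continue
    return rows
```

The commit list could come from the `output-commits.txt` file described above, for example `collect_rows(open("output-commits.txt").read().split())`, and the resulting list of dicts can be passed to `pandas.DataFrame` for plotting.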

@agitter added the Technical label Apr 29, 2021
@rando2
Collaborator

rando2 commented Apr 29, 2021

Thank you so much @agitter! I'm going to try to finish getting the text cleaned up (ish), then send it to authors and start working on parsing #17, and then hopefully take this on in the morning!

@rando2
Collaborator

rando2 commented Apr 30, 2021

Initial prototype is working 👍 Thanks for the point in the right direction, @agitter, those json suggestions make it WAY easier than what I was imagining (which involved a lot of regex).

This is obviously EXTREMELY ROUGH and needs to be visually cleaned up in basically every way, but it is a graph of the data!
[Image: Screen Shot 2021-04-30 at 8 16 13 AM]

@agitter
Collaborator Author

agitter commented Apr 30, 2021

That's amazing!

We should be able to flip the order of the dates and show fewer x-axis ticks (e.g. monthly) without too much trouble.
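
A sketch of how the ordering and tick thinning might be handled before plotting, assuming each row carries an ISO `date` string like the `date-meta` value. `git log` lists commits newest first, so rows collected in that order would need sorting. The function names and sample rows here are invented for illustration.

```python
from datetime import date


def chronological(rows):
    """Return the rows sorted oldest first by their ISO 'date' field."""
    return sorted(rows, key=lambda r: date.fromisoformat(r["date"]))


def monthly_tick_positions(rows):
    """Indices of the first row in each calendar month.

    Expects rows that are already in chronological order.
    """
    ticks = []
    seen_months = set()
    for index, row in enumerate(rows):
        d = date.fromisoformat(row["date"])
        month = (d.year, d.month)
        if month not in seen_months:
            seen_months.add(month)
            ticks.append(index)
    return ticks


# Made-up rows in newest-first order, as `git log` would produce them.
rows = [
    {"date": "2021-04-28"},
    {"date": "2021-04-02"},
    {"date": "2021-03-15"},
    {"date": "2021-03-01"},
]
ordered = chronological(rows)
ticks = monthly_tick_positions(ordered)
print([ordered[i]["date"] for i in ticks])
```

The tick indices could then be handed to whatever plotting library is in use, for example matplotlib's `Axes.set_xticks`, if the x-axis is one position per commit.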

Can we account for the big spikes in word count? My first guess is that the initial big spike was adding the reviews as an appendix. Then the other sharp increase and decrease could be when you duplicated text to convert from a single paper to multiple papers, but I don't know whether the timing matches.
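
To check whether the timing matches, one option is to list the commits where the word count changes sharply from the previous commit. This is a hypothetical sketch: `find_jumps`, the `min_change` threshold, and the sample rows are all made up, and the rows are assumed to be in chronological order with `commit`, `date`, and `word_count` fields.

```python
def find_jumps(rows, min_change=10000):
    """Return (commit, date, change) for each large word-count change.

    `change` is the difference from the previous row, so positive values
    are sharp increases and negative values are sharp decreases.
    """
    jumps = []
    for previous, current in zip(rows, rows[1:]):
        change = current["word_count"] - previous["word_count"]
        if abs(change) >= min_change:
            jumps.append((current["commit"], current["date"], change))
    return jumps


# Fabricated example rows, oldest first (not real manuscript data).
rows = [
    {"commit": "c1", "date": "2020-05-01", "word_count": 50000},
    {"commit": "c2", "date": "2020-05-03", "word_count": 52000},
    {"commit": "c3", "date": "2020-06-10", "word_count": 90000},
    {"commit": "c4", "date": "2020-06-12", "word_count": 91000},
    {"commit": "c5", "date": "2020-09-20", "word_count": 60000},
]
jumps = find_jumps(rows)
print(jumps)
```

Each flagged commit hash can then be inspected with `git show` to see which change caused the jump.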

Nevertheless, having the data plotted is very cool and helps make the point that a git-managed manuscript enables lots of inspection and analysis that is impossible with a typical writing process. Maybe not "impossible" with LaTeX, but it would be painful to analyze every commit without having these stats ready to go in json files.

@rando2
Collaborator

rando2 commented Apr 30, 2021

> Can we account for the big spikes in word count?

Yes! The first one is when we added the appendix and the second one is likely when we accidentally duplicated the appendix 😆 Unfortunately this makes it super clear that it sat there duplicated for a long time before anyone noticed! I think most of the text I duplicated for the manuscript splitting process is still duplicated (since I use blame pretty heavily while adding the attributions of text that I moved between documents!)
