Plotting manuscript progression for methods manuscript #952

agitter · 2021-04-29T21:28:11Z

For the ACM-BCB submission we could plot the following manuscript statistics over time:

Number of authors who added themselves to metadata
Number of references
Word count

All of these are available in the files variables.json and references.json in the output branch of the repo. Some quick Python experimentation shows how to access these values and the corresponding date:

>>> import json
>>> variables = json.load(open('variables.json'))
>>> variables['pandoc']['date-meta']
'2021-04-28'
>>> variables['manubot']['date']
'April 28, 2021'
>>> len(variables['manubot']['authors'])
49
>>> variables['manubot']['manuscript_stats']['word_count']
131943
>>> references = json.load(open('references.json', encoding='utf-8'))
>>> len(references)
1428

I don't know the most efficient way to get these values for every commit in the output branch. However, with the output branch checked out,

git log --pretty=format:"%h" > output-commits.txt

dumps a list of all commits to a text file that we could iterate over.

Pseudocode for an algorithm could look like:

dump all commits on the output branch
foreach output branch commit
    check out the commit in a subprocess
    load the JSON files as shown above and store the commit SHA-1, manuscript date, author count, word count, and reference count in a row in a dataframe
plot the data from the dataframe

Doing this with a Python script would be messy because of the subprocess calls needed to issue git commands, but it's possible. I don't know the GitPython package well enough to do it that way instead. For example,

subprocess.run(["git", "checkout", "3839cc2e"])

will check out a specific commit from the output branch.
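One possible way to put the pieces together is sketched below. Note that it swaps in a different technique from the checkout described above: it reads each file with `git show <commit>:<path>`, which leaves the working tree untouched. The helper names (`git`, `parse_commits`, `stats_row`, `collect_stats`) are hypothetical, and it assumes both JSON files sit at the root of the output branch and that commits missing a file or key can simply be skipped.

```python
import json
import subprocess


def git(repo, *args):
    """Run a git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def parse_commits(log_output):
    """Split the output of `git log --pretty=format:%h` into a list of hashes."""
    return [line.strip() for line in log_output.splitlines() if line.strip()]


def stats_row(commit, variables, references):
    """Build one row of statistics from the parsed JSON of a single commit."""
    return {
        "commit": commit,
        "date": variables["pandoc"]["date-meta"],
        "authors": len(variables["manubot"]["authors"]),
        "word_count": variables["manubot"]["manuscript_stats"]["word_count"],
        "references": len(references),
    }


def collect_stats(repo=".", branch="output"):
    """Return one row per commit on `branch`, newest first (git log order)."""
    rows = []
    for commit in parse_commits(git(repo, "log", "--pretty=format:%h", branch)):
        try:
            variables = json.loads(git(repo, "show", f"{commit}:variables.json"))
            references = json.loads(git(repo, "show", f"{commit}:references.json"))
            rows.append(stats_row(commit, variables, references))
        except (subprocess.CalledProcessError, KeyError, json.JSONDecodeError):
            # Assumption: older commits may lack a file or a key; skip them.
            continue
    return rows
```

The list of dictionaries returned by `collect_stats()` can be passed directly to `pandas.DataFrame(rows)` if a dataframe is preferred for plotting.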


rando2 · 2021-04-29T21:29:54Z

Thank you so much @agitter! I'm going to try to finish getting the text cleaned up (ish), then send it to authors and start working on parsing #17, and then hopefully take this on in the morning!

rando2 · 2021-04-30T12:16:39Z

Initial prototype is working 👍 Thanks for the point in the right direction, @agitter, those json suggestions make it WAY easier than what I was imagining (which involved a lot of regex).

This is obviously EXTREMELY ROUGH and needs to be visually cleaned up in basically every way, but it is a graph of the data!

agitter · 2021-04-30T12:25:25Z

That's amazing!

We should be able to flip the order of the dates and show fewer x-axis ticks (e.g. monthly) without too much trouble.
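A minimal sketch of both adjustments, independent of the plotting library: sort the rows oldest-first, then keep one tick per calendar month. It assumes each row is a dictionary with an ISO-formatted `date` string (such as the `date-meta` value); the function names are hypothetical. `git log` lists the newest commits first, which may be why the dates come out reversed, but that is a guess.

```python
from datetime import date


def chronological(rows):
    """Return the rows sorted oldest-first by their ISO 'date' field."""
    return sorted(rows, key=lambda row: date.fromisoformat(row["date"]))


def monthly_ticks(rows):
    """Return the earliest date seen in each calendar month."""
    ticks = []
    seen_months = set()
    for row in chronological(rows):
        day = date.fromisoformat(row["date"])
        month = (day.year, day.month)
        if month not in seen_months:
            seen_months.add(month)
            ticks.append(day)
    return ticks
```

With matplotlib, the result of `monthly_ticks(rows)` could be passed to `ax.set_xticks(...)` when the x values are dates. `matplotlib.dates.MonthLocator` is another option that places ticks on month boundaries rather than on commit dates.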

Can we account for the big spikes in word count? My first guess is that the initial big spike was adding the reviews as an appendix. Then the other sharp increase and decrease could be when you duplicated text to convert from a single paper to multiple papers, but I don't know whether the timing matches.
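To check the timing, one option is to flag the commits where the word count changes sharply and then inspect those commits by hand. The sketch below assumes rows sorted oldest-first, each a dictionary with `commit` and `word_count` keys; the function name and the 20% threshold are arbitrary choices for illustration.

```python
def find_jumps(rows, threshold=0.2):
    """Return (previous commit, current commit, relative change) for big jumps.

    `rows` must be sorted oldest-first. `threshold` is the minimum absolute
    relative change in word count to report (0.2 means 20%).
    """
    jumps = []
    for previous, current in zip(rows, rows[1:]):
        before = previous["word_count"]
        after = current["word_count"]
        if before == 0:
            continue  # avoid dividing by zero for an empty manuscript
        change = (after - before) / before
        if abs(change) >= threshold:
            jumps.append((previous["commit"], current["commit"], change))
    return jumps
```

Each flagged pair can then be examined with `git diff --stat <previous> <current>` to see which files account for the change.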

Nevertheless, having the data plotted is very cool and helps make the point that a git-managed manuscript enables lots of inspection and analysis that is impossible with a typical writing process. Maybe not "impossible" with LaTeX, but it would be painful to analyze every commit without having these stats ready to go in json files.

rando2 · 2021-04-30T12:50:52Z

> Can we account for the big spikes in word count?

Yes! The first one is when we added the appendix and the second one is likely when we accidentally duplicated the appendix 😆 Unfortunately this makes it super clear that it sat there duplicated for a long time before anyone noticed! I think most of the text I duplicated for the manuscript splitting process is still duplicated (since I use blame pretty heavily while adding the attributions of text that I moved between documents!)

agitter added the Technical label (Technical concerns, enhancements, etc. for the GitHub enthusiasts) Apr 29, 2021

agitter assigned rando2 Apr 29, 2021

agitter unassigned rando2 Apr 29, 2021

rando2 mentioned this issue Apr 30, 2021

Add visualization of manuscript stats #953

Closed

2 tasks

rando2 mentioned this issue Aug 30, 2021

Visualize manuscript/project growth #1019

Closed

2 tasks
