Updated next_partname generation #646
Conversation
… when generating 1000+ slides
So what I did is that on each loop iteration I create the new candidate partname from a cached counter. I know that it could be improved further; I just need some tips on what I have missed during this improvement.
Here's some feedback. Note you are unlikely to get this committed, so you should probably focus on getting it working well enough for your purposes in a local spike.
```diff
@@ -10,3 +10,4 @@ _scratch/
 /spec/gen_spec/spec*.db
 tags
 /tests/debug.py
+venv
```
A git-ignore entry particular to your personal development environment belongs in your own gitignore file, not the project's. You'll find that somewhere like `~/.config/git/ignore`, or perhaps configured via `~/.gitconfig`.
Thank you for that, will check it and update.
```diff
@@ -26,6 +29,7 @@ class OpcPackage(object):

     def __init__(self):
         super(OpcPackage, self).__init__()
+        self.partnames = defaultdict(int)
```
Avoid adding instance variables to the interface of the class (in other words, make instance variables private; if you need the value outside the class, make a property for it). So this would be `self._partnames`.
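A minimal sketch of that suggestion, with the read-only property included only in case something outside the class actually needs the value:

```python
from collections import defaultdict

class OpcPackage(object):
    def __init__(self):
        super(OpcPackage, self).__init__()
        # leading underscore marks this as private to the class
        self._partnames = defaultdict(int)

    @property
    def partnames(self):
        # expose a copy so callers cannot mutate internal state
        return dict(self._partnames)
```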
```diff
-                return PackURI(candidate_partname)
-        raise Exception("ProgrammingError: ran out of candidate_partnames")
+        name = tmpl.split(os.sep)[2]
+        self.partnames[name] += 1
```
I'm not seeing where `._partnames` gets initialized with the part-names already present in the package. I think a better strategy would be to make `._partnames` an `@lazyproperty` (rather than an instance variable) and do the initialization in that lazyproperty (which only gets executed the first time it is called).
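For context, python-pptx ships its own `@lazyproperty` decorator; the general mechanism is a non-data descriptor that computes the value on first access and caches it on the instance, roughly like this sketch:

```python
class lazyproperty(object):
    """Decorator: compute the value on first access, then cache it."""

    def __init__(self, fget):
        self._fget = fget
        self._name = fget.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # accessed on the class, not an instance
        value = self._fget(obj)           # runs only on first access...
        obj.__dict__[self._name] = value  # ...then the cached value shadows us
        return value
```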
```diff
-            if candidate_partname not in partnames:
-                return PackURI(candidate_partname)
-        raise Exception("ProgrammingError: ran out of candidate_partnames")
+        name = tmpl.split(os.sep)[2]
```
This seems unreliable. Are you sure you can count on the name always being the third item? Better to figure it is the last item, like `tmpl.split(os.sep)[-1]`. Also, I'm not sure there's a compelling reason to split out the name at all. What is lost by keeping the path in there? The more important thing would be to split before the `%d` so you could match existing part-names with something like `part_name.startswith("/ppt/slides/slide")`, although be careful because `/ppt/slides/slideMaster42.xml` also starts with `"/ppt/slides/slide"`. You'll probably need to use a regular expression or similar method to separate out the numeric suffix.
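To make that pitfall concrete, here is a small illustration (hypothetical snippet, not code from this PR) of the false prefix match and a regex anchored on the numeric suffix that avoids it:

```python
import re

tmpl = "/ppt/slides/slide%d.xml"
prefix = tmpl.split("%d")[0]  # "/ppt/slides/slide"

# startswith() alone gives a false positive on the longer name:
assert "/ppt/slides/slideMaster42.xml".startswith(prefix)

# Anchoring on the numeric suffix and extension avoids it:
pattern = re.compile(re.escape(prefix) + r"([0-9]+)\.xml$")
assert pattern.match("/ppt/slides/slide42.xml").group(1) == "42"
assert pattern.match("/ppt/slides/slideMaster42.xml") is None
```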
```diff
+        name = tmpl.split(os.sep)[2]
+        self.partnames[name] += 1
+        candidate_partname = tmpl % self.partnames[name]
+        return PackURI(candidate_partname)
```
Where are the tests? You would have found most of these problems with good tests.
Just curious, why wouldn't these changes get merged (even if your comments were addressed)? I understand that a PowerPoint with thousands of slides isn't a case that you should overhaul a bunch of code for, but if it's a simple addition I'm wondering what the hesitation is.

I'll share my case since it's a bit relevant. I'm generating a PowerPoint 12k slides large. To do so I'm creating plots and adding the plots to the slides, 60 slides at a time, in a pipeline. So for example while I'm generating plots for slides 60-119, I simultaneously create slides 0-59. The next iteration I'll generate plots for slides 120-179 and create slides 60-119 with the plots I just generated, and so on. It always takes longer to create the slides than to create the plots, making me wait for the former on each iteration.

The reason I share this is to explain the following graph, which shows the difference in time between the slide and plot generation. At first, creating the first 60 slides only takes 30s longer than creating the plots; this stays under a minute for about 3000 slides, but thereafter quickly grows to delays that are unmanageable. I took the integral of the trendline, and the total delay adds up to 24 hours. If it were constant at 30s, it would be just an hour and a half.

Do you have any recommendations here? Perhaps I can create a bunch of smaller PowerPoints and manually add them together; it would definitely shave a bunch of time but require manual work in a process I was trying to automate. Hope this comment wasn't too long-winded, just wanted to explain the sticky situation I'm in.
Hi @Tzahi12345, the performance problem here arises from the way `next_partname()` searches the whole package for existing partnames on every call.

The reason it does this is that the "state" of partname assignments is reflected only in the current underlying XML, or package-state anyway. So each time a new partname is requested, that full state needs to be consulted to rule out any duplicates. Any caching of that partname-assignments state in a more efficient data structure (like a dict or set having O(1) time complexity) risks getting out of sync with the actual package state. You would have to guarantee that every new partname assignment "registered" itself with that redundant record/cache. This is probably possible, but would require substantial reasoning about the current code and where to place such a "one-and-only-one-per-open-package" data structure and the code that updates it. Do we know that every new part calls this method, for example? And what about when parts are removed?

Another possible way would be to give parts partnames that were not sequential, like maybe SHA1 hashes or hash prefixes computed from something including datetime or whatever. That would be ugly but I don't think it would violate the OPC spec.

Another approach is to set an "I'll be really careful" flag and allow a cache to be built that could possibly become invalid if you didn't follow a prescribed procedure to request new partnames. We did something similar to this for adding new shapes to a slide when a sponsor needed to add thousands of shapes to each slide. I think we called it turbo mode, so that might be worth a search.

Anyway, I suppose the main reason is it's not a big enough problem for someone with project funding to have decided they're willing to sponsor it. If you wanted to "fix" it for your own purposes, you could fork the project. One other approach might be to monkey-patch `next_partname()`.
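For what it's worth, a sketch of that monkey-patch route might look like the following; the import paths are python-pptx's OPC layer, while the caching logic and its caveats are assumptions rather than project API:

```python
from pptx.opc.package import OpcPackage
from pptx.opc.packuri import PackURI

_counters = {}  # tmpl -> highest number assigned so far

def fast_next_partname(self, tmpl):
    """O(1) replacement for next_partname().

    WARNING: assumes this process is the only thing assigning partnames
    and that parts are never removed; otherwise the cached counters can
    drift out of sync with real package state.
    """
    if tmpl not in _counters:
        # seed the counter once from partnames already in the package
        prefix, suffix = tmpl.split("%d")
        high = 0
        for part in self.iter_parts():
            name = str(part.partname)
            if name.startswith(prefix) and name.endswith(suffix):
                middle = name[len(prefix):len(name) - len(suffix)]
                if middle.isdigit():  # skips e.g. "Master42" false prefixes
                    high = max(high, int(middle))
        _counters[tmpl] = high
    _counters[tmpl] += 1
    return PackURI(tmpl % _counters[tmpl])

OpcPackage.next_partname = fast_next_partname
```

Running that patch module before building slides would swap in the fast version for every package opened in the process.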
I think this may be the best/safest approach here. I understand the issues with the cache; one thing I'm curious about is why it may get out of sync. I'm not familiar with the codebase, but if all add/remove part operations use the same function, every addition/removal of a part could update the cache, so there must be something I'm missing here. Anyways, I do understand why it's not high priority or why it would never get merged; it's not a highly requested feature nor something a sponsor is requesting.

I did try out @simkusr's fork and it worked well -- I'm not doing anything fancy so I'm not surprised. I'll keep using that for the time being, thank you for the lengthy answer!
@Tzahi12345 yes, I think you're right. I don't know of any place that adds a part that doesn't use `next_partname()`. On the other hand, if I were only doing it for my own uses, I'd definitely implement it first and give it a try. If it worked for me I'd leave it at that.

I'd start by adding a `_max_partnames` cache. So the O(1) implementation of `next_partname()` would look something like this:
```python
@lazyproperty
def _max_partnames(self):
    """defaultdict(int) mapping partname templates to max count so far.

    ... more details needed to understand proper use ... all keys work,
    value is zero for template with no partname so far, etc. ...
    """
    max_partnames = collections.defaultdict(int)
    # --- initialize partname-number cache ---
    for part in self.iter_parts():
        tmpl, n = parse_partname(part.partname)
        max_partnames[tmpl] = max(max_partnames[tmpl], n)
    return max_partnames

def next_partname(self, tmpl):
    """Return next available |PackURI| partname matching `tmpl`."""
    n = self._max_partnames[tmpl] + 1
    self._max_partnames[tmpl] = n
    return PackURI(tmpl % n)
```

You'll need to work out your own implementation of `parse_partname()`.
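One possible `parse_partname()`, offered as a sketch rather than anything from the library: it returns a `(tmpl, n)` pair such that `tmpl % n` reproduces a numbered partname, and maps unnumbered partnames to a harmless zero entry:

```python
import re

# lazy head, then the numeric suffix and extension anchored at the end
_NUMBERED = re.compile(r"^(.*?)([1-9][0-9]*)(\.[A-Za-z]+)$")

def parse_partname(partname):
    """Return (tmpl, n), e.g. "/ppt/slides/slide42.xml" ->
    ("/ppt/slides/slide%d.xml", 42).

    Partnames with no numeric suffix come back with n == 0 and an
    unmodified template, so they just occupy a defaultdict slot without
    ever colliding with a numbered template.
    """
    m = _NUMBERED.match(str(partname))
    if m is None:
        return str(partname), 0
    head, n, ext = m.groups()
    return "%s%%d%s" % (head, ext), int(n)
```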
Thanks for the tips! I'll see if I can get the time to implement this -- I probably will soon, as this PR isn't as great as I thought (there was still an O(n^2) issue). In orange you can see the performance of this PR. Over the course of the 27hr job, this only saves me 3.5hrs. I haven't looked through the codebase so I'm not sure where your implementation and @simkusr's differ, but these are the results I saw. Worst case, there's some other scaling issue hiding there, but since this issue was found through profiling, I'll assume there isn't.
@Tzahi12345 try skipping adding the image and see how the time changes. Still generate it, just comment out the `add_picture()` call.

There's potentially quadratic time in the image de-duplication lookup, since each added image is checked against every image already in the package.

Also, how big is the file that you generate and how much memory does your machine have? The whole file times a significant multiplier (at least 3, maybe 8) is being built up in memory until you save. You could just be swapping to disk after a certain point.
Btw, I'm not seeing anything O(N^2) in the `next_partname()` changes in this PR.
Is it feasible to disembowel :-) add_picture() in a local fork to remove the check? Similarly, is it feasible to add a "check for existence" boolean parameter that defaults to True?
@MartinPacker such a thing could make sense. I think there would be better options, like caching the SHA1 hash of all current images in a set or dict to allow O(1) lookup. But right now I'm not sure that's where the time is going. I can't see it taking even a second to search through 12,000 objects for a constant (already computed) SHA1. I'm thinking the time has to be going somewhere else. My money is on memory being swapped to disk at the moment. We've got something like 1000 seconds to account for on each slide. |
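If it did turn out to be the lookup, a cache along those lines could be as small as this sketch (hypothetical class; it assumes `iter_parts()` and `Part.blob` for the raw image bytes, both of which exist in python-pptx, while everything else is illustrative):

```python
import hashlib

class ImageSha1Cache(object):
    """Hypothetical O(1) duplicate-image lookup keyed by SHA1."""

    def __init__(self, package):
        self._parts_by_sha1 = {}
        # one O(N) pass up front instead of one scan per image added
        for part in package.iter_parts():
            if str(part.partname).startswith("/ppt/media/"):
                digest = hashlib.sha1(part.blob).hexdigest()
                self._parts_by_sha1[digest] = part

    def find(self, image_blob):
        """Return the existing part for these bytes, or None."""
        return self._parts_by_sha1.get(hashlib.sha1(image_blob).hexdigest())

    def add(self, image_blob, part):
        """Register a new image part so later adds de-dup against it."""
        self._parts_by_sha1[hashlib.sha1(image_blob).hexdigest()] = part
```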
I re-ran and checked for hard faults after a noticeable increase in delay (slide index 5000 or so) and there was still a decent amount of memory free. Some stats:
But I did extrapolate out to slide 12000, and it looks like I'd only have about 400MB of memory left by the end of it, so I could imagine that would contribute to the latency. I also ran without the `add_picture()` call.
@Tzahi12345 okay, this finding is very revealing. So my next suspect is this method: https://github.com/scanny/python-pptx/blob/master/pptx/package.py#L171-L183

It iterates through each of the package relationships before each image-add operation to see if the same image already exists. That way, if you're using the same image on every slide or whatever, you only embed one copy of it in the .pptx package.

So that's O(N), with N roughly the number of parts. Each slide is a part and each image is a part, so that probably gets up to 2500 items to iterate through for each added image by the end. Total iterations would be N(N-1)/2, which is around 3 million. It's still hard for me to believe that would take hours, but it's a worthy suspect. To test this hypothesis, we can make that lookup a no-op so every image is simply stored as a new part.

Btw, what is your CPU doing while this process is running? Is it pinned at 100% (of one core) or is it lollygagging around waiting for disk access or something?
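One quick way to run that test, assuming the lookup at the linked lines is `_ImageParts._find_by_sha1()` (verify the name against your installed version before trying this):

```python
# WARNING: timing experiment only. This disables image de-duplication,
# so repeated images will bloat the saved .pptx.
from pptx.package import _ImageParts

_ImageParts._find_by_sha1 = lambda self, sha1: None  # always "not found"
```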
Actually, you know what? This looks like the real problem: https://github.com/scanny/python-pptx/blob/master/pptx/package.py#L141-L155

That traversal does a membership test against a list for every part it visits, so it's quadratic all by itself, and running it before every image-add compounds the damage. Where quadratic time is about 3 million time-units, quartic time is more like 5 trillion time-units. That I can totally believe takes hours, no matter how small "time-unit" is :)
(Sorry if this is parenthetic but I want to check one thing that will affect my project - md2pptx). So, @scanny, it seems you're saying that the PowerPoint format allows reuse of graphics - where only one copy is stored and the slides that need it point to it. I was considering - in md2pptx - trying to avoid the creation of duplicate graphics where they are used more than once in a presentation. It seems I don't need to - if python-pptx is deduping for me. I do hope this fix of using a set works. That would completely obviate the need for me to dedup in my code - even if I could figure out how to. |
@MartinPacker Yes, that's my understanding of PowerPoint's behavior, and that's how I implemented image storage in python-pptx.

I don't recall how thoroughly I experimented with PowerPoint on that initially. I may have just tried all on the same slide, so those would be interesting results to reproduce. But I saw it was clearly not duplicating the image in at least same-slide circumstances, so I just went with "save one time only". The size of a PowerPoint file is generally determined primarily by the size of the images it contains, so it didn't take much to make me believe its designers would have taken the time to implement this space optimization.

The de-duping does make things a little more complicated though, as one might expect. Basically the strategy I used is to hash each image binary (e.g. .jpg "file") and use that to detect duplicates. Then each new one needs to be distinct from all the existing ones before it is added; otherwise we just point to the existing one. That's what the SHA1 lookup linked above is for.

A little fancier is getting rid of one when the last reference is deleted. I think that happens automatically on save, like that image part becomes an orphaned node when the last reference is removed and only connected nodes in the part graph end up in the saved package. Anyway, all that should be transparent to your app.

Note that changing the membership-test collection from a list to a set shouldn't change any of this behavior, just make it faster. I believe further time optimizations are possible, but perhaps not without risk or broader change. I'm betting this one-liner will reduce the time by a very large amount, and then we can see if we need to go further. I would commit that change once it proves out.
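For the curious, the shape of that one-liner, written against a simplified traversal rather than the actual python-pptx source (treat the structure as illustrative): membership tests against a list cost O(n) each, while a set makes them O(1) and yields the same parts.

```python
def iter_parts(package):
    """Visit each part in the package exactly once (simplified sketch)."""
    visited = set()  # the one-liner: this was a list in the slow version
    stack = [package]
    while stack:
        source = stack.pop()
        for rel in source.rels.values():
            if rel.is_external:
                continue
            part = rel.target_part
            if part in visited:  # O(1) with a set, O(n) with a list
                continue
            visited.add(part)
            yield part
            stack.append(part)
```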
Now partnames are generated on the fly and cached for the next iteration of slides. This update solves the bottleneck mentioned in #644.