ParallelTable Musings #185

Open: wants to merge 12 commits into base: master

JSKenyon
Collaborator

  • Tests added / passed

    $ py.test -v -s daskms/tests

    If the pep8 tests fail, the quickest way to correct
    this is to run autopep8 and then flake8 and
    pycodestyle to fix the remaining issues.

    $ pip install -U autopep8 flake8 pycodestyle
    $ autopep8 -r -i daskms
    $ flake8 daskms
    $ pycodestyle daskms
    
  • Fully documented, including HISTORY.rst for all changes
    and one of the docs/*-api.rst files for new API

    To build the docs locally:

    pip install -r requirements.readthedocs.txt
    cd docs
    READTHEDOCS=True make html
    

This PR is a WIP demonstrating a possible approach for parallel reads from threads. This approach relies on casacore/casacore#1167, which allows me to avoid using soft links. Instead, with the changes in that PR, a table opened from multiple threads does not share its underlying plain table object.

The approach that I am attempting here is almost certainly imperfect but it is very simple. It defines a ParallelTable class which inherits from pyrap.tables.table. This, unfortunately, introduces some limitations as the base class is defined in C++. That said, doing this allows us to create a ParallelTable object which masquerades as a normal table - the only difference is that when a read method is called, it first checks if the thread has an open instance of the table. If not, the table is opened in the thread and added to the cache. I make use of weakref to ensure that all tables are closed when the ParallelTable object is GCed.
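
The caching behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeTable` is a hypothetical stand-in for pyrap.tables.table so the example runs without casacore, and `PerThreadTable` and `_close_all` are names invented here. Unlike ParallelTable, the sketch does not inherit from the table class.

```python
import threading
import weakref


class FakeTable:
    """Hypothetical stand-in for pyrap.tables.table (illustration only)."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def getcol(self, name):
        return f"{name}@{self.path}"

    def close(self):
        self.closed = True


def _close_all(cache):
    # Called by weakref.finalize once the owning object is collected.
    for table in cache.values():
        table.close()


class PerThreadTable:
    """Opens one table per calling thread and caches it (assumed design)."""

    def __init__(self, path):
        self._path = path
        self._cache = {}
        self._lock = threading.Lock()
        # The finalizer holds the cache dict, not self, so it cannot
        # keep this object alive.
        self._finalizer = weakref.finalize(self, _close_all, self._cache)

    def _get_table(self):
        ident = threading.get_ident()
        with self._lock:
            table = self._cache.get(ident)
            if table is None:
                # First read from this thread: open and cache a table.
                table = self._cache[ident] = FakeTable(self._path)
        return table

    def getcol(self, name):
        # Read methods are routed to the calling thread's own table.
        return self._get_table().getcol(name)
```

Keying the cache on the thread identifier is one possible choice; a `threading.local` would also work but makes it harder to close every cached table from the finalizer.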

The changes in this PR seem to work although some tests are broken - I suspect this may have to do with subtables, but I have yet to investigate. Note that there is plenty of ugly debugging code in the PR. I will remove it if this coalesces into a stable approach.

One important thing to note is the fact that the cf.ThreadPoolExecutor has been dummied out with a DummyThreadPoolExecutor and DummyFuture. This seems to work for a simple read case, though further testing is needed. This would be a nice simplification as it suggests that we could get away without internal threadpools. That said, the changes in the PR also work with the internal threadpools with the caveat that those threadpools need more than one thread (as otherwise we serialise).
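
For readers unfamiliar with the idea, a synchronous stand-in for an executor might look roughly like the sketch below. The class names match those mentioned above, but the bodies are an assumption about how such dummies could be written, not a copy of the code in this PR. Only the small subset of the concurrent.futures interface shown here is imitated.

```python
class DummyFuture:
    """Holds an already-computed result (or exception)."""

    def __init__(self, value=None, exception=None):
        self._value = value
        self._exception = exception

    def result(self, timeout=None):
        # The work has already run, so there is nothing to wait for.
        if self._exception is not None:
            raise self._exception
        return self._value

    def done(self):
        return True


class DummyThreadPoolExecutor:
    """Runs submitted callables immediately in the calling thread."""

    def __init__(self, max_workers=None):
        self._max_workers = max_workers  # accepted but unused

    def submit(self, fn, *args, **kwargs):
        try:
            return DummyFuture(value=fn(*args, **kwargs))
        except Exception as exc:
            # Defer the error until result() is called, as a real
            # future would.
            return DummyFuture(exception=exc)

    def shutdown(self, wait=True):
        pass
```

With this arrangement every call executes in whichever thread invoked `submit`, so any parallelism has to come from the threads calling it.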

Finally, note that the processes scheduler does not function optimally for either this PR or master. Both will repeatedly open tables for reasons I don't fully understand. I suspect that the caching mechanism on the TableProxy doesn't function as expected in this specific case. What is particularly confusing is that it does seem to operate correctly in the distributed case using a LocalCluster with multiple workers.

table.close()


class ParallelTable(Table):
Member

@sjperkins sjperkins Feb 24, 2022

Suggested change
- class ParallelTable(Table):
+ class ParallelTable(metaclass=...)

As I read the ParallelTable class, it is predicated on proxying the encapsulated Table objects as opposed to overriding inherited methods of the Table class.

Therefore it's unnecessary to inherit from Table as proxied table objects are created in _get_table.

This way, it's also possible to use a metaclass without getting tangled up with a boost python subclass.
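
A rough sketch of the composition idea, leaving the metaclass out entirely: the wrapper holds no base class and simply forwards attribute access to a per-thread table. `FakeTable` and `ComposedParallelTable` are hypothetical names used only for illustration, and `_get_table` here is an assumed shape rather than the method in this PR.

```python
import threading


class FakeTable:
    """Hypothetical stand-in for pyrap.tables.table."""

    def __init__(self, path):
        self.path = path

    def getcol(self, name):
        return f"{name} from {self.path}"

    def nrows(self):
        return 10


class ComposedParallelTable:
    """Does not inherit from the table class; it only forwards to it."""

    def __init__(self, path, factory=FakeTable):
        self._path = path
        self._factory = factory
        self._local = threading.local()

    def _get_table(self):
        # One table instance per thread, created on first use.
        table = getattr(self._local, "table", None)
        if table is None:
            table = self._local.table = self._factory(self._path)
        return table

    def __getattr__(self, name):
        # Only reached for attributes missing on the wrapper itself.
        # Private names are refused to avoid recursion if the wrapper
        # is only partially initialised.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._get_table(), name)
```

The trade-off is that the wrapper is not itself an instance of the table class, so any code that requires a real table object would need to be handed the underlying instance.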

Collaborator Author

I am trying to grasp this, but I am still struggling. The problem I see is that failing to inherit from pyrap.tables.table means that the ParallelTable object will not in fact appear to be a table, i.e. self._table_future = table = ex.impl.submit(factory, *args, **kwargs) will not be a future pointing at a table. What I could see working is defining a ParallelTableProxy which simply inherits from TableProxy but defines its own metaclass which modifies the behaviour of get* operations. Currently, ParallelTable is itself a table (i.e. you can do something like pt.taql(query, ParallelTable)) in addition to simply rerouting get* operations through a cache. In other words, there will always be one extra copy open: the "root" table which gets used for unmodified methods. I will take a swing at the ParallelTableProxy idea.

Collaborator Author

@JSKenyon JSKenyon Feb 25, 2022

Would you mind clarifying what the problem is with this approach? Currently, all the ParallelTable does is override some inherited methods of pyrap.tables.table prior (!!) to being embedded in a TableProxy. This just means that the proxy proxies these special methods, rather than those on the base class. This yields a really simple solution as subsequent operations proceed as normal. The ParallelTable is a table, supports all pyrap.tables.table methods, and will have all the relevant multiton patterns applied inside the TableProxy. I have tried creating a ParallelTableProxy, but that becomes difficult as one needs to access the cache inside get* methods.

Collaborator Author

I have implemented a ParallelTableProxy. At present it segfaults, but I know why. The problem stems from the fact that getter_wrapper in reads.py calls methods directly on the underlying table object, rather than via the TableProxy. This is problematic as TableProxy.method may not be the same as pyrap.tables.table.method. This is precisely where my current segfaults come from, as getter_wrapper calls the non-threadsafe get* functions on the underlying table.
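
The difference between the two call paths can be illustrated with a toy example. Everything here is hypothetical: `RawTable`, `Proxy`, `getter_on_table` and `getter_via_proxy` are invented names, and the real getter_wrapper in reads.py has a different signature that is not reproduced here. The sketch only shows why a method overridden on a proxy has no effect when the caller reaches past the proxy to the wrapped object.

```python
class RawTable:
    """Stand-in for the underlying table object."""

    def getcol(self, name):
        return ("raw", name)


class Proxy:
    """Stand-in for a proxy that overrides a read method."""

    def __init__(self, table):
        self._table = table

    def getcol(self, name):
        # Imagine this routes to a per-thread table instead.
        return ("proxied", name)


def getter_on_table(proxy, name):
    # Bypasses the proxy: the override above is never invoked.
    return proxy._table.getcol(name)


def getter_via_proxy(proxy, name):
    # Goes through the proxy, so the overridden method is used.
    return proxy.getcol(name)
```

In this toy case the first helper always reaches the raw method, which mirrors the situation described above where the non-threadsafe get* functions end up being called.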

self._cached_tables = {}
self._table_path = args[0] # TODO: This should be checked.

super().__init__(*args, **kwargs)
Member

Suggested change
- super().__init__(*args, **kwargs)

daskms/parallel_table.py (resolved conversation)
@bennahugo
Collaborator

bennahugo commented Feb 25, 2022 via email

@JSKenyon
Collaborator Author

I am also wary of rushing to a solution on the casacore end @bennahugo. Ger may have a different solution in mind. I am just experimenting on the dask-ms end and trying to get something working. This is all moot if we just end up with a different scenario in which everything locks up.

@bennahugo
Collaborator

Sure, I'm just noting down our discussion -- it won't help to put the cart in front of the horses here, but it would be good to have a draft solution for this. Thanks for the work on the dask-ms front.

@JSKenyon JSKenyon mentioned this pull request Feb 25, 2022