Intermediate Examples Legacy

Ben Murray edited this page Apr 23, 2021 · 3 revisions

Intermediate - Legacy (pre 0.5)

Thinking in numpy

The most important thing to understand when working with readers is that, once you have read the data from a reader, it is in a numpy ndarray, and all of the good practices for working with numpy arrays apply. In particular, performance suffers massively if you drop out of numpy for any reason (by iterating with explicit Python loops, for example).

Combining filters - awkward way

foo = session.get(file['foo']).data[:] # a numeric value
bar = session.get(file['bar']).data[:] # a boolean value
result = session.create_numeric(file, 'result', 'bool')

# terrible way - very slow
for i in range(len(foo)):
  result.write_part(foo[i] == 1 and not bar[i])
result.complete()

# better but still slow
result_ = np.zeros(len(foo), dtype=bool)
for r in range(len(foo)):
  result_[r] = foo[r] == 1 and not bar[r]
result.write(result_)

Combining filters - numpy way

foo = session.get(file['foo']).data[:] # a numeric value
bar = session.get(file['bar']).data[:] # a boolean value
result = session.create_numeric(file, 'result', 'bool')
result.write((foo == 1) & ~bar)

Sorting

Putting your data into the most appropriate order is very important for the scaling of complex operations. Certain operations in ExeTera require that the data is presented in sorted order to run correctly:

  • ordered_merges
  • generation and application of spans
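A span is a run of identical values in a sorted key, which is why these operations require sorted data. As an illustration only (plain numpy, not the ExeTera API), span boundaries over a sorted key can be sketched as:

```python
import numpy as np

# a key that has already been sorted
key = np.asarray([10, 10, 10, 20, 20, 30])

# span boundaries: indices where the key value changes, plus both ends
spans = np.concatenate(([0], np.nonzero(key[1:] != key[:-1])[0] + 1, [len(key)]))
# key[0:3] is the span of 10s, key[3:5] the span of 20s, key[5:6] the span of 30s
```

Each adjacent pair in `spans` then delimits one group of equal keys.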

Changing sorted order can be done in one of two ways:

  1. Session.sort_on
  2. session.dataset_sort_index followed by session.apply_index

Either way, you must specify how you want the fields to be sorted, by selecting the fields to sort on. You can select one or more fields; when multiple fields are given, they are applied in order, with the first field acting as the primary sort key.
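The multi-key ordering behaves like numpy's lexsort, except that lexsort takes its keys in reverse order (last key is primary). A plain numpy sketch, not the ExeTera API:

```python
import numpy as np

foreign_key = np.asarray([2, 1, 2, 1])
created_at = np.asarray([5, 7, 3, 6])

# np.lexsort treats the LAST key as primary, so pass the secondary key first:
# rows are ordered by foreign_key, then by created_at within each foreign_key
order = np.lexsort((created_at, foreign_key))
```

Applying `order` to each field then yields the rows sorted primarily on `foreign_key`.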

Sorting with sort_on

Session.sort_on is provided for when you want to sort all of the fields in a group. You can sort in place, or you can sort and write the resulting sorted fields to a destination group.

Sorting in place

    # sort in place
    session.sort_on(source_group, source_group, ('id',), write_mode='overwrite')

Sorting to another group

    # source to a destination group
    session.sort_on(source_group, dest_group, ('foreign_key', 'created_at'))

Sorting with dataset_sort_index

When sorting with dataset_sort_index, we first get the permutation of the current indices to the sorted order. We can then apply this to each of the fields that we want to reorder, as follows:

    index = session.dataset_sort_index(session.get(src['foo']), session.get(src['bar']))
    fields_to_sort = ('boo', 'far', 'boofar')
    for f in fields_to_sort:
      session.apply_index(index, session.get(src[f]), session.create_like(src[f], dest, f))

    # alternatively, you can apply the index destructively to a field
    for f in fields_to_sort:
      session.apply_index(index, src[f], src[f]) 
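Conceptually, the permutation returned by dataset_sort_index plays the same role as numpy's argsort, and applying it is a fancy-indexing operation. A minimal plain-numpy sketch of the same idea:

```python
import numpy as np

key = np.asarray([30, 10, 20])
values = np.asarray(['c', 'a', 'b'])

# permutation of the current indices to sorted order
index = np.argsort(key)

# reorder any field with the same permutation
sorted_values = values[index]
```

The same `index` can be reused to reorder every field in the group consistently.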

Joining / merging - recommended way

ExeTera has merge_left, merge_right and merge_inner functions that provide pandas-like merge functionality.

id = session.get(patients['id'])
patient_id = session.get(assessments['patient_id'])
age = session.get(patients['age'])
height = session.get(patients['height'])
assessment_age = session.create_like(patients['age'], assessments, 'age')
assessment_height = session.create_like(patients['height'], assessments, 'height')
session.merge_left(patient_id, id, right_fields=(age, height), right_writers=(assessment_age, assessment_height))
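Conceptually, a left merge maps each foreign key to its row in the right table and gathers the right-hand fields through that mapping. A plain-numpy sketch (not the ExeTera API), assuming the primary key is sorted and every foreign key is present in it:

```python
import numpy as np

# right table: one row per patient
patient_ids = np.asarray([1, 2, 3])   # primary key (sorted)
age = np.asarray([40, 50, 60])

# left table: one row per assessment
patient_id = np.asarray([2, 1, 2, 3])  # foreign key

# map each foreign key to its row in the right table, then gather
rows = np.searchsorted(patient_ids, patient_id)
assessment_age = age[rows]
```

merge_left additionally handles writers and missing keys; this sketch only shows the underlying gather.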

Joining / merging - old way

Steps for carrying out joins

  • Requires primary key and foreign key
    • primary key for this example is patients.id
    • foreign key for this example is assessments.patient_id
  • Generate a set of indices for the foreign key (get_index)
    • Note, the data must be sorted so that data is sorted primarily on foreign key
  • Perform the join given a field in the destination space (required for length), an aggregation function, and the foreign key (join)
# Example: get assessment count per patient

# create the foreign key
pids = session.get(patients['id'])
apids = session.get(assessments['patient_id'])
wfkey = session.create_numeric(assessments, 'fkey', 'int64')

session.get_index(pids, apids, wfkey)

# join
fkey = session.get(assessments['fkey'])
a_counts = session.create_numeric(patients, 'a_counts', 'uint32')

# aggregated_counts is the aggregation function described in the steps above
session.join(pids, fkey, aggregated_counts, a_counts)
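For this particular example, counting assessments per patient amounts to counting occurrences of each foreign-key index. A plain-numpy sketch of the aggregation step (not the ExeTera API):

```python
import numpy as np

# fkey: for each assessment, the row index of its patient
fkey = np.asarray([0, 0, 1, 2, 2, 2])
n_patients = 3

# count how many assessments map to each patient row
a_counts = np.bincount(fkey, minlength=n_patients)
```

Other aggregations (min, max, mean per patient) follow the same pattern over spans of the sorted foreign key.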

Advanced

Use of Session.process

Session.process provides scalability in the case that even a few individual fields are too large to fit in memory. Given a set of readers and writers, it internally iterates over subsets of the readers and writers, applying a predicate that you supply to each chunk in turn.

def multiply_by_2(foo, bar):
    bar[:] = foo * 2

foo = session.get(file['foo'])
bar = session.create_numeric(file, 'bar', 'uint32')

session.process({'foo': foo}, {'bar': bar}, multiply_by_2)
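The pattern that process applies internally can be sketched as reading, transforming and writing fixed-size slices. This is a simplified plain-Python illustration (the helper `process_chunked` is hypothetical, not part of ExeTera):

```python
import numpy as np

def process_chunked(src, predicate, chunk_size=1 << 20):
    # apply predicate to each fixed-size chunk of src and collect the outputs
    out_chunks = []
    for start in range(0, len(src), chunk_size):
        chunk = src[start:start + chunk_size]
        out = np.empty_like(chunk)
        predicate(chunk, out)
        out_chunks.append(out)
    return np.concatenate(out_chunks)

def multiply_by_2(foo, bar):
    bar[:] = foo * 2

result = process_chunked(np.arange(10), multiply_by_2, chunk_size=4)
```

Because only one chunk per field is resident at a time, peak memory is bounded by the chunk size rather than the field length.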


It is often sensible to write derived values to a second datastore. This approach lets you easily separate your derived results from the standard data, and saves space, as your initial datastore can be shared by multiple implementations.

with h5py.File(args.source, 'r') as src:
  with h5py.File(args.dest, 'w') as dest: # recreate dest from scratch
    ...

with h5py.File(args.source, 'r') as src:
  with h5py.File(args.dest, 'r+') as dest: # open existing dest for update
    ...