Intermediate Examples Legacy
The most important thing to understand when working with readers is that, once you have read the data from a reader, it is in a numpy ndarray, and all of the good practices for working with numpy arrays apply. In particular, performance is massively affected by dropping out of numpy for any reason (iterating with explicit Python loops, for example).
import numpy as np

foo = session.get(file['foo']).data[:]  # a numeric value
bar = session.get(file['bar']).data[:]  # a boolean value
result = session.create_numeric(file, 'result', 'bool')

# terrible way - very slow
for i in range(len(foo)):
    result.write_part(np.array([foo[i] == 1 and not bar[i]]))
result.complete()

# better but still slow
result_ = np.zeros(len(foo), dtype=bool)
for r in range(len(foo)):
    result_[r] = foo[r] == 1 and not bar[r]
result.write(result_)
# the numpy way - fast
foo = session.get(file['foo']).data[:]  # a numeric value
bar = session.get(file['bar']).data[:]  # a boolean value
result = session.create_numeric(file, 'result', 'bool')
result.write((foo == 1) & (bar == False))
Putting your data into the most appropriate order is very important for the scaling of complex operations. Certain operations in ExeTera require that the data is presented in sorted order to run correctly:
- ordered_merges
- generation and application of spans
Changing the sort order can be done in one of two ways:
- Session.sort_on
- session.dataset_sort_index followed by session.apply_index
Either way, you must specify how you want the fields to be sorted. This is done by selecting the fields to sort on; you can select one or more fields, and they are applied in order, the first field being the primary sort key.
Session.sort_on is provided for when you want to sort all of the fields in a group. You can sort in place, or you can sort and write the resulting sorted fields to a destination group:
# sort in place
session.sort_on(source_group, source_group, ('id',), write_mode='overwrite')
# sort to a destination group
session.sort_on(source_group, dest_group, ('foreign_key', 'created_at'))
When sorting with dataset_sort_index, we first get the permutation of the current indices to the sorted order. We can then apply this permutation to each of the fields that we want to reorder, as follows:
index = session.dataset_sort_index(session.get(src['foo']), session.get(src['bar']))

# apply the index, writing each sorted field to a destination group
fields_to_sort = ('boo', 'far', 'boofar')
for f in fields_to_sort:
    session.apply_index(index, session.get(src[f]), session.create_like(src[f], dest, f))

# alternatively, you can apply the index destructively to a field
for f in fields_to_sort:
    session.apply_index(index, src[f], src[f])
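For intuition, the permute-then-apply pattern above is analogous to numpy's argsort followed by fancy indexing; a minimal sketch with illustrative values:

import numpy as np

values = np.array([30, 10, 20])
index = np.argsort(values)       # permutation to sorted order: [1, 2, 0]
sorted_values = values[index]    # apply the permutation: [10, 20, 30]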
ExeTera has merge_left, merge_right and merge_inner functions that provide pandas-like merge functionality:
id = session.get(patients['id'])
patient_id = session.get(assessments['patient_id'])
age = session.get(patients['age'])
height = session.get(patients['height'])
assessment_age = session.create_like(patients['age'], assessments, 'age')
assessment_height = session.create_like(patients['height'], assessments, 'height')
session.merge_left(patient_id, id, right_fields=(age, height), right_writers=(assessment_age, assessment_height))
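For intuition only, the merge above corresponds roughly to the following pandas left merge (a sketch, assuming the fields have been read into memory; this is not how ExeTera implements it):

import pandas as pd

patients_df = pd.DataFrame({'id': id.data[:], 'age': age.data[:], 'height': height.data[:]})
assessments_df = pd.DataFrame({'patient_id': patient_id.data[:]})
merged = assessments_df.merge(patients_df, how='left', left_on='patient_id', right_on='id')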
Steps for carrying out joins:
- Requires a primary key and a foreign key
  - the primary key for this example is patients.id
  - the foreign key for this example is assessments.patient_id
- Generate a set of indices for the foreign key (get_index). Note that the data must be sorted primarily on the foreign key
- Perform the join given a field in the destination space (required for length), an aggregation function, and the foreign key (join)
# Example: get assessment count per patient

# create the foreign key
pids = session.get(patients['id'])
apids = session.get(assessments['patient_id'])
wfkey = session.create_numeric(assessments, 'fkey', 'int64')
session.get_index(pids, apids, wfkey)

# join, passing an aggregation function (aggregated_counts) that counts entries per key
fkey = session.get(assessments['fkey'])
a_counts = session.create_numeric(patients, 'a_counts', 'uint32')
session.join(pids, fkey, aggregated_counts, a_counts)
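For intuition only, here is a minimal numpy sketch of the kind of foreign-key index that get_index produces, assuming both key arrays are sorted (the values are illustrative):

import numpy as np

pids_ = np.array([10, 20, 30])            # primary keys (patients)
apids_ = np.array([10, 10, 20, 30, 30])   # foreign keys (assessments)
fkey_ = np.searchsorted(pids_, apids_)    # row index of each assessment's patient
# fkey_ is [0, 0, 1, 2, 2]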
Session.process provides scalability in the case that even individual fields are too large to fit in memory. Given a set of readers and writers, it internally iterates over subsets of the readers and writers, applying a predicate that you supply to each chunk in turn:
def multiply_by_2(foo, bar):
    bar[:] = foo * 2

foo = session.get(file['foo'])
bar = session.create_numeric(file, 'bar', 'uint32')
session.process({'foo': foo}, {'bar': bar}, multiply_by_2)
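For intuition only, a minimal sketch of the chunked iteration that Session.process performs for you, assuming the reader's data supports len and slicing (the chunk size is illustrative):

CHUNK = 1 << 20  # illustrative chunk size
for start in range(0, len(foo.data), CHUNK):
    chunk = foo.data[start:start + CHUNK]  # read one chunk from the reader
    bar.write_part(chunk * 2)              # write the transformed chunk
bar.complete()                             # finalise the written field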
It is often sensible to write derived values to a second datastore. This approach allows you to easily separate your derived results from the source data, and it saves space, as your initial datastore can be reused across multiple implementations:
import h5py

with h5py.File(args.source, 'r') as src:
    with h5py.File(args.dest, 'w') as dest:  # recreate dest from scratch
        ...

with h5py.File(args.source, 'r') as src:
    with h5py.File(args.dest, 'r+') as dest:  # modify an existing dest in place
        ...