Skip to content

Intermediate Examples

Ben Murray edited this page May 7, 2021 · 13 revisions

Intermediate Examples (0.5 onwards)

Thinking in numpy

ExeTera provides a rich and growing set of operations that can be carried out directly on Fields rather than having to fetch the underlying data. There are still some times that you may need to fetch the underlying data directly. If you need to do so, it is important to understand that the underlying arrays returned by data are numpy arrays (excepting for indexed string fields which returns a list of strings). The most important thing to understand when working with the underlying data is that it is in a numpy ndarray and all of the good practices of working with numpy arrays applies. In particular, performance is massively affected by dropping out of numpy for any reason (iterating loops explicitly, for example).

Combining filters - not recommended

df # a dataframe containing the fields of interest

foo = df['foo']
bar = df['bar']


# terrible way - very slow as we are are reading each element from storage in turn

result = np.zeros(len(foo), dtype=bool)
for i in len(foo):
  result = foo.data[i] == 1 and bar.data[i] is False
df.create_numeric('result', 'bool').data.write(result)


# better but still slow as explicit iteration in numpy is discouraged

result = np.zeros(len(foo), dtype=bool)
foo_ = foo.data[:]
bar_ = bar.data[:]
for i in len(foo):
  result[i] = foo[r] == 1 and bar[r] is False
df.create_numeric('result', 'bool').data.write(result)

Combining filters - recommended

# we can make use of fields directly rather than fetching the underlying numpy arrays
# we recommend this approach in general

df # a dataframe containing the fields of interest

foo = df['foo']
bar = df['bar']
df['result'] = (foo == 1) & (bar == False)

# or just

df['result'] = (df['foo'] == 1) & (df['bar'] == False)


# fetching numpy arrays

foo_ = df['foo'].data[:]
bar_ = df['bar'].data[:]
df.create_numeric('result', 'bool').data.write((foo_ == 1) & (bar_ == False))

There are still circumstances in which it may be better to fetch the underlying numpy arrays. One such example is if you are checking the same value multiple times:

df # a dataframe containing the fields of interest

# one read of foo from storage

foo_ = df['foo'].data[:]
result = np.where(np.logical_or(foo_ == 4, foo_ == 3), True, False)
df.create_numeric('result', 'bool').data.write(result)


# two reads of foo from storage

df['result'] = (df['foo'] == 4) | (df['foo'] == 3)

Filtering

Filtering is performed through the use of the apply_filter function. This can be performed on individual fields or at a dataframe level. apply_filter applies the filter on data rows.

Filter a dataframe

Note, this operation is destructive. It will overwrite the contents of the existing dataframe in storage.

df = # get a dataframe from somewhere

# apply a filter to the dataframe

filt = df['foo'] > 4
df.apply_filter(filt)

Filter a dataframe, preserving the source dataframe

This operation creates a new dataframe and writes the filtered fields to it.

ds = # get a dataset from somewhere
df_foo = ds['foo']
df_bar = ds.create_dataframe('bar')
df_foo.apply_filter(df_foo['foobar'] > 4, df_bar)

Sorting

Putting your data into the most appropriate order is very important for scaling of complex operations. Certain operations in ExeTera require that the data is presented in sorted order in order to be able to run correctly:

  • ordered_merges
  • generation and application of spans

Changing sorted order can be done in one of two ways:

  1. session.sort_on
  2. session.dataset_sort_index followed by DataFrame.apply_index or Field.apply_index

Either way, you must specify how you want the fields to be sorted. This is done through selecting the fields to be sorted on. You can select one or more fields and the fields will be applied in order. For both methods, you can specify one or more fields on which the data should be sorted (for example, 'user_id' and 'entry_date').

Sorting with sort_on

Session.sort_on is provided for when you want to sort all of the fields in a dataframe. You can sort in-place or you can sort and add the resulted sorted fields to a destination dataframe

Sorting in place

Note: sorting in place is a destructive operation, as each dataframe is backed up by a dataset and this gets changed when the sorted order changes. You may prefer to write the sorted data to a new dataframe instead

    # sort in place
    ds = # a dataset from somewhere
    session.sort_on(ds['foo'], ds['foo'], ('a_key_name',))

Sorting to another dataframe

    # source to a destination group
    ds = # a dataset from somewhere
    ds.create_dataframe('bar')
    session.sort_on(ds['foo'], ds['bar'], ('a_key_name'))

Sorting with dataset_sort_index

When sorting with dataset_sort_index we first get the permutation of the current indices to the sorted order. We can then apply this to each of the fields that we want to reorder, as follows

    ds = # a dataset from somewhere

    index = session.dataset_sort_index((ds['foo']['a'],))

    # apply indices in place

    ds['foo'].apply_index(index)


    # apply indices to a destination dataframe

    ds.create_dataframe('bar')
    ds['foo'].apply_index(index, ds['bar'])

Joining / merging

ExeTera provides functions that provide pandas-like merge functionality on DataFrame instances. We have made this operation as familiar as possible to Pandas users, but there are a couple of differences that we should highlight:

  • merge is provided as a function in the dataframe unit, rather than as a member function on DataFrame instances
  • merge takes three dataframe arguments, left, right and dest. This is due to the fact that DataFrames are always backed up by a datastore and so rather than create an in-memory destination dataframe, the resulting merged fields must be written to a dataframe of your choosing.
    • Note, this can either be a separate dataframe or it can be the dataframe that you are merging to (typically left in the case of a "left" merge and right in the case of a "right" merge
  • merge takes a number of optional hint fields that can save time when working with large datasets. These specify whether the keys are unique or ordered and allow the merge to occur without first checking this
    • merge has a number of highly scalable algorithms that can be used when the key data is sorted and / or unique.
ds = # a dataset fetched from somewhere
left = # a dataframe
right = # another dataframe
dest = ds.create_dataframe('merged')
merge(left, right, dest, left_on='a_key_in_left', right_on='a_key_in_right', how='left')