Release Notes v0.5.5

Mario Juric edited this page Aug 28, 2017 · 1 revision

New in this version: user-defined functions

Summary

Sometimes it is useful to execute a function on the column(s) returned by a query. For example, given a function {{{equgal}}} that takes (ra, dec) and converts them to galactic longitude and latitude, one would hope to be able to write:

   SELECT ra, dec, equgal(ra, dec) as (l, b) FROM sdss

and obtain two new columns, l and b, computed on the fly from ra and dec. Or another example, of on-the-fly application of zero-points:

   SELECT psf_inst_mag + zp(mjd_obs) as recalib_mag FROM ps1_det

Or obtaining reddening values from SFD maps:

   SELECT ra, dec, SFD.EBV(l, b) as EBV FROM sdss

This capability, of having '''in-query callables''', has so far been implemented haphazardly in LSD. In this release, we formalize the requirements for in-query callables, and allow users to add their own (which we term '''user defined functions''', or '''UDFs''').

Installing

This release differs from the others in that it requires extra data files for some of the UDFs (those related to SFD reddening maps) to work. Download these from:

http://lsddb.org/data

and place them into LSD's {{{/usr/share/lsd/data}}} directory.

Examples

Let's jump straight into a few examples to illustrate how powerful in-query callables can be, and at the same time show how to use and write them.

Coordinate system conversions

[mjuric@pan src]$ lsd-query 'select sdss.ra, sdss.dec, galequ(l, b) from sdss'
 [1606 el.]# sdss.ra sdss.dec ra dec
 42.14322405   7.23169467  42.14320074   7.23165630
 42.14287171   7.22951768  42.14284840   7.22947931
 42.18076111   7.19101243  42.18073787   7.19097412
 42.16454659   7.24540300  42.16452326   7.24536466
 42.10970259   7.25325441  42.10967925   7.25321597
...

In the above query, the function galequ(l, b) computes and returns the equatorial coordinates corresponding to (l, b). Note how the returned columns are named {{{ra}}} and {{{dec}}}.

{{{galequ}}} is a built-in LSD routine. It is defined in the lsd.builtins package (in the lsd.builtins.misc module):

def galequ(l, b):
        # Appendix of Reid et al. (http://adsabs.harvard.edu/cgi-bin/bib_query?2004ApJ...616..872R)
        # This convention is also used by LAMBDA/WMAP (http://lambda.gsfc.nasa.gov/toolbox/tb_coordconv.cfm)
        angp = np.radians(192.859508333) #  12h 51m 26.282s (J2000)
        dngp = np.radians(27.128336111)  # +27d 07' 42.01" (J2000) 
        l0   = np.radians(32.932)
        ce   = np.cos(dngp)
        se   = np.sin(dngp)

        l = np.radians(l)
        b = np.radians(b)

        cb, sb = np.cos(b), np.sin(b)
        cl, sl = np.cos(l-l0), np.sin(l-l0)

        ra  = np.arctan2(cb*cl, sb*ce-cb*se*sl) + angp
        dec = np.arcsin(cb*ce*sl + sb*se)

        ra = np.where(ra < 0, ra + 2.*np.pi, ra)

        ra = np.degrees(ra)
        dec = np.degrees(dec)

        return NamedList(('ra', ra), ('dec', dec))

Defining UDFs in user modules

Assuming we have a module named 'blarg', with the following contents:

def neg(x):
        return -x

we can make its functions available to the query by listing it in the LSD_USER_MODULES environment variable:

[mjuric@pan src]$ LSD_USER_MODULES=blarg lsd-query 'select ra, neg(ra) from sdss'
 [1606 el.]# ra neg(ra)
 42.14322405 -42.14322405
 42.14287171 -42.14287171
 42.18076111 -42.18076111
 42.16454659 -42.16454659
 42.10970259 -42.10970259
 42.10996057 -42.10996057

Multiple modules, separated by colons, can be specified in LSD_USER_MODULES.
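
How a colon-separated module list like this might be consumed can be sketched in plain Python. The helper name {{{load_user_modules}}} and the choice to merge public names into a single dict are assumptions made for illustration; this is not LSD's actual loader. Standard-library modules stand in for user modules so the sketch runs anywhere.

```python
import importlib


def load_user_modules(env_value):
    """Import each module named in a colon-separated string and collect its public names.

    Hypothetical helper for illustration only; not part of LSD.
    """
    namespace = {}
    for modname in env_value.split(':'):
        modname = modname.strip()
        if not modname:
            continue  # tolerate empty entries such as 'a::b'
        module = importlib.import_module(modname)
        for name, obj in vars(module).items():
            if not name.startswith('_'):
                namespace[name] = obj
    return namespace


# 'math' and 'string' stand in for user modules such as 'blarg'.
ns = load_user_modules('math:string')
```

After this call, {{{ns['sqrt']}}} and {{{ns['ascii_letters']}}} are both available in the same flat namespace, which is the effect the environment variable has on a query.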

If we wanted the contents of the module to be available not in the global namespace but in a submodule, we'd add a {{{__lsd_name__}}} variable specifying the desired submodule name:

__lsd_name__ = 'mymod'

def neg(x):
        return -x

[mjuric@pan src]$ LSD_USER_MODULES=blarg lsd-query 'select ra, mymod.neg(ra) from sdss'
 [1606 el.]# ra mymod.neg(ra)
157.09519887 -157.09519887
157.31205699 -157.31205699
157.07354502 -157.07354502
157.45721585 -157.45721585
157.52399634 -157.52399634

The !FileTable and Map classes

Two useful classes for in-query callables, Map and !FileTable, are pre-defined by LSD. They can be used to map columns from the query to other values, given a mapping defined by two numpy arrays (Map) or a file (!FileTable).

For example, given a table of zero-points stored as pickled ndarray in a file named 'zeropoints.ndarray.pkl', we can construct a UDF that returns a zero-point given the MJD of observation of a given exposure as follows:

from lsd.builtins import FileTable
zp = FileTable('recalibs/zeropoints.ndarray.pkl').map('mjd', 'ZP')

The class !FileTable knows how to read FITS binary tables, pickled ndarrays and ColGroups, and text files (comma separated or whitespace delimited; anything that np.genfromtxt knows how to parse). The above creates a callable {{{zp}}} that, when passed a value, will return the {{{ZP}}} corresponding to that value of {{{mjd}}} in the table.
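
Conceptually, such a callable is a vectorized key-to-value lookup. The sketch below shows one way to build that with numpy alone; the {{{make_lookup}}} helper and the sample numbers are invented for illustration and say nothing about how {{{FileTable}}} or {{{Map}}} are actually implemented. It assumes every queried key is present in the table.

```python
import numpy as np


def make_lookup(keys, values):
    """Return a callable mapping an array of keys to the matching values.

    Hypothetical helper; assumes every looked-up key exists in `keys`.
    """
    keys = np.asarray(keys)
    values = np.asarray(values)
    order = np.argsort(keys)
    sorted_keys, sorted_values = keys[order], values[order]

    def lookup(x):
        # Binary search finds the position of each key in the sorted table.
        idx = np.searchsorted(sorted_keys, np.asarray(x))
        return sorted_values[idx]

    return lookup


# Made-up table: two exposure MJDs and their zero-points.
zp = make_lookup([55264.5, 55632.5], [28.6, 27.9])

mjd_obs = np.array([55632.5, 55264.5, 55632.5])
inst_mag = np.array([-13.5, -15.0, -11.0])
recalib_mag = inst_mag + zp(mjd_obs)   # one output row per input row
```

The returned array has the same number of rows as the input column, which is what lets the result be added directly to another column.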

Assuming the above is stored in a module named 'blarg', we can then query for the recalibrated magnitudes using:

[mjuric@pan src]$ LSD_USER_MODULES=blarg lsd-query 'select mjd_obs, psf_inst_mag + zp(mjd_obs) as recalib_mag from ps1_det'
 [43478 el.]# mjd_obs recalib_mag exp_id
55632.56761964  14.140
55632.56761964  16.104
55632.56761964  16.388
55632.56761964  19.028
55632.56761964  17.459
...

Alternatively, {{{lsd-query}}}'s {{{--define}}} switch could have been used to define the {{{zp}}} UDF without having to write it into a module:

[mjuric@pan src]$ lsd-query --define="zp=FileTable('zeropoints.ndarray.pkl').map('mjd', 'ZP')" 'select mjd_obs, psf_inst_mag + zp(mjd_obs) as recalib_mag from ps1_det'
 [43478 el.]# mjd_obs recalib_mag exp_id
55632.56761964  14.140
55632.56761964  16.104
55632.56761964  16.388
55632.56761964  19.028
55632.56761964  17.459
...

Next, we could have also used the short-hand !FileTable constructor syntax to make the above even less verbose:

[mjuric@pan src]$ lsd-query --define "zp=FileTable('recalibs/zeropoints.ndarray.pkl:mjd:ZP')" 'select mjd_obs, psf_inst_mag + zp(mjd_obs) as recalib_mag from ps1_det'
 [43478 el.]# mjd_obs recalib_mag
55264.39685403  11.492
55264.39685403  13.059
55264.39685403  10.994
55264.39685403  13.497
...

Finally, note that we could read more than one column from the file:

[mjuric@pan src]$ lsd-query --define "zpdata=FileTable('recalibs/zeropoints.ndarray.pkl:mjd:ZP,overlaps_sdss')" 'select mjd_obs, zpdata(mjd_obs), psf_inst_mag + ZP as recalib_mag from ps1_det'
 [43478 el.]# mjd_obs ZP overlaps_sdss recalib_mag
55264.39685403  28.647 1  11.492
55264.39685403  28.647 1  13.059
55264.39685403  28.647 1  10.994
55264.39685403  28.647 1  13.497
...

A couple of things happened here:

  • Using {{{--define}}}, we've defined a UDF named {{{zpdata}}} that returns two columns from the pickled file, {{{ZP}}} and {{{overlaps_sdss}}}, where the argument matches {{{mjd}}}.
  • We've then called it in the SELECT clause, passing it {{{mjd_obs}}}. This resulted in two new result columns, {{{ZP}}} and {{{overlaps_sdss}}} being generated by the UDF and made available to the query.
  • Finally, we've used the column {{{ZP}}} to compute the recalibrated magnitude, by doing {{{psf_inst_mag + ZP}}}.

If we only have a two-column text file that maps exp_ids to zero-points, we can write:

[mjuric@pan src]$ lsd-query --define "zp=FileTable('zps.txt', dtype='u8,f4').map()" 'select mjd_obs, psf_inst_mag + zp(exp_id) as recalib_mag from ps1_det'
 [43478 el.]# mjd_obs recalib_mag exp_id
55632.56761964  14.140
55632.56761964  16.104
55632.56761964  16.388
55632.56761964  19.028
55632.56761964  17.459
...

Note how, when no arguments are given to {{{FileTable.map}}}, it defaults to the first column for the key and the second for the value. In fact, we could omit the {{{.map()}}} call altogether:

[mjuric@pan src]$ lsd-query --define "zp=FileTable('zps.txt', dtype='u8,f4')" 'select mjd_obs, psf_inst_mag + zp(exp_id) as recalib_mag from ps1_det'
 [43478 el.]# mjd_obs recalib_mag exp_id
55632.56761964  14.140
55632.56761964  16.104
55632.56761964  16.388
55632.56761964  19.028
55632.56761964  17.459
...

Where are User Defined Functions and in-query callables defined?

The callables accessible from within a query may be defined in a number of places:

  • The database:: Database directories listed in LSD_DB will be searched for files named user.py. Their contents will be loaded and made available in the global namespace of each query, unless a top-level {{{__lsd_name__='name'}}} variable is defined, in which case the contents of that file will be loaded into a submodule 'name'. If the same name appears in more than one file, the one appearing earlier in the LSD_DB path takes precedence. It is envisioned that callables stored in $LSD_DB/user.py files are database-specific and of use to all users of the given database.
  • User modules:: The LSD_USER_MODULES environment variable holds a colon-separated list of modules to be loaded and imported into the query namespace. It is envisioned that the user will store their personal classes and routines here.
  • Built-in functions:: There are a number of built-in functions and classes that are made available within a query. These reside in the lsd.builtins package. They include !FileTable, Map, ffitskw, BLOB, OBJECT, and more. In addition to these, the numpy module is imported and made available in namespaces {{{np}}} and {{{numpy}}}.
  • Defined with {{{--define}}} or {{{-D}}} switches of {{{lsd-query}}}:: {{{lsd-query}}} allows the definition of UDFs on the command line. This is useful when defining a function includes one-time initialization (see the example with !FileTable).

Requirements for In-Query Callables

A callable that is to be called from within a query must:

  • accept at least one numpy ndarray as its input (a query column), and return an object with the same number of rows as the input
  • be pickleable
  • return either:
    1. a single ndarray. This will be interpreted as one column.
    2. a tuple of arrays. These will be interpreted as multiple columns.
    3. a !NamedList instance, constructed from a list of (name, column) tuples. These will be interpreted as multiple columns, with the given default names. The default names will be used unless overridden with the AS statement in the SELECT clause. Note that when !NamedList is iterated over, it will only return the columns, not the names; this allows for things such as 'SELECT galequ(*equgal(ra, dec)) FROM ...' to still function.
    4. a structured ndarray. The columns within the structured array will be interpreted as multiple columns, with their names kept, unless they're overridden by the AS statement in SELECT clause.
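
The iteration behavior described in item 3 is what makes the {{{*}}}-unpacking work. The class below is a hypothetical stand-in written for illustration (it is not LSD's {{{NamedList}}}, and {{{swap}}} and {{{total}}} are made-up functions); it only demonstrates the idea that iterating yields the columns while the names stay available separately.

```python
class NamedColumns:
    """Illustrative stand-in for a NamedList-like container (not LSD's class)."""

    def __init__(self, *pairs):
        # Each pair is (name, column), as in NamedList(('ra', ra), ('dec', dec)).
        self.names = [name for name, _ in pairs]
        self.columns = [col for _, col in pairs]

    def __iter__(self):
        # Iteration yields only the columns, never the names.
        return iter(self.columns)

    def __len__(self):
        return len(self.columns)


def swap(x, y):
    # Made-up two-column function that returns named columns.
    return NamedColumns(('second', y), ('first', x))


def total(a, b):
    # Made-up consumer taking two positional columns.
    return [u + v for u, v in zip(a, b)]


out = swap([1, 2], [10, 20])
result = total(*out)        # unpacking passes the columns only
```

Because the names never appear during iteration, the output of one multi-column function can be fed straight into another, while the names remain available as defaults for the result columns.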

LazyCreate and "heavy" UDFs

Some functions may use significant resources, typically memory. An example would be one performing reddening lookups in SFD'98 maps; it may do so by loading the entire map (~64M) into memory, performing the lookup, and returning the value. Since loading is expensive, an optimal implementation would perform it only once, on initialization. A typical design pattern for this is a class whose constructor performs the load, and whose {{{__call__}}} operator does the lookup:

class DustMap:
    def __init__(self, fn):
        # ... load the file ...
        pass

    def __call__(self, l, b):
        # ... perform the lookup ...
        pass

# Initialization
EBV = DustMap('ebv.fits')

# Usage
ebv = EBV(l, b)

The callable {{{EBV}}} as defined above may also be used as a UDF in a query. If the user knows they're going to be using it, this is a fine implementation.

However, if one is constructing a general purpose ''library'' of UDFs, an implementation such as the one above becomes problematic. Note that even if the UDF (in this case, {{{EBV}}}) is never actually used in a query, it will still get constructed and the map loaded into memory. One could tolerate this if there was only {{{EBV}}}, but there are more maps we could provide in a library, e.g. temperature, emissions in various bands, etc. These would quickly consume a significant fraction of memory, even if the user uses none (or just a few) of these functions.

The solution is to delay construction of {{{DustMap}}}-like objects until the first time they're accessed. This is where the {{{LazyCreate}}} class (defined in the {{{lsd.utils}}} module) comes in. The only thing required is to replace the {{{EBV}}} definition above with:

from lsd.utils import LazyCreate
EBV = LazyCreate(DustMap, 'ebv.fits')

!LazyCreate will instantiate the !DustMap object the first time {{{EBV}}} is accessed (called), and forward all calls/accesses to the newly created object. If {{{EBV}}} is never used, there will be no time/memory penalty associated with its existence. If it is used, it will be auto-created without the user ever knowing (except for the time it takes to construct the object on first use). !LazyCreate allows you to provide libraries of "heavy" UDFs, while still living up to the "you don't pay for what you don't use" creed.
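
The deferred-construction idea can be sketched in a few lines. The {{{LazySketch}}} class below is a hypothetical illustration, not the code in {{{lsd.utils}}}, and {{{FakeMap}}} is a made-up stand-in for an expensive object. The sketch stores the class and its constructor arguments and builds the real object only on the first call; unlike the class described above, it forwards calls only, not attribute accesses.

```python
class LazySketch:
    """Hypothetical lazy wrapper: builds the wrapped object on first call."""

    def __init__(self, cls, *args, **kwargs):
        self._cls = cls
        self._args = args
        self._kwargs = kwargs
        self._obj = None          # nothing constructed yet

    def __call__(self, *args, **kwargs):
        if self._obj is None:
            # First use: pay the construction cost now.
            self._obj = self._cls(*self._args, **self._kwargs)
        return self._obj(*args, **kwargs)


class FakeMap:
    """Made-up stand-in for an expensive map object."""
    loads = 0                      # counts how many times the "file" was loaded

    def __init__(self, fn):
        FakeMap.loads += 1
        self.fn = fn

    def __call__(self, l, b):
        return l + b               # placeholder for a real lookup


EBV = LazySketch(FakeMap, 'ebv.fits')
loads_before = FakeMap.loads       # still 0: nothing loaded at definition time
first = EBV(1.0, 2.0)              # triggers construction
second = EBV(3.0, 4.0)             # reuses the same object
```

Defining {{{EBV}}} costs nothing until it is called, and the load happens exactly once no matter how many calls follow.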

Registering UDFs from Python code

If you're using LSD's Python API, you can make any Python callable, class, or module available within the query by registering it with {{{DB.register_udf}}}. Example:

def neg(x):
        return -x

db=DB('db')
db.register_udf(neg, name='negative')

db.query('SELECT ra, negative(ra) FROM sdss')
...

If {{{name}}} is left unspecified, the name of the object will be used (its {{{__name__}}} attribute).