pattern.db
The pattern.db module contains wrappers for databases (SQLite, MySQL), Unicode CSV files and Python's datetime. It offers a convenient way to work with tabular data, for example retrieved with the pattern.web module.
It can be used by itself or with other pattern modules: web | db | en | search | vector | graph.
A database is a collection of tables. A table has rows of data with a specific data type (e.g., string, float) for each field or column. A database engine provides an interface to the database, using SQL statements (Structured Query Language). Python 2.5+ comes bundled with the SQLite engine. The MySQL engine requires the MySQL-Python bindings. Note that a 32-bit Python requires a 32-bit MySQL.
The Database() constructor creates (if necessary) and returns an SQLITE or MYSQL database. With SQLITE, it will create a file with the given name in the current folder.
db = Database(
name,
host = 'localhost',
port = 3306,
username = 'root',
password = '',
type = SQLITE
)
db.type # SQLITE | MYSQL
db.name # Database name.
db.host # Database host (MySQL).
db.port # Database port (MySQL).
db.username # Database username (MySQL).
db.password # Database password (MySQL).
db.tables # Dictionary of (name, Table)-items.
db.relations # List of relations, see Database.link().
db.query # Last executed SQL query.
db.connected # True after Database.connect().
db.connect() # Happens automatically.
db.disconnect()
db.create(table, fields=[])
db.remove(table)
db.link(table1, field1, table2, field2, join=LEFT)
db.execute(SQL, commit=False)
db.commit()
db.escape(value) # "a cat's tail" => "'a cat\'s tail'"
- Database.execute() returns an iterator of rows for the given SQL query.
- Database.commit() commits the changes of pending INSERT, UPDATE and DELETE queries.
- Database.escape() safely quotes and escapes field values.
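For SQLITE databases, pattern.db delegates to Python's built-in sqlite3 module. As a rough sketch of what connect, execute and commit correspond to at that level (the table and values are illustrative; sqlite3's "?" placeholders are its own safe alternative to escaping values by hand):

```python
# Sketch: the same connect / execute / commit flow with the built-in
# sqlite3 module, which pattern.db wraps for SQLITE databases.
import sqlite3

con = sqlite3.connect(':memory:')             # like Database(name, type=SQLITE)
con.execute('create table pets (name text)')  # like Database.create()
con.execute('insert into pets values (?)', ("a cat's tail",))
con.commit()                                  # like Database.commit()
rows = list(con.execute('select name from pets'))  # like Database.execute()
# rows == [("a cat's tail",)]
```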
Database.create() creates a new table in the database. It takes a table name and a list of row fields, where each field is defined with the field() function. Each field has a name (a-z + underscores) and a type, with an optional default value for new rows. The pk() function can be used for primary keys.
field(name, type=STRING, default=None, index=False, optional=True)
pk(name='id') # field('id', INTEGER, index=PRIMARY, optional=False)
*Type* | *Value* | *Example* |
--- | --- | --- |
`STRING` | `str`, `unicode` (1-255 characters) | `u'Schrödinger'` |
`INTEGER` | `int` | `42` |
`FLOAT` | `float` | `3.14159` |
`TEXT` | `str`, `unicode` | `open('file.txt').read()` |
`BLOB` | `str` (binary, e.g., PDF, PNG) | `db.binary(open('img.jpg', 'rb').read())` |
`BOOLEAN` | `bool` | `True`, `False` |
`DATE` | `Date` | `date('1999-12-31 23:59:59')` |
A STRING field can contain up to 100 characters. The length (1-255) can be changed by calling STRING as a function, e.g., type=STRING(255). For longer strings, use TEXT. The default value for a DATE field is NOW.

With index=True, the field is indexed for faster search. The index can also be set to UNIQUE (no duplicates) or PRIMARY. A table must have a primary key field that uniquely identifies each row (i.e., an id). Integer primary keys are auto-numbered; there is no need to set the value manually in new rows.

With optional=True, the field is allowed to contain None.
>>> from pattern.db import Database, field, pk, STRING, BOOLEAN, DATE, NOW
>>>
>>> db = Database('my_stuff')
>>> db.create('pets', fields=(
>>> pk(),
>>> field('name', STRING(80), index=True),
>>> field('type', STRING(20)),
>>> field('tail', BOOLEAN),
>>> field('date_birth', DATE, default=None),
>>> field('date_created', DATE, default=NOW)
>>> ))
>>> db.pets.append(name=u'Schrödinger', type='cat', tail=True)
>>> print db.pets.rows()[0]
(1, u'Schrödinger', u'cat', True, None, Date('2013-12-11 10:09:08'))
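The schema above corresponds roughly to the following SQL, sketched here with the built-in sqlite3 module (the exact statements pattern.db emits may differ):

```python
# Illustrative sqlite3 equivalent of the pets schema above; the exact
# SQL that pattern.db generates may differ.
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('''create table pets (
    id           integer primary key autoincrement,  -- pk()
    name         varchar(80),                        -- STRING(80)
    type         varchar(20),                        -- STRING(20)
    tail         boolean,                            -- BOOLEAN
    date_birth   timestamp default null,             -- DATE, default=None
    date_created timestamp default current_timestamp -- DATE, default=NOW
)''')
con.execute('create index pets_name on pets (name)')  # index=True
con.execute('insert into pets (name, type, tail) values (?, ?, ?)',
            (u'Schrödinger', 'cat', True))
row = next(con.execute('select id, name, type, tail from pets'))
# SQLite stores booleans as 0/1, so tail comes back as 1.
```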
Database.create() can also take a Table.xml or Query.xml. It creates a new table and copies the row data in the given XML string. An optional name parameter can be used to rename the new table. In Query.xml, a field name may contain a period. It will be replaced with an underscore (e.g., pets.name → pets_name). Alternatively, an alias can be defined in the Query.aliases dictionary.
A Table is a list of rows, with one or more fields (i.e., table columns) of a certain type (i.e., string or number). A new table can be created with Database.create(). A TableError is raised if a table with the given name already exists. An existing table can be retrieved with Database.tables[name], Database[name] or Database.<name>.
table = Database.tables[name]
table.db # Parent Database.
table.name # Table name (a-z + underscores).
table.fields # List of field names (i.e., columns).
table.schema # Dictionary of (field, Schema)-items.
table.default # Dictionary of (field, value)-items for new rows.
table.pk # Primary key field name.
table.count() # Total number of rows (len(table) also works).
table.rows() # List of rows, each a tuple of fields.
table.record(row) # Dictionary of (field, value)-items for given row.
table.append(fields={}, commit=True)
table.update(id, fields={}, commit=True)
table.remove(id, commit=True)
table.filter(*args, **kwargs)
table.search(*args, **kwargs)
table.xml # XML string with the table schema and rows.
table.datasheet # Datasheet object (see below).
- Table.rows() returns a list of all rows. To iterate rows memory-efficiently, use iter(Table).
- Table.append(), update() and remove() modify the table contents. With commit=False, changes are only committed after Database.commit() (= faster in batch).
- Table.filter() returns a subset of rows with a subset of fields. For example: table.filter('name', type='cat').
The Table.schema dictionary contains field name → Schema items.
schema = Table.schema[fieldname]
schema.name # Field name.
schema.type # STRING, INTEGER, FLOAT, TEXT, BLOB, BOOLEAN, DATE
schema.length # STRING field length.
schema.default # Default value.
schema.index # PRIMARY | UNIQUE | True | False
schema.optional # True or False.
>>> from pattern.db import Database
>>>
>>> db = Database('my_stuff')
>>>
>>> print db.pets.fields
>>> print db.pets.schema['name'].type
>>> print db.pets.schema['name'].length
['id', 'name', 'type', 'tail', 'date_birth', 'date_created']
STRING
80
Table.append() adds a new row with the given field values. It returns the row id, if the table has a primary key generated with pk(). Field values can be given as optional parameters, a dictionary or a tuple. Field values for a BLOB field must be wrapped in Database.binary().
>>> db.pets.append(name=u'Schrödinger', date_birth=date('2009-08-12'))
>>> db.pets.append({'name': u'Schrödinger', 'date_birth': date('2009-08-12')})
>>> db.pets.append((u'Schrödinger', 'cat', True, date('2009-08-12'))) # in-order
Table.update() updates values in the row with the given primary key. A batch of rows can be updated using a filter, or a chain of filters with any() or all(). In the last example, all rows with type='cat' will have their tail field set to True.
>>> db.pets.update(1, type='cat') # set type='cat' in row with id=1.
>>> db.pets.update(1, {'type': 'cat'})
>>> db.pets.update(eq('type', 'cat'), tail=True)
Table.remove() removes the row with the given primary key:
>>> db.pets.remove(1)
>>> db.pets.remove(ALL)
>>> db.pets.remove(all(eq('type', 'cat'), lt(year('date_birth'), 1990)))
The last example removes all rows that have type='cat'
AND year of birth before 1990.
Table.filter() returns a list of rows filtered by field value(s), where each row is a tuple of fields. The first parameter defines which fields to return. It can be a single field name, a list of field names or ALL. The following parameters are optional and define field constraints. They can also be given as a dictionary:
>>> db.pets.filter('name') # all rows, name
>>> db.pets.filter(('id', 'name')) # all rows, name + id
>>> db.pets.filter(ALL, type='cat') # type='cat', all fields
>>> db.pets.filter(ALL, type=('cat', 'dog')) # type='cat' OR type='dog'
>>> db.pets.filter(ALL, type='*at') # type='cat' OR 'hat' OR 'brat', ...
>>> db.pets.filter(ALL, type='cat', tail=True) # type='cat' AND tail=True
>>> db.pets.filter('id', {'type': 'cat', 'tail': True})
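Under the hood, a wildcard value such as '*at' is matched with SQL LIKE. A hypothetical one-line helper illustrates the translation (not the actual pattern.db internals):

```python
# Hypothetical helper showing how a wildcard value maps to a SQL LIKE
# pattern ('*' becomes '%'); the real pattern.db internals may differ.
def wildcard_to_like(value):
    return value.replace('*', '%')

assert wildcard_to_like('*at') == '%at'        # type='*at'  => LIKE '%at'
assert wildcard_to_like('Fluff*') == 'Fluff%'  # 'Fluff*'    => LIKE 'Fluff%'
```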
More complex queries can be constructed with a Query. Table.search() returns a new Query with options for filtering, sorting and ordering rows by field value(s). It can include fields from other, related tables.
query = Table.search(
fields = ALL,
filters = [],
relations = [],
sort = None,
order = ASCENDING,
group = None,
function = FIRST,
range = None
)
query.table # Parent Table.
query.fields # Field name, list of field names, or ALL.
query.aliases # Dictionary of (field name, alias)-items.
query.filters # List of filter() objects.
query.relations # List of rel() objects.
query.sort # Field name or list of field names.
query.order # ASCENDING | DESCENDING
query.group # Field name or list of field names.
query.function # FIRST, LAST, COUNT, MIN, MAX, SUM, AVG, CONCATENATE
query.range # (start, stop)-tuple, e.g. rows 11-20.
query.sql() # SQL string, can be used with Database.execute().
query.rows() # List of rows, each a tuple of fields.
query.record(row) # Dictionary of (field, value)-items for given row.
query.xml # XML string with the query schema and rows.
To iterate rows memory-efficiently, use iter(Query) instead of Query.rows().

The filter() function creates a field-value constraint that matches certain rows in a table. A list of filters can be passed to the filters parameter of a Query.
filter(field, value, comparison='=')
*Comparison* | *Description* | *Example* | *Alias* |
--- | --- | --- | --- |
`=` | equal to | `filter('type', ('cat', 'dog'), '=')` | `eq()` |
`i=` | equal to (case-insensitive) | `filter('name', 'tig*', 'i=')` | `eqi()` |
`!=` | not equal to | `filter('name', '*y', '!=')` | `ne()` |
`>` | greater than | `filter('weight', 10, '>')` | `gt()` |
`<` | less than | `filter('weight', 10, '<')` | `lt()` |
`>=` | greater than or equal to | `filter(year('date'), 1999, '>=')` | `gte()` |
`<=` | less than or equal to | `filter(year('date'), 2002, '<=')` | `lte()` |
`:` | between (inclusive) | `filter(year('date'), (1999, 2002), ':')` | `rng()` |
The field name of a DATE field can be passed to the year(), month(), day(), hour(), minute() or second() function. The short aliases of filter() have a preset comparison operator.

Filters can be chained together. The all() function returns a list with AND logic. The any() function returns a list with OR logic. In the example below, the first query matches all cats named Taxi. The second and third query match any pet that is a cat OR that is named Taxi.
all(filter1, filter2, ...) # Rows must match ALL of the filters.
any(filter1, filter2, ...) # Rows must match ANY of the filters.
>>> from pattern.db import Database, eq, all, any
>>>
>>> db = Database('my_stuff')
>>>
>>> db.pets.search(filters=all(eq('name', 'Taxi'), eq('type', 'cat')))
>>> db.pets.search(filters=any(eq('name', 'Taxi'), eq('type', 'cat')))
>>> db.pets.search(filters=any(name='Taxi', type='cat'))
Lists created with all() and any() can be nested to define complex search criteria. The example below matches all pets that are cats, and whose name starts with Fluff- OR ends with a -y:
>>> f = any(eq('name', 'Fluff*'), eq('name', '*y')) # OR
>>> f = all(eq('type', 'cat'), f) # AND
>>>
>>> for row in db.pets.search(filters=f):
>>> print row
The syntax can even be more concise:
>>> for row in db.pets.search(filters=all(name=('Fluff*', '*y'), type='cat')):
>>> print row
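Conceptually, nested any() and all() lists compose into a parenthesized SQL WHERE clause. A minimal sketch with illustrative helper names (not the pattern.db internals):

```python
# Minimal sketch of how nested any()/all() filters compose into a
# parenthesized WHERE clause; helper names are illustrative, not the
# actual pattern.db internals.
def _any(*conditions):
    return '(%s)' % ' or '.join(conditions)   # OR logic

def _all(*conditions):
    return '(%s)' % ' and '.join(conditions)  # AND logic

f = _any("name like 'Fluff%'", "name like '%y'")
f = _all("type='cat'", f)
# f == "(type='cat' and (name like 'Fluff%' or name like '%y'))"
```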
The rel() function defines a relation between two fields in different tables (usually id's).
rel(field1, field2, table, join=LEFT) # LEFT | INNER
The optional join parameter defines how rows are matched. LEFT takes all rows from the base table, with additional fields from the related table. For a row with no match between field1 and field2, these fields have value None. INNER takes the subset of rows that have a match between field1 and field2.
A well-known example is a database app that processes invoices. Say we have a products table and an orders table. Each order has a product id – instead of all product details. Each product id can occur in multiple orders. This approach is called database normalization. It avoids duplicate data. To generate an invoice, we can combine product details and order details using a query relation.
The following example demonstrates a simple products + customers + orders database app:
products

id | name | price |
--- | --- | --- |
1 | pizza | 15 |
2 | garlic bread | 3 |

customers

id | name |
--- | --- |
1 | Schrödinger |
2 | Hofstadter |

orders

id | product | customer |
--- | --- | --- |
1 | 1 | 2 |
>>> from pattern.db import Database, field, pk, INTEGER as I
>>>
>>> db = Database('pizza_delivery')
>>>
>>> db.create( 'products', (pk(), field('name'), field('price', I)))
>>> db.create('customers', (pk(), field('name')))
>>> db.create( 'orders', (pk(), field('product', I), field('customer', I)))
Add products and customers. Pizza delivery is open for business!
>>> db.products.append(name='pizza', price=15)
>>> db.products.append(name='garlic bread', price=3)
>>> db.customers.append(name=u'Schrödinger')
>>> db.customers.append(name=u'Hofstadter')
Hofstadter orders a pizza.
>>> db.orders.append(product=1, customer=2)
An orders query with relations to products and customers generates a human-readable invoice:
>>> from pattern.db import Database, rel
>>>
>>> db = Database('pizza_delivery')
>>>
>>> f = ('orders.id', 'customers.name', 'products.name', 'products.price')
>>> q = db.orders.search(f, relations=(
>>> rel('orders.customer', 'customers.id', 'customers'),
>>> rel('orders.product', 'products.id', 'products'))
>>> )
>>> for row in q:
>>> print q.record(row)
{ 'orders.id' : 1,
'customers.name' : u'Hofstadter',
'products.name' : u'pizza',
'products.price' : 15 }
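The invoice query above corresponds roughly to a SQL LEFT JOIN. A self-contained sketch with the built-in sqlite3 module (illustrative SQL, not necessarily what Query.sql() returns):

```python
# Illustrative sqlite3 LEFT JOIN equivalent of the invoice query.
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript('''
    create table products  (id integer primary key, name text, price integer);
    create table customers (id integer primary key, name text);
    create table orders    (id integer primary key, product integer, customer integer);
    insert into products  values (1, 'pizza', 15), (2, 'garlic bread', 3);
    insert into customers values (1, 'Schrödinger'), (2, 'Hofstadter');
    insert into orders    values (1, 1, 2);
''')
invoice = list(con.execute('''
    select orders.id, customers.name, products.name, products.price
      from orders
      left join customers on orders.customer = customers.id
      left join products  on orders.product  = products.id'''))
# invoice == [(1, u'Hofstadter', u'pizza', 15)]
```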
If a relation is used repeatedly, define it once with Database.link(). It will be available in every Query.
A Query has an optional group parameter that can be used to merge rows on duplicate field values. The given function is applied to the other fields. It can also be a list with a function for each field.
*Function* | *Field type* | *Description* |
--- | --- | --- |
`FIRST` | any | The first row field in the group. |
`LAST` | any | The last row field in the group. |
`COUNT` | any | The number of rows in the group. |
`MIN` | `INTEGER` + `FLOAT` | The lowest field value in the group. |
`MAX` | `INTEGER` + `FLOAT` | The highest field value in the group. |
`SUM` | `INTEGER` + `FLOAT` | The sum of all field values in the group. |
`AVG` | `INTEGER` + `FLOAT` | The average of all field values in the group. |
`STDEV` | `INTEGER` + `FLOAT` | The standard deviation (= variation from average). |
`CONCATENATE` | `STRING` | Joins all field values with a comma. |
For example, to get the total revenue per ordered product:
>>> print db.orders.search(
>>> fields = ('products.name', 'products.price'),
>>> relations = rel('product', 'products.id', 'products'),
>>> group = 'products.name', # Merge orders with same product name.
>>> function = SUM # Sum of product prices.
>>> ).rows()
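In plain SQL, this grouping corresponds to GROUP BY with SUM. An illustrative sqlite3 sketch with hard-coded example rows:

```python
# Illustrative sqlite3 GROUP BY equivalent: two pizza orders and one
# garlic bread order, summed per product name.
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript('''
    create table orders (product text, price integer);
    insert into orders values ('pizza', 15), ('pizza', 15), ('garlic bread', 3);
''')
revenue = list(con.execute(
    'select product, sum(price) from orders group by product order by product'))
# revenue == [('garlic bread', 3), ('pizza', 30)]
```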
A Datasheet is a matrix of rows and columns, where each row and column can be retrieved as a list. The data can be imported or exported as a CSV file. Optionally, the given fields is a list of (name, type) headers, where type can be STRING, TEXT, INTEGER, FLOAT, BOOLEAN, BLOB or DATE.
datasheet = Datasheet(rows=[], fields=None)
datasheet = Datasheet.load(path, separator=',', decoder=lambda v: v, headers=False)
datasheet.rows # List of rows (each row = list of values).
datasheet.columns # List of columns (each column = list of values).
datasheet.fields # List of (name, type) column headers.
datasheet.<field> # List of column values.
datasheet[i] # Row at index i.
datasheet[i, j] # Value in row i at column j.
datasheet[i1:i2, j] # Slice of column j from rows i1-i2.
datasheet[i, j1:j2] # Slice of columns j1-j2 from row i.
datasheet[i1:i2, j1:j2] # Datasheet with columns j1-j2 from rows i1-i2.
datasheet[:] # Datasheet copy.
datasheet.insert(i, row, default=None)
datasheet.append(row, default=None)
datasheet.extend(rows, default=None)
datasheet.copy(rows=ALL, columns=ALL)
datasheet.group(j, function=FIRST, key=lambda v: v)
datasheet.save(path, separator=',', encoder=lambda v: v, headers=False)
datasheet.json # JSON-formatted string.
- Datasheet.insert() and append() fill missing columns with the default value.
- Datasheet.columns.insert() and append() fill missing rows with the default value. An optional field parameter can be used to supply a (name, type) column header.
- Datasheet.copy() returns a new Datasheet from a selective list of row and/or column indices.
- To rotate a datasheet 90 degrees, use datasheet = flip(datasheet).
For example:
>>> from pattern.db import Datasheet
>>>
>>> ds = Datasheet()
>>> ds.append((u'Schrödinger', 'cat'))
>>> ds.append((u'Hofstadter', 'cat'))
>>> ds.save('pets.csv')
>>>
>>> ds = Datasheet.load('pets.csv')
>>> print ds
[[u'Schrödinger', 'cat'],
[ u'Hofstadter', 'cat']]
Datasheet.group(j) returns a new Datasheet with unique values in column j. It merges rows using a given function that takes a list of column values and returns a single value. Predefined functions are FIRST, LAST, COUNT, MIN, MAX, SUM, AVG, STDEV and CONCATENATE. It can also be a list of functions.
The optional key can be used to compare the values in column j. For example, lambda date: date.year groups a column of Date objects by year.
>>> from pattern.db import Datasheet, pprint
>>>
>>> ds = Datasheet(rows=[
>>> (1, u'Schrödinger', 'cat'),
>>> (2, u'Hofstadter', 'cat'),
>>> (3, u'Taxi', 'dog')
>>> ])
>>>
>>> g = ds.copy(columns=[2, 0]) # A copy with type & id.
>>> g = g.group(0, COUNT) # Group type, count rows per type.
>>> pprint(g, fill='')
cat 2
dog 1
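Grouping with COUNT can be sketched in a few lines of plain Python (illustrative only, not the Datasheet implementation):

```python
# Illustrative pure-Python sketch of grouping column j with COUNT;
# not the actual Datasheet.group() implementation.
def group_count(rows, j):
    counts = {}
    for row in rows:
        counts[row[j]] = counts.get(row[j], 0) + 1
    return sorted(counts.items())

rows = [(1, u'Schrödinger', 'cat'),
        (2, u'Hofstadter', 'cat'),
        (3, u'Taxi', 'dog')]
# group_count(rows, 2) == [('cat', 2), ('dog', 1)]
```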
Datasheet.columns[j].sort() sorts the rows according to the values in column j. Datasheet.columns.sort() can be used to change the column order:
>>> ds.columns.sort(order=[0, 2, 1])
>>> pprint(ds, fill='')
1 cat Schrödinger
2 cat Hofstadter
3 dog Taxi
Datasheet.columns.swap(j1, j2) swaps the two columns with the given indices.
Datasheet.save() exports the matrix as a CSV file. Datasheet.load() returns a Datasheet from a given CSV file. CSV (comma-separated values) is a simple text format for tabular data, where each line is a row and each value is separated by a comma.
datasheet = Datasheet.load(path, separator=',', decoder=lambda v: v, headers=False)
datasheet.save(path, separator=',', encoder=lambda v: v, headers=False)
On export, all str, int, float, bool and Date values are converted to Unicode. An encoder can be given for other data types. On import, all values in the datasheet will be Unicode unless a decoder is given.
With headers=True, the Datasheet.fields headers are exported and imported (first line in the CSV). In this case, the data type for each column (STRING, INTEGER, FLOAT, BOOLEAN or DATE) is explicitly known and no encoder or decoder is needed.
>>> from pattern.db import Datasheet, STRING, DATE, date
>>>
>>> ds = Datasheet(fields=(('name', STRING), ('date', DATE)))
>>> ds.append((u'Schrödinger', date('1887-08-12')))
>>> ds.append((u'Hofstadter', date('1945-02-15')))
>>>
>>> ds.save('pets.csv', headers=True)
>>>
>>> ds = Datasheet.load('pets.csv', headers=True)
>>> print ds[0]
[u'Schrödinger', Date('1887-08-12 00:00:00')]
The csv() function can also be used instead of Datasheet.load():
>>> from pattern.db import csv
>>>
>>> for name, date in csv('pets.csv', separator=',', headers=True):
>>> print name, date
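Outside pattern.db, Python's standard csv module covers the same round-trip; Datasheet adds typed (name, type) headers on top. A sketch using an in-memory buffer instead of 'pets.csv':

```python
# Sketch of the same CSV round-trip with the standard csv module,
# using an in-memory buffer instead of a file on disk.
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerows([(u'Schrödinger', 'cat'), (u'Hofstadter', 'cat')])

buf.seek(0)
rows = [tuple(r) for r in csv.reader(buf)]
# rows == [(u'Schrödinger', u'cat'), (u'Hofstadter', u'cat')]
```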
The date() function returns a new Date, a convenient subclass of Python's datetime.datetime. It takes an integer (Unix timestamp), a string or NOW. An optional string input format and output format can be given (e.g., "%d/%m/%y"). The default output format is "YYYY-MM-DD hh:mm:ss".
d = date(int)
d = date(NOW, format=DEFAULT)
d = date(string)
d = date(string, format=DEFAULT)
d = date(string, inputformat, format=DEFAULT)
d = date(year, month, day, format=DEFAULT)
d = date(year, month, day, hours, minutes, seconds, format=DEFAULT)
d.year
d.month # 1-12
d.week # 1-52
d.weekday # 1-7
d.day # 1-31
d.minute # 0-59
d.second # 0-59
d.timestamp # Seconds elapsed since 1/1/1970.
If no string input format is given, a number of common formats will be tried:
*Format* | *Example* |
--- | --- |
`%Y-%m-%d %H:%M:%S` | 2010-09-21 09:27:01 |
`%a, %d %b %Y %H:%M:%S %z` | Tue, 9 Sep 2010 17:58:28 +0000 |
`%Y-%m-%dT%H:%M:%SZ` | 2010-09-20T09:27:01Z |
`%Y-%m-%dT%H:%M:%S+0000` | 2010-09-20T09:27:01+0000 |
`%Y-%m-%d %H:%M` | 2010-09-20 09:27 |
`%Y-%m-%d` | 2010-09-20 |
`%d/%m/%Y` | 20/09/2010 |
`%d %B %Y` | 9 september 2010 |
`%B %d %Y` | September 9 2010 |
`%B %d, %Y` | September 09, 2010 |
All date formats used in pattern.web (e.g., Twitter search results) are automatically detected. For an overview of date format syntax, see: http://docs.python.org/library/time.html#time.strftime.
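The "try a number of common formats" strategy can be sketched with datetime.strptime (the formats listed here are only a subset of the table above):

```python
# Sketch of the "try common formats" strategy with datetime.strptime.
from datetime import datetime

FORMATS = ('%Y-%m-%d %H:%M:%S', '%Y-%m-%d', '%d/%m/%Y')

def parse_date(s, formats=FORMATS):
    for f in formats:
        try:
            return datetime.strptime(s, f)
        except ValueError:
            pass  # Try the next format.
    raise ValueError('unknown date format: %r' % s)

d = parse_date('20/09/2010')
# (d.year, d.month, d.day) == (2010, 9, 20)
```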
Date calculations
The time() function can be used to add time to or subtract time from a Date:
time(days=0, seconds=0, minutes=0, hours=0)
>>> from pattern.db import date, time
>>>
>>> d = date('23 august 2011')
>>> d += time(days=2, hours=5)
>>> print type(d)
>>> print d
>>> print d.year, d.month, d.day
<class 'pattern.db.Date'>
2011-08-25 05:00:00
2011 8 25
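Since Date subclasses datetime.datetime, time() behaves much like datetime.timedelta. The plain-datetime equivalent of the example above:

```python
# Plain-datetime equivalent of the example above: time(days=2, hours=5)
# behaves like datetime.timedelta(days=2, hours=5).
from datetime import datetime, timedelta

d = datetime(2011, 8, 23) + timedelta(days=2, hours=5)
# (d.year, d.month, d.day, d.hour) == (2011, 8, 25, 5)
```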