ANN: Shipyard 0.02

Sun Oct 19 19:07:59 CEST 2008

I'm happy to announce version 0.02 of the Shipyard Python module
<http://www.florian-diesch.de/software/shipyard/>

=================
What is shipyard?
=================

Shipyard is a module to process data in a format inspired by email 
headers (RFC 2822).

The goal of shipyard is to provide a simple, human-readable and
human-writable replacement for CSV that works better for long data and
many rows and doesn't need complicated escaping rules for special characters.

It's called ``shipyard`` because that word contains ``py`` and doesn't
seem to be taken yet.

===========
File format
===========

Character encoding
==================

A character encoding can be specified, similarly to :pep:`0263`, using::

  # -*- coding: <encoding name> -*-

in the first line. ``#`` is replaced with the actual `comment`_ mark.

More precisely, the first line must match the regular
expression::

  ^#.*coding[:=]\s*([-\w.]+)

Again, ``#`` is replaced by the actual `comment`_ mark. The first group
of this expression is then interpreted as the encoding name.
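
As a rough illustration of the rule above, here is a small sketch that
checks a first line against that expression with Python's ``re`` module.
The helper name ``detect_encoding`` and its ``comment_mark`` parameter are
made up for this example; they are not part of Shipyard's API.

```python
import re

def detect_encoding(first_line, comment_mark="#"):
    """Return the encoding named in first_line, or None if there is none.

    Hypothetical helper: it only mirrors the documented regular expression.
    """
    # Put the (escaped) comment mark where the expression has '#'.
    pattern = r"^" + re.escape(comment_mark) + r".*coding[:=]\s*([-\w.]+)"
    match = re.match(pattern, first_line)
    return match.group(1) if match else None

print(detect_encoding("# -*- coding: utf-8 -*-"))            # utf-8
print(detect_encoding("; coding=latin-1", comment_mark=";"))  # latin-1
print(detect_encoding("name: value"))                         # None
```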

Data set
========

A *data set* consists of zero or more `records <#record>`__ separated
by one or more empty lines.

Comment
=======
Lines starting with the *comment mark* (default: ``#``) are 
ignored. Comments can be used in or between `records <#record>`__.

Record
======
A *record* consists of one or more `fields <#field>`__.
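
To make the three definitions above more concrete, here is a small sketch
that groups the lines of a text into records using only the rules stated
so far: empty lines separate records, and comment lines are dropped. The
function name ``split_records`` and the sample text are invented for this
example. This is not how Shipyard itself is implemented, and continuation
lines are passed through untouched.

```python
def split_records(text, comment_mark="#"):
    """Group non-comment lines into records separated by empty lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(comment_mark):
            continue              # comments are ignored, even inside a record
        if line == "":
            if current:           # one or more empty lines end a record
                records.append(current)
                current = []
        else:
            current.append(line)
    if current:                   # the last record may lack a trailing empty line
        records.append(current)
    return records

# Made-up sample data with a comment between and inside records.
sample = "# laureates\nname: A\nyear: 2008\n\n\nname: B\n# note\nyear: 2008\n"
print(split_records(sample))
```

Two consecutive empty lines do not produce an empty record here, which
matches "one or more empty lines" in the definition of a data set.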

Field
=====

A *field* is a line that has the form::

   key: value

*key* is a string that
   - doesn't contain a colon
   - doesn't start with the `comment`_ mark
   - doesn't start with the `continuation`_ mark

*value* is an arbitrary string. It can span multiple lines using
`continuation`_ marks.
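
The rules for *key* mean that a field line can be split at its first
colon. A minimal sketch of that idea follows; ``parse_field`` is a
hypothetical helper, and stripping the whitespace around the value is an
assumption of this example rather than documented Shipyard behaviour.

```python
def parse_field(line, comment_mark="#", continuation_mark=" "):
    """Split a 'key: value' line into (key, value); hypothetical helper."""
    if line.startswith(comment_mark) or line.startswith(continuation_mark):
        raise ValueError("not a field line: %r" % line)
    # The key cannot contain a colon, so the first colon ends the key.
    key, colon, value = line.partition(":")
    if not colon:
        raise ValueError("missing colon: %r" % line)
    # Assumption: whitespace around the value is not significant.
    return key, value.strip()

print(parse_field("name: Roger Y. Tsien"))
print(parse_field("url: http://example.org/"))  # colons in the value are fine
```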

Continuation
============
If a line starts with the *continuation mark* (default: " " [one blank])
it gets appended to the preceding line, with the 
continuation mark removed.
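
Read literally, this rule can be sketched in a few lines. The function
``join_continuations`` is an invented name for this example and only
follows the description above; it says nothing about how the real parser
or its ``keep_linebreaks`` option work.

```python
def join_continuations(lines, continuation_mark=" "):
    """Append each continuation line to the line that precedes it."""
    merged = []
    for line in lines:
        if merged and line.startswith(continuation_mark):
            # Remove the mark itself, then append the rest unchanged.
            merged[-1] += line[len(continuation_mark):]
        else:
            merged.append(line)
    return merged

print(join_continuations(["text: spam", " eggs"]))    # ['text: spameggs']
print(join_continuations(["text: spam", "  eggs"]))   # ['text: spam eggs']
```

Because only the mark is removed and nothing is put in its place, a value
that should keep a space between two lines needs a second blank in this
sketch.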

=====
Usage
=====

Obviously we need to import shipyard:
   >>> import shipyard

First we open the file:
   >>> input = open('nobel.sy')

Then we create a parser object:
   >>> reader = shipyard.Parser(keep_linebreaks=False,
   ...                          keys=['id', 'discipline', 'year',
   ...                                'name', 'country', 'rationale'])

For every record, the given keys are initialized with ``None``.

Now we can iterate through the records:

   >>> for record in reader.parse(input):    # doctest:+ELLIPSIS
   ...     print record['country']
   United States
   Japan
   United States
   ...

Instead of iterating we may want to get a list of dicts:
   >>> input.seek(0)
   >>> lod = reader.get_list(input)
   >>> print lod     # doctest:+ELLIPSIS
   [{u'discipline': u'Chemistry', u'name': u'Martin Chalfie', ...}, {u'discipline': u'Chemistry', u'name': u'Osamu Shimomura', ...}, ...]

Sometimes we need a dict of dicts (using the 'id' field as key):
   >>> input.seek(0)
   >>> dod = reader.get_dict(input, key='id')
   >>> print dod.keys()
   [u'11', u'10', u'1', u'0', u'3', u'2', u'5', u'4', u'7', u'6', u'9', u'8']
   >>> print dod[u'5'][u'rationale']
   for the discovery of the mechanism of spontaneous brokensymmetry in subatomic physics

If we don't want dicts we can use the 'factory' parameter:
   >>> input.seek(0)
   >>> los = reader.get_list(input, factory = lambda **keys: ', '.join(keys.values()))
   >>> print los[0]
   Chemistry, Martin Chalfie, United States, for the discovery and development of the green fluorescentprotein, GFP, 2008, 0

Of course a class works as a factory, too:
   >>> input.seek(0)
   >>> class Laureate(object):
   ...     def __init__(self, id, discipline, year, name, country, rationale):
   ...         self.name = name
   >>> doo = reader.get_dict(input, key='id', factory = Laureate)
   >>> print doo[u'2']      # doctest:+ELLIPSIS
   <Laureate object at ...>
   >>> print doo[u'2'].name
   Roger Y. Tsien

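The examples above suggest that a factory is simply called with the
fields of a record as keyword arguments, after the configured keys have
been set to ``None``. The sketch below imitates that idea without using
shipyard at all; ``make_records`` is an invented name, and the second
sample record deliberately lacks a country to show the ``None`` default.

```python
def make_records(raw_records, keys, factory=dict):
    """Fill in missing keys with None, then pass each record to factory."""
    result = []
    for raw in raw_records:
        record = dict.fromkeys(keys)   # every expected key starts as None
        record.update(raw)
        result.append(factory(**record))
    return result

class Laureate(object):
    def __init__(self, name, country):
        self.name = name
        self.country = country

# Made-up input: the second record has no 'country' field.
raw = [{"name": "Osamu Shimomura", "country": "Japan"},
       {"name": "Martin Chalfie"}]
people = make_records(raw, keys=["name", "country"], factory=Laureate)
print(people[0].country)   # Japan
print(people[1].country)   # None
```

With the default ``factory=dict`` the same function returns plain dicts.
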
Now let's write a Shipyard file.

First we create a StringIO (any other file-like object will do, too):
   >>> import StringIO
   >>> output = StringIO.StringIO()

Next we need a Writer object:
   >>> writer = shipyard.Writer(keys=('foo', 'bar'), coding='utf-8')

Now we can use write() to write a single record:
   >>> writer.write(output, {'foo': 1, 'bar': 2})
   >>> print output.getvalue()
   foo: 1
   bar: 2
   <BLANKLINE>
   <BLANKLINE>

Using write_many() we can write a list of records:
   >>> output = StringIO.StringIO()
   >>> d = [dict((('foo', i), ('bar', 2*i))) for i in range(3)]
   >>> writer.write_many(output, d)
   >>> print output.getvalue()
   foo: 0
   bar: 0
   <BLANKLINE>
   foo: 1
   bar: 2
   <BLANKLINE>
   foo: 2
   bar: 4
   <BLANKLINE>
   <BLANKLINE>

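Judging from the output above, each record is written as its
``key: value`` lines followed by one empty line. A stand-alone sketch of
that layout could look like this; ``format_record`` is a hypothetical
function, and turning the extra lines of a multi-line value into
continuation lines is an assumption based on the file format description,
not on the Writer's documented behaviour.

```python
def format_record(record, keys, continuation_mark=" "):
    """Render one record as 'key: value' lines plus a closing empty line."""
    lines = []
    for key in keys:
        parts = str(record[key]).split("\n")
        lines.append("%s: %s" % (key, parts[0]))
        # Assumption: further lines of a value become continuation lines.
        for part in parts[1:]:
            lines.append(continuation_mark + part)
    return "\n".join(lines) + "\n\n"

print(format_record({"foo": 1, "bar": 2}, keys=("foo", "bar")))
print(format_record({"text": "two\nlines"}, keys=("text",)))
```
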
To get an encoding line we use write_coding():
   >>> output = StringIO.StringIO()
   >>> writer.write_coding(output)
   >>> print output.getvalue()
   #-*- coding: utf-8 -*-
   <BLANKLINE>
   <BLANKLINE>

Now let's do everything at once using write_full():
   >>> output = StringIO.StringIO()
   >>> writer.write_full(output, d)
   >>> print output.getvalue()
   #-*- coding: utf-8 -*-
   <BLANKLINE>
   foo: 0
   bar: 0
   <BLANKLINE>
   foo: 1
   bar: 2
   <BLANKLINE>
   foo: 2
   bar: 4
   <BLANKLINE>
   <BLANKLINE>

   Florian
-- 
<http://www.florian-diesch.de/>