suggestions for VIN parsing

Mon Dec 29 01:50:26 EST 2014

On Sunday, December 28, 2014 5:34:11 PM UTC-6, Vincent Davis wrote:
> 
> [snip: code sample with Unicode spaces! Yes, *UNICODE SPACES*!]

Oh my! Might I offer some suggestions to improve the
readability of this code?

1. Indexing is syntactically noisy, so if you find yourself
fetching the same index more than once, then that is a good
time to store the indexed value into a local variable.

2. The only thing worse than duplicating code which fetches
the same index over and over again is wrapping the fetch in
a casting function (in this case: "int()") OVER and OVER again!

3. I see that you are utilizing regexps to aid in the logic.
Although I agree that regexps are overkill for this problem
(since it could "technically" be solved with string
methods), if *I* had to solve this problem, I would use the
power of regexps -- although I would use them more wisely ;-)
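
To illustrate points 1 and 2, here is a minimal sketch. The
original code was snipped, so the name "parts" (a sequence
whose second item holds the serial digits) and the function
name "year_for" are hypothetical; the two ranges are taken
from the sample map further down.

```python
def year_for(parts):
    """Return a label for the serial number held in parts[1], or None."""
    # Noisy version: every comparison repeats the index AND the cast:
    #   if int(parts[1]) >= 101 and int(parts[1]) <= 15808: ...
    #   elif int(parts[1]) >= 15809 and int(parts[1]) <= 25000: ...
    serial = int(parts[1])  # fetch and convert exactly once
    if 101 <= serial <= 15808:
        return "Triumph 1951"
    if 15809 <= serial <= 25000:
        return "Triumph 1952"
    return None
```

With the value in a local name, each condition reads as a
plain range check.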

I have not studied the data thoroughly, but just by "grazing
over" the code you posted I can see a few distinct patterns
that emerge from the VIN data-set. Here is a description of
the patterns:

    "\d+n"
    "\d+na"
    "d\d+"
    "du\d+"

and the last pattern being all digits:

    "\d+"

Even though your "verbose-run-on-conditional" would most
likely execute faster, I prefer to write code (when
performance is not mission critical!) in the most readable
and maintainable fashion. And in order to achieve that goal,
you always want to keep the main logic as succinct as
possible whilst encapsulating the difficult bits in "suitably
abstracted structures".
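
As a rough sketch of what such a structure might look like,
the five patterns described above can live in one table so
that the main logic is a short loop. The labels in
"PATTERNS" and the function name "classify" are hypothetical,
and lower-casing the input is an assumption about how the
VINs are written.

```python
import re

# One entry per pattern described above; the keys are made-up labels.
PATTERNS = {
    "digits+n": re.compile(r"\d+n"),
    "digits+na": re.compile(r"\d+na"),
    "d+digits": re.compile(r"d\d+"),
    "du+digits": re.compile(r"du\d+"),
    "digits": re.compile(r"\d+"),
}

def classify(vin):
    """Return the label of the first pattern matching the whole VIN."""
    text = vin.strip().lower()
    for label, pattern in PATTERNS.items():
        # fullmatch() keeps "\d+n" from claiming the front of "123na".
        if pattern.fullmatch(text):
            return label
    return None
```

Adding a new VIN style then means adding one table entry
rather than another branch in a long conditional.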

    DIVIDE AND CONQUER!

============================================================
 My approach would be as follows:
============================================================

1. Create a map for each distinct set of VIN patterns, with
the keys being two-tuples that represent the low and high
limits of the serial number, and the values being the year
of that range.

    database = {
        'map_NA':{
            (101, 15808): "Triumph 1951",
            (15809, 25000): "Triumph 1952",
            ...,
        },

        'map_N':{
            ...,
        },

        'map_H':{
            ...,
        },

        'map_D':{
            ...,
        },

        'map_DU':{
            ...,
        },
    }

2. Create a regexp pattern for each "distinct VIN pattern".
The group captures will be used to strip out *ONLY* the
numeric parts! Then concatenate all the regexp patterns into
a single monolithic program utilizing "named groups". (Each
group name will be the name of the corresponding "map_*" in
which to search.)

    [code stub here] :-P

3. Now you can write some fairly simple logic.

    import re

    prog = re.compile("pat1|pat2|pat3...")

    def parse_vin(vin):
        match = prog.search(vin)
        if match:
            # Name of the group that matched, e.g. "map_NA".
            gname = match.lastgroup
            # The group captured only the digits; convert them once.
            number = int(match.group(gname))
            d = database[gname]
            # The keys are (low, high) tuples; the values are the years.
            for low, high in d:
                if low <= number <= high:
                    return d[(low, high)]
        return None
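
Steps 1 to 3 can be put together as a minimal, runnable
sketch. Several things here are assumptions rather than facts
from the data: that the "n"/"na" suffixes and "d"/"du"
prefixes correspond to "map_N", "map_NA", "map_D" and
"map_DU", that case does not matter, and that a VIN contains
nothing but the pattern. Only the two "map_NA" ranges come
from the example above; the other maps are left empty, and
the all-digits pattern and "map_H" are left out because their
pairing is not clear.

```python
import re

DATABASE = {
    "map_NA": {
        (101, 15808): "Triumph 1951",
        (15809, 25000): "Triumph 1952",
    },
    "map_N": {},   # placeholder, no data filled in
    "map_D": {},   # placeholder, no data filled in
    "map_DU": {},  # placeholder, no data filled in
}

# One alternative per VIN style. Each named group captures only the
# digits, and its name doubles as the DATABASE key to search.
VIN_PROG = re.compile(
    r"(?P<map_NA>\d+)NA"
    r"|(?P<map_N>\d+)N"
    r"|DU(?P<map_DU>\d+)"
    r"|D(?P<map_D>\d+)",
    re.IGNORECASE,
)

def parse_vin(vin):
    """Return the label whose serial range contains the VIN, or None."""
    match = VIN_PROG.fullmatch(vin.strip())
    if match is None:
        return None
    gname = match.lastgroup            # name of the group that matched
    number = int(match.group(gname))   # convert the digits once
    for (low, high), label in DATABASE[gname].items():
        if low <= number <= high:
            return label
    return None
```

For example, parse_vin("15809NA") looks in "map_NA" and finds
the (15809, 25000) range, while a VIN that matches a style
with an empty map, such as "D101", simply returns None.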

While this approach could be "heavy handed", I feel it will
be much easier to maintain and expand. I'd argue that if
you're going to utilize re's, then you should wield the full
power they provide; otherwise, use some other method.

PS: You know you have a Unicode monkey on your back when you
use tools that insert Unicode spaces!

PPS: Hopefully I did not make any stupid mistakes; it's past my
bedtime!