[scikit-learn] [Scikit-learn-general] Estimator serialisability

Thu Jul 14 10:20:15 EDT 2016

Thanks Nick!

On Thu, Jul 14, 2016 at 10:18 AM, Nick Pentreath <nick.pentreath at gmail.com>
wrote:

> For PFA, you may wish to check out
> https://github.com/opendatagroup/hadrian/ (the "titus" subproject is a
> full Python impl of PFA, with a focus on some "model producing" hooks such
> as a PrettyPFA higher-level text-based DSL for PFA document construction).
>
>
>
> On Thu, 14 Jul 2016 at 16:07 William Komp <wkomp at smarterhq.com> wrote:
>
>> Hi,
>> Interesting conversation. I have captured model parameters in SQL and use
>> SQL for scoring in massively parallel setups. You can score billion-record
>> sets in seconds. It works really well with logistic regression and other
>> function-based models. Trees would be a bit more difficult.
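A rough sketch of the SQL-scoring idea described above, assuming the fitted logistic regression has already been reduced to a plain mapping of column name to coefficient plus an intercept. The function names, table name, and column names are hypothetical; it also assumes the target SQL dialect provides EXP() and that the column names are trusted identifiers.

```python
import math


def logistic_sql(coefficients, intercept, table, score_alias="score"):
    """Build a SELECT statement that scores every row of `table`.

    `coefficients` maps column name -> fitted weight (hypothetical layout).
    """
    # repr() of a Python float round-trips exactly, so no precision is lost.
    terms = [f"({float(coef)!r} * {column})" for column, coef in coefficients.items()]
    linear = " + ".join([repr(float(intercept))] + terms)
    # Logistic function applied to the linear predictor.
    return f"SELECT *, 1.0 / (1.0 + EXP(-({linear}))) AS {score_alias} FROM {table}"


def logistic_python(coefficients, intercept, row):
    """Reference scorer in Python, useful for spot-checking the SQL output."""
    z = intercept + sum(coef * row[col] for col, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


coefs = {"age": 0.03, "visits": -0.5}
query = logistic_sql(coefs, -1.2, "customers")
print(query)
```

Comparing a handful of rows scored by the database against logistic_python() is a cheap way to confirm that the generated expression matches the model.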
>>
>> Has there been any discussion of incorporating PFA (Portable Format for
>> Analytics, http://dmg.org/pfa/index.html) into scikit-learn? Bob Grossman is
>> the driving force behind it. Here is a link to a deck from a Predictive
>> Analytics World talk he gave in Chicago a few months ago.
>>
>>
>> http://www.slideshare.net/rgrossman/how-to-lower-the-cost-of-deploying-analytics-an-introduction-to-the-portable-format-for-analytics
>>
>> William
>>
>> On Thu, Jul 14, 2016 at 8:35 AM, Dale T Smith <Dale.T.Smith at macys.com>
>> wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>> I investigated this subject last year, and have tried to keep up, so I
>>> can perhaps offer some alternatives.
>>>
>>>
>>>
>>> · The only packages I know that read PMML in Python are
>>> proprietary. There are several alternatives for writing to PMML, as you can
>>> easily find.
>>>
>>>
>>>
>>> I also found
>>>
>>>
>>>
>>> https://code.google.com/archive/p/augustus/
>>>
>>>
>>>
>>> and
>>>
>>>
>>>
>>> https://github.com/ctrl-alt-d/lightpmmlpredictor
>>>
>>>
>>>
>>> Depending on your project, sklearn-compiledtrees may be an option.
>>>
>>>
>>>
>>> https://github.com/ajtulloch/sklearn-compiledtrees
>>>
>>>
>>>
>>> Py2PMML (
>>> https://support.zementis.com/entries/37092748-Introducing-Py2PMML) is
>>> by Zementis and it’s a commercial product, meaning you pay for a license.
>>>
>>>
>>>
>>> · Another option is what we planned to do at an old job of mine:
>>> read the model characteristics out of the scikit-learn object after fit,
>>> and produce C code ourselves. This is a viable option for decision trees.
>>> Adapt print_decision_trees() from this Stack Overflow answer.
>>>
>>>
>>>
>>>
>>> http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree
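A minimal sketch of the "generate C code from a fitted tree" approach, not the code from the Stack Overflow answer. It assumes the tree has been read out into flat per-node sequences laid out like scikit-learn's `tree_.children_left`, `tree_.children_right`, `tree_.feature` and `tree_.threshold`, with `value` flattened to one prediction per node (for a single-output regressor that would be something like `tree_.value[:, 0, 0]`). The function name tree_to_c and the hand-written example tree are hypothetical.

```python
def tree_to_c(children_left, children_right, feature, threshold, value, name="predict"):
    """Emit a C function whose nested if/else blocks mirror the tree."""
    lines = [f"double {name}(const double *x) {{"]

    def emit(node, depth):
        pad = "    " * depth
        if children_left[node] == -1:  # -1 marks a leaf in scikit-learn's arrays
            lines.append(f"{pad}return {float(value[node])!r};")
            return
        # scikit-learn sends a sample left when x[feature] <= threshold.
        lines.append(
            f"{pad}if (x[{int(feature[node])}] <= {float(threshold[node])!r}) {{"
        )
        emit(children_left[node], depth + 1)
        lines.append(f"{pad}}} else {{")
        emit(children_right[node], depth + 1)
        lines.append(f"{pad}}}")

    emit(0, 1)  # node 0 is the root
    lines.append("}")
    return "\n".join(lines)


# Hand-written three-node tree: the root splits on feature 0 at 2.5.
code = tree_to_c(
    children_left=[1, -1, -1],
    children_right=[2, -1, -1],
    feature=[0, -2, -2],
    threshold=[2.5, -2.0, -2.0],
    value=[2.0, 1.0, 3.0],
)
print(code)
```

The recursion is fine for typical tree depths; a very deep tree would call for an explicit stack instead.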
>>>
>>>
>>>
>>> · You can also reconsider your use of joblib.dump. I’m
>>> aware that it has problems, but you can include enough versioning
>>> information in the objects you dump to let your code check that
>>> scikit-learn versions are compatible, etc. I know this is a
>>> pain in the neck, but it’s a viable alternative to creating your own PMML
>>> reader, writing a code generator of some kind, or buying a license.
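One way the version-check idea could look, as a sketch only. To stay self-contained it uses the standard-library pickle module as a stand-in for joblib.dump and joblib.load, and it writes the version information to a JSON sidecar file so that the check runs before anything is unpickled. The function names, the ".meta.json" suffix, and the version strings are hypothetical; in practice the versions dict would be built from values such as sklearn.__version__.

```python
import json
import os
import pickle
import tempfile


def dump_with_versions(model, path, versions):
    """Persist `model` plus a JSON sidecar recording library versions."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)  # stand-in for joblib.dump(model, path)
    with open(path + ".meta.json", "w") as fh:
        json.dump(versions, fh)


def load_checked(path, current_versions):
    """Load the model only if every recorded version matches the current one."""
    with open(path + ".meta.json") as fh:
        saved = json.load(fh)
    mismatched = {
        name: (saved.get(name), current)
        for name, current in current_versions.items()
        if saved.get(name) != current
    }
    if mismatched:
        raise RuntimeError(f"version mismatch (saved, current): {mismatched}")
    with open(path, "rb") as fh:
        return pickle.load(fh)  # stand-in for joblib.load(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    # A plain dict stands in for a fitted estimator here.
    dump_with_versions({"center": 0.0, "scale": 1.0}, path, {"sklearn": "0.17.1"})
    restored = load_checked(path, {"sklearn": "0.17.1"})
    try:
        load_checked(path, {"sklearn": "0.18"})
        mismatch_detected = False
    except RuntimeError:
        mismatch_detected = True
```

An exact-match check is the strictest policy; a real deployment might relax it to a tested range of versions.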
>>>
>>>
>>>
>>>
>>>
>>>
>>> __________________________________________________________________________________________
>>> *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data
>>> Science and Capacity Planning
>>> | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
>>>
>>>
>>>
>>> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=
>>> macys.com at python.org] *On Behalf Of *Joel Nothman
>>> *Sent:* Thursday, July 14, 2016 4:18 AM
>>> *To:* Scikit-learn user and developer mailing list
>>> *Subject:* Re: [scikit-learn] [Scikit-learn-general] Estimator
>>> serialisability
>>>
>>>
>>>
>>> This has been discussed numerous times. I suppose no one thinks
>>> supporting pickle only is great, but a custom dict is unmaintainable. The
>>> best we've got AFAIK (and it looks
>>> <https://github.com/jpmml/jpmml-sklearn/graphs/contributors> like it's
>>> getting better all the time) is a tool to convert one-way to PMML, which is
>>> portable to production environments. See
>>> https://github.com/jpmml/sklearn2pmml (python interface) and
>>> https://github.com/jpmml/jpmml-sklearn (command-line interface and guts
>>> of the thing).
>>>
>>>
>>>
>>> I hope that helps; and thanks to Villu Ruusmann: that list of supported
>>> estimators is awesome!
>>>
>>>
>>>
>>> PS: please write to the new list at scikit-learn at python.org
>>>
>>>
>>>
>>> On 14 July 2016 at 17:24, Miroslav Zoričák <miroslav.zoricak at gmail.com>
>>> wrote:
>>>
>>> Hi everybody,
>>>
>>>
>>>
>>> I have been using scikit-learn for a while, but I have run into a
>>> problem that does not seem to have any good solutions.
>>>
>>>
>>>
>>> Basically I would like to:
>>>
>>> - build my pipeline in a Jupyter Notebook
>>>
>>> - persist it (to json or hdf5)
>>>
>>> - load it in production and execute the prediction there
>>>
>>>
>>>
>>> The problem is that for persisting estimators such as the RobustScaler
>>> for example, the recommended way is to pickle them. Now I don't want to do
>>> this, for three reasons:
>>>
>>>
>>>
>>> - Security: pickle is potentially dangerous, since loading a pickle can
>>> execute arbitrary code
>>>
>>> - Portability: I can't unpickle it in Scala, for example
>>>
>>> - Pickle stores a lot of details and information which is not strictly
>>> necessary to reconstruct the RobustScaler and therefore might prevent it
>>> from being reconstructed correctly if a different version is used.
>>>
>>>
>>>
>>> Another option I seem to have is to access the private members of
>>> each estimator that I want to use and store them on my own, but this is
>>> inconvenient, because:
>>>
>>>
>>>
>>> - It forces me as a user to understand how the robust scaler works and
>>> how it stores its internal state, which is generally bad for usability
>>>
>>> - The internal implementation could change, leaving me to fix my
>>> serialisers (see #1)
>>>
>>> - I would need to do this for each new Estimator I decide to use
>>>
>>>
>>>
>>> Now, to me it seems the solution is quite obvious:
>>>
>>> Write a Mixin or update the BaseEstimator class to include two
>>> additional methods:
>>>
>>>
>>>
>>> to_dict() - returns a dictionary such that, when passed to
>>>
>>> from_dict(dictionary) - the original object is reconstructed
>>>
>>>
>>>
>>> These dictionaries could be passed to the JSON module or the YAML module
>>> or stored elsewhere. We could provide more convenience methods to do this
>>> for the user.
>>>
>>>
>>>
>>> In case of the RobustScaler the dict would look something like:
>>>
>>> { "center": "0.0", "scale": "1.0"}
>>>
>>>
>>>
>>> Now the bulk of the work is writing these serialisers and deserialisers
>>> for all of the estimators, but that can be simplified by adding a method
>>> that could do that automatically via reflection and the estimator would
>>> only need to specify which fields to serialise.
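To make the proposal concrete, here is a minimal sketch of such a mixin, in which each class declares the fields to serialise and the mixin does the rest via getattr/setattr. Nothing here is scikit-learn API: DictSerialisableMixin, _serialised_fields and ToyScaler are hypothetical names, and ToyScaler is only a simplified stand-in for an estimator such as RobustScaler.

```python
import json


class DictSerialisableMixin:
    # Subclasses list the attributes needed to rebuild a fitted instance.
    _serialised_fields = ()

    def to_dict(self):
        """Collect the declared fields into a plain dictionary."""
        return {name: getattr(self, name) for name in self._serialised_fields}

    @classmethod
    def from_dict(cls, state):
        """Rebuild an instance from a dictionary produced by to_dict()."""
        obj = cls()
        for name in cls._serialised_fields:
            setattr(obj, name, state[name])
        return obj


class ToyScaler(DictSerialisableMixin):
    """Simplified stand-in for an estimator; not scikit-learn code."""

    _serialised_fields = ("center_", "scale_")

    def fit(self, values):
        ordered = sorted(values)
        self.center_ = ordered[len(ordered) // 2]  # middle element as the centre
        self.scale_ = (ordered[-1] - ordered[0]) or 1.0  # range, guarding against 0
        return self

    def transform(self, values):
        return [(v - self.center_) / self.scale_ for v in values]


scaler = ToyScaler().fit([1.0, 2.0, 3.0, 4.0, 10.0])
text = json.dumps(scaler.to_dict())  # round-trip through JSON text
clone = ToyScaler.from_dict(json.loads(text))
```

With real estimators the fitted attributes are usually NumPy arrays, which the json module cannot serialise directly, so to_dict() would need to convert them (for example with .tolist()) and from_dict() would need to convert them back.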
>>>
>>>
>>>
>>> I am happy to start working on this and create a pull request on Github,
>>> but before I do that I wanted to get some initial thoughts and reactions
>>> from the community, so please let me know what you think.
>>>
>>>
>>>
>>> Best Regards,
>>>
>>> Miroslav Zoricak
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general at lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>