How We Created a Python Development Framework for Non-Developers
Python is the core programming language leveraged by Bluevine, powering our backend, data engineering, data science, QA automation and, even more recently, data analytics.
Within data analytics, we use SQL queries against a PostgreSQL database to both fetch and transform data.
Bluevine's holistic banking solutions are designed especially for small businesses, to help them thrive in their fast-paced environment. One way we accomplish this is by letting business owners know, as quickly as possible, whether they are approved for a Bluevine service. To do this, we need to run a risk assessment on their data, and do so promptly. This proved to be a big challenge because most of our queries ran for too long and competed for resources among the few replicas we had.
The queries we used were long (the longest ran to more than 3,000 lines!) and made inefficient use of the database's capabilities, not to mention that they were unreadable and very difficult to update and build upon. Given these queries and our need for quick responses, the solution was clear: we could use Python to perform the heavy lifting and the data transformation.
Unfortunately, the analytics team had minimal experience with Python, and migrating some of our most significant operations without proper care could lead to bad code. That code would be hard to manage and might even suffer from the same issues as the SQL scripts, putting us back where we started.
We designed the following solution:
- First, provide all engineers with Python training, based on role, specifically with data-oriented third-party libraries such as pandas and NumPy.
- Second, design and build a Python framework that lets the team focus on their work while abstracting away the software engineering principles. This would allow them to execute and test their code, and to update existing processes or create new ones.
Designing the framework required us to identify recurring patterns within the SQL scripts and establish them as building blocks that are modular and easy to reuse or extend.
We identified a common structure within the SQL queries that we could apply across the scripts:
- Fetch clients based on specific criteria
- Enrich the data from different tables
- Execute logic to make a decision
Many of the scripts used similar queries to fetch clients, so we were able to provide parameterized functions that run ORM queries with the same functionality and avoid code repetition.
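To illustrate the idea of a parameterized fetch function, here is a minimal sketch. It is not our production code: it uses the standard-library sqlite3 module as a stand-in for the ORM and the PostgreSQL database, and the `clients` table, its columns, and the `fetch_clients` function are all hypothetical names chosen for the example.

```python
import sqlite3
from typing import Any


def fetch_clients(
    conn: sqlite3.Connection,
    *,
    status: str,
    min_monthly_revenue: float = 0.0,
) -> list[dict[str, Any]]:
    """Fetch clients matching the given criteria (hypothetical schema).

    The criteria are passed as bound parameters, so one function can
    replace many near-identical hand-written queries.
    """
    query = """
        SELECT id, name, status, monthly_revenue
        FROM clients
        WHERE status = ? AND monthly_revenue >= ?
        ORDER BY id
    """
    rows = conn.execute(query, (status, min_monthly_revenue)).fetchall()
    return [dict(row) for row in rows]


# Demo data in an in-memory database (an assumption made for this example).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be converted to dicts
conn.execute(
    "CREATE TABLE clients ("
    "id INTEGER PRIMARY KEY, name TEXT, status TEXT, monthly_revenue REAL)"
)
conn.executemany(
    "INSERT INTO clients VALUES (?, ?, ?, ?)",
    [
        (1, "Acme", "active", 12000.0),
        (2, "Birch", "inactive", 30000.0),
        (3, "Cedar", "active", 4000.0),
    ],
)

active_clients = fetch_clients(conn, status="active")
large_active_clients = fetch_clients(
    conn, status="active", min_monthly_revenue=10_000
)
```

Two different scripts can now call the same function with different arguments instead of each carrying its own copy of the query.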
For the second step, instead of running heavy queries and data transformations within the database, we pull only the raw data for the clients and provide generic functions, which rely heavily on pandas, to enrich it. Vectorized operations and the rich data structures pandas offers let us perform complex aggregations and transformations, such as transposing and pivoting.
This method yields multiple benefits: it improves performance and frees up database resources by avoiding inefficient operations.
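As a rough sketch of what such an enrichment function might look like, the example below aggregates and pivots raw transaction rows with pandas and joins the result back onto the client list. The column names (`client_id`, `tx_type`, `amount`) and the `enrich_clients` function are assumptions made for illustration, not the framework's actual API.

```python
import pandas as pd


def enrich_clients(clients: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
    """Add per-client aggregates computed from raw transaction rows."""
    # Vectorized aggregation: one pass over the data, no Python-level loop.
    totals = transactions.groupby("client_id")["amount"].agg(
        total_amount="sum", avg_amount="mean", tx_count="count"
    )
    # Pivot: one column per transaction type, holding the summed amount.
    by_type = transactions.pivot_table(
        index="client_id",
        columns="tx_type",
        values="amount",
        aggfunc="sum",
        fill_value=0,
    ).add_prefix("sum_")
    # Join both result sets back onto the client rows.
    return clients.join(totals, on="client_id").join(by_type, on="client_id")


# Hypothetical raw data, as it might come back from a simple fetch query.
clients = pd.DataFrame({"client_id": [1, 2], "name": ["Acme", "Birch"]})
transactions = pd.DataFrame(
    {
        "client_id": [1, 1, 1, 2],
        "tx_type": ["deposit", "withdrawal", "deposit", "deposit"],
        "amount": [100, 40, 60, 500],
    }
)

enriched = enrich_clients(clients, transactions)
print(enriched)
```

The database only has to return plain rows; the grouping and pivoting that previously lived in SQL happen in memory.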
The logic execution is trickier, since every process checks for different things. To allow flexibility, we provide a loose structure of predefined modules that each process needs to follow. For example, every process needs modules such as:
- preprocessing.py
- calculations.py
- conditions.py
- decisions.py
The framework loads these modules and executes entry point functions inside of them in a specific order. This returns a class instance of a Decision. The Decision object contains all of the information and functionality we need to send it back to our client.
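The loading step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the framework itself: the entry point name `run`, the fields of `Decision`, and the demo rules are all hypothetical, and the demo registers in-memory modules where a real process would have files on disk.

```python
import importlib
import sys
import types
from dataclasses import dataclass, field
from typing import Any

# Stage order follows the module names listed above.
STAGES = ("preprocessing", "calculations", "conditions", "decisions")


@dataclass
class Decision:
    """Hypothetical result object; the real Decision class has more to it."""
    approved: bool
    reasons: list[str] = field(default_factory=list)


def run_process(package: str, raw_data: dict[str, Any]) -> Decision:
    """Import each stage module of `package` and call its entry point in order."""
    state: Any = raw_data
    for stage in STAGES:
        module = importlib.import_module(f"{package}.{stage}")
        state = module.run(state)  # `run` is an assumed entry point name
    if not isinstance(state, Decision):
        raise TypeError("the final stage must return a Decision")
    return state


# --- Demo process: what an analyst would write, one function per module ---
def preprocessing(data):
    revenue = [v for v in data["monthly_revenue"] if v is not None]
    return {**data, "monthly_revenue": revenue}


def calculations(data):
    avg = sum(data["monthly_revenue"]) / len(data["monthly_revenue"])
    return {**data, "avg_revenue": avg, "debt_ratio": data["open_debt"] / avg}


def conditions(data):
    return {
        "enough_revenue": data["avg_revenue"] >= 10_000,
        "low_debt": data["debt_ratio"] <= 0.5,
    }


def decisions(checks):
    failed = [name for name, passed in checks.items() if not passed]
    return Decision(approved=not failed, reasons=failed)


def register_demo_process(package: str) -> None:
    """Stand-in for files on disk: expose the functions above as modules."""
    sys.modules[package] = types.ModuleType(package)
    entry_points = (preprocessing, calculations, conditions, decisions)
    for stage, entry_point in zip(STAGES, entry_points):
        module = types.ModuleType(f"{package}.{stage}")
        module.run = entry_point
        sys.modules[module.__name__] = module


register_demo_process("demo_process")
decision = run_process(
    "demo_process",
    {"monthly_revenue": [12_000, None, 15_000, 9_000], "open_debt": 4_000},
)
print(decision)
```

Because the runner only relies on module names and a single entry point per module, an analyst can add a new process by writing those four modules without touching the runner.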
The framework was designed to be easy to use, even for the most inexperienced users, yet flexible enough to allow developers to build new functionality for reuse and improvement.
We are happy to announce that we have migrated all of our processes to run on this new framework, and all new processes are created using it. We have already seen improvements since the transition:
- Python's user-friendly nature enables our analysts to learn quickly and create new rules in a timely manner.
- We have achieved major improvements to process run time overall.
- Our databases are running simpler and more refined queries, overall reducing the load.
- We have gained traceability and improved observability through logs and monitoring libraries.
We are proud to share that our project was successful, but it is important to remember that leveraging Python might not be a fit for everyone. When thinking about a similar endeavor, consider the following:
- Do your customers (analysts in this case) possess the required technical aptitude to deliver higher quality code?
- Are you able to juggle supporting the old system and developing the new framework within company timelines?
- Will this framework serve future requirements or only improve existing processes?