Notes from an MLOps Ninja: Six best practices for moving Python code from development to production

In this article, I provide recommended guidelines to follow when moving code from a development environment to a production environment. The examples use Python, but the guidelines apply just as easily to other languages.

Guideline 1: Using Python virtual environments

The first thing to note is that the development environment often differs from the production environment. For example, the development environment is frequently the data scientist's laptop, while the production environment is a Docker container or an AWS instance. Our first guideline is to use a Python virtual environment, with identical package versions, in both environments. This way, we avoid issues caused by different versions of the modules used by the code. Working with Python virtual environments is very easy and is a great general best practice as well.

Here’s an example of creating a Python virtual environment:

> python -m virtualenv /opt/venv

> . /opt/venv/bin/activate

(venv) > pip install numpy==1.17.0

This example will create a Python virtual environment, then activate it, and then install NumPy inside the virtual environment.

Note that in the above example we are installing a specific version of NumPy. Either choose your versions before starting your project or, once you are done building the prototype of your project, use pip freeze to get a snapshot (requirements.txt) of all the packages installed under your virtual environment.

(venv) > pip freeze > /tmp/requirements.txt

Once you have the requirements.txt file, you can reinstall all the packages listed in this file by running:

(venv) > pip install -r /tmp/requirements.txt

This way you can reconstruct the exact virtual environment in different locations, making sure that you use the exact same packages in both your development and production environments.

Guideline 2: Thinking about ongoing production integration from the beginning

Another important thing to remember is that moving code from development to production is not a one-time move. It is a recurring event that needs to happen every time you would like to introduce a change to your code. To facilitate these moves, it is recommended that you enable support for running the code in both environments without modification.

For example, when accessing data on S3, it is suggested that you separate the configuration of the S3 bucket and security credentials from the code itself. Enable support for passing this information to your code as arguments or as a config file. This ensures that the development and production environments run the same S3 access code, just with different configuration.

Example of a configuration for running code in two environments:

def read_dataframe_from_s3(s3_info):
    """
    Read CSV data from S3 and return a dataframe
    :param s3_info: a dictionary containing S3 information about
        the bucket to read and the credentials to use
    :return: the resulting dataframe
    """
    ...
    return data_frame

# Later in the code we can pass the specific configuration or load
# it via command line args or environment variables
df = read_dataframe_from_s3(test_s3_info)

...
df = read_dataframe_from_s3(production_s3_info)

When you write your code with this kind of configuration separation, the same code runs under both the development environment and the production environment, so no changes need to be made when moving it. In addition, you will find that developing (and debugging) in a development environment is easier.
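
As an illustration, here is a minimal sketch of how such a configuration could be loaded from command-line arguments or environment variables. The load_s3_info helper, argument names, and environment variable names below are assumptions made for this example, not part of the original code:

import argparse
import os

def load_s3_info():
    """Build the s3_info dictionary from command-line args, falling back to environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--bucket", default=os.environ.get("S3_BUCKET"))
    parser.add_argument("--access-key", default=os.environ.get("S3_ACCESS_KEY"))
    parser.add_argument("--secret-key", default=os.environ.get("S3_SECRET_KEY"))
    args = parser.parse_args()
    return {"bucket": args.bucket,
            "access_key": args.access_key,
            "secret_key": args.secret_key}

# The same call works in development and production; only the configuration differs
df = read_dataframe_from_s3(load_s3_info())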

Guideline 3: Creating reusable components

For the next guideline, we dive deeper into the code structure. Let’s work on a case where our code is supposed to read data from a data source (e.g., S3, database), load a model from a pickle file, use this model to generate predictions from the dataset, and then save these predictions to a database.

The flow described above is very common, but it is implemented differently in different use cases. A common way of implementing this type of flow is to write one big Python function or file that contains the entire flow.

Instead of doing this, my suggestion is to divide your code into components, where each component can be a different Python function (or a class if you want to be a little bit more fancy). This way, you can assemble the different functions into a pipeline that will represent the required prediction flow. My other recommendation then is to separate the pipeline into components which interact with data sources (or destinations) and components that only manipulate the data (e.g., use it to perform predictions). This way, each code component is responsible for performing a small and specific task.

This approach provides two important benefits:

  • Easier to reuse the code components in other pipelines — For example, if we write a read_data_from_s3 component, this component is probably not specific to the pipeline it was originally used in and can be reused in other pipelines.
  • Easier to improve each component over time and make it more production aware — For example, making the read_data_from_s3 component more resilient to intermittent errors.

def read_data_from_s3(info):
    ...
    return df

def load_model(info):
    ...
    return model

def run_predictions(data, model):
    ...
    return predictions

def save_predictions_to_db(args, predictions):
    ...


def main():
    ...
    model = load_model(args)
    data = read_data_from_s3(args)
    predictions = run_predictions(data, model)
    save_predictions_to_db(args, predictions)
    ...

In the above example, we can see the definitions of all four components (i.e., functions in this case) and how the pipeline was assembled in the main() function.

Once we generate a clean API between functions, it is easier to replace any of the components with a new one. For example, it’s easy to replace read_data_from_s3 with read_data_from_db.
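
As a rough sketch of what such a replacement might look like (the connection details, dictionary keys, and use of pandas.read_sql below are assumptions for illustration, not part of the original pipeline), a read_data_from_db component only needs to honor the same contract as read_data_from_s3: take a configuration object and return a dataframe.

import MySQLdb
import pandas as pd

def read_data_from_db(info):
    """Drop-in replacement for read_data_from_s3: same input (a config
    dictionary) and same output (a dataframe)."""
    connection = MySQLdb.connect(host=info["host"],
                                 user=info["user"],
                                 password=info["password"],
                                 db=info["database"])
    try:
        return pd.read_sql(info["query"], connection)
    finally:
        connection.close()

# main() stays unchanged except for the swapped component:
#     data = read_data_from_db(args)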

Guideline 4: Hardening and debugging

Next, it’s important to understand that connectivity to data sources/destinations is not guaranteed and that the performance of such sources is also not always predictable. Making your code resilient to such issues is a key step in moving from experiments to production. This requires modifying the code to handle intermittent interruptions and adding the ability to reconnect to such data sources and destinations. For example, a remote database server might be under such a heavy load that when our pipeline code tries to access it, our code receives a “Too many connections” exception. Issues like these usually cannot be predicted, but when they happen our code should be able to continue operating.

One good suggestion is to implement retries. For example, write the code to try the operation up to 10 consecutive times with a random sleep period between retries. This enables our connector component to overcome intermittent issues caused by connectivity or database load. Of course, if the problem persists, it may be an issue we cannot handle through retries. In that case, after an appropriate number of retries, we should abort the operation (fail our component and pipeline).

Consider the following example: in a source component, we can connect to a MySQL database and obtain a dataframe by using the following code:

mysql_connection = MySQLdb.connect(host=mydbhost, password="XXXX", user="myself",…)

When the operation completes successfully, it returns a connection object. But in order to make the code safer for production, we can add a loop to retry the connection creation in case of an error:

import random
import time

for attempt in range(10):
    try:
        mysql_connection = MySQLdb.connect(...)
        break
    except SomeMySqlException as e:
        print("Attempt {}, got error: {}".format(attempt, e))
        # Sleep for a random period before the next retry
        time.sleep(random.uniform(1, 5))

The above example retries the connection up to 10 times, sleeping for a random period between attempts, before giving up.
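
Because several connector components may need the same behavior, the retry logic can also be factored into a small reusable helper. The following is a minimal sketch under that assumption; the with_retries name, its defaults, and the broad except clause are choices made for this example:

import random
import time

def with_retries(operation, max_attempts=10, max_sleep=5):
    """Run operation() until it succeeds, retrying on any exception with a
    random sleep between attempts; re-raise after the last failed attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as e:
            print("Attempt {}, got error: {}".format(attempt, e))
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(1, max_sleep))

# Usage: wrap the connection call so it can be retried
mysql_connection = with_retries(
    lambda: MySQLdb.connect(host=mydbhost, password="XXXX", user="myself"))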

Guideline 5: Executing for the long term

At this point, we have deployed our pipeline in production and it has been running for some time: maybe weeks, maybe months. One thing to remember is that each run of our pipeline is different, and usually we will be running on a different dataset every time. This means that dataset size might change as well as the time needed to access it. One of the first recommendations here is to monitor the size of your dataset and time to access it.

Why is this important? Well, things change, and without this kind of information we are blind to how our data access and storage are behaving. For example, data access may not be failing, but it can become very slow, or the data can grow very large and affect our runtime environment by consuming much more memory. By monitoring access times and data sizes, we can instrument our code with checks that warn us when something is abnormal.

For example, an SQL-to-Dataframe component can issue a system alert in case the access speed for obtaining the data is below some specified performance (such as NN MB/sec, where NN is a parameter to our component that we can configure prior to production deployment).

# Inside a SQL to Dataframe component
import time

start = time.time()

# get_dataframe_from_sql is assumed to return the size (in MB) of the data it read
data_size = get_dataframe_from_sql(...)

end = time.time()

total_time = end - start
rate = data_size / total_time
if rate < minimal_rate:
    send_system_alert("slow-sql",
                      "Sql server data rate is {} MB/sec".format(rate))

The above example generates an alert in case the data rate is slower than the specified minimum. A system alert will not fail the pipeline run, but the operator will get a notification containing the alert information. This way such issues can be tracked and fixed.

Note: In an upcoming article, we will explain how to generate an alert and how to integrate events and alerts into your production code.

Guideline 6: Expecting the unexpected—logging and testing

Our last recommendation is "Prepare for the unexpected." In production, the unexpected will happen. Make sure to save important information to log files so that when the unexpected happens you have enough information to understand what went wrong. Verify that the information you believe you are printing to the logs actually appears in the logs. It is common to believe that the logs contain enough information, only to discover that this is not true when you need them most.
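
As a minimal sketch of what this could look like with Python's built-in logging module (the log file path, logger name, and messages are placeholders, and the variables refer loosely to the pipeline sketch from Guideline 3):

import logging

logging.basicConfig(
    filename="/tmp/pipeline.log",  # placeholder path; use a persistent location in production
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("prediction-pipeline")

logger.info("Loaded model from %s", model_path)
logger.info("Read %d rows from S3", len(data))
try:
    predictions = run_predictions(data, model)
except Exception:
    # logger.exception records the full traceback, which is what you will need later
    logger.exception("Prediction step failed")
    raise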

Summary 

Let’s go quickly over what was covered here:

  • Use Python’s virtual environment for development and production environments.
  • Run code on a development environment that mirrors the production environment.
  • Separate code into multiple components, for example, connectors and algorithms.
  • Retry access to external data sources/sinks in case of intermittent errors.
  • Measure access times and data sizes, then use those metrics to determine whether data access performance is changing over time.
  • Prepare for the unexpected; make sure to use logs to output information which will be useful in case of failure.

To get started using the DataRobot Python client package with the DataRobot platform, see this guide.
