Dead Simple Machine Learning Model Serving
There has been rapid progress in machine learning over the past few years. Today, you can grab one of a handful of frameworks, follow some online tutorials, and have a working machine learning model in a matter of hours. Unfortunately, when you are ready to deploy that model into production you still face several unique challenges.
First, there is no standard for model serving APIs, so you are likely stuck with whatever your framework gives you. This might be protocol buffers or custom JSON. Your business application will generally need a bespoke client just to talk to your deployed model. And it's even worse if you are using multiple frameworks. If you want to create ensembles of models from multiple frameworks, you'll have to write custom code to combine them.
Second, building your model server can be incredibly complicated. Deployment gets much less attention than training, so out-of-the-box solutions are few and far between. Try building a GPU version of TensorFlow-serving, for example. You had better be prepared to bang your head against it for a few days.
Finally, many of the existing solutions don't focus on performance, so for certain use cases they fall short. Serving a bunch of tensor data from a complex model via a Python-JSON API is not going to cut it for performance-critical applications.
We created GraphPipe to solve these three challenges. It provides a standard, high-performance protocol for transmitting tensor data over the network, along with simple implementations of clients and servers that make deploying and querying machine learning models from any framework a breeze. GraphPipe's efficient servers can serve models built in TensorFlow, PyTorch, MXNet, CNTK, or Caffe2. We are pleased to announce that GraphPipe is available on Oracle's GitHub. Documentation, examples, and other relevant content can be found at https://oracle.github.io/graphpipe.
The Business Case
In the enterprise, machine-learning models are often trained individually and deployed using bespoke techniques. This limits an organization's ability to derive value from its machine learning efforts. If marketing wants to use a model produced by the finance group, they will have to write custom clients to interact with the model. If the model becomes popular and sales wants to use it as well, the custom deployment may crack under the load.
It only gets worse when the models start appearing in customer-facing mobile and IoT applications. Many devices are not powerful enough to run models locally and must make a request to a remote service. This service must be efficient and stable while running models from varied machine learning frameworks.
A standard allows researchers to build the best possible models, using whatever tools they desire, and be sure that users can access their models' predictions without bespoke code. Models can be deployed across multiple servers and easily aggregated into larger ensembles using a common protocol. GraphPipe provides the tools that the business needs to derive value from its machine learning investments.
GraphPipe is an efficient network protocol designed to simplify and standardize transmission of machine learning data between remote processes. Presently, no dominant standard exists for how tensor-like data should be transmitted between components in a deep learning architecture. As such, it is common for developers to use protocols like JSON, which is extremely inefficient, or TensorFlow-serving's protocol buffers, which carry with them the baggage of TensorFlow, a large and complex piece of software. GraphPipe is designed to bring the efficiency of a binary, memory-mapped format while remaining simple and light on dependencies.
GraphPipe includes:
- A set of flatbuffer definitions
- Guidelines for serving models consistently according to the flatbuffer definitions
- Examples for serving models from TensorFlow, ONNX, and Caffe2
- Client libraries for querying models served via GraphPipe
In essence, a GraphPipe request behaves like a TensorFlow-serving predict request, but uses flatbuffers as the message format. Flatbuffers are similar to Google protocol buffers, with the added benefit of avoiding a memory copy during the deserialization step. The flatbuffer definitions provide a request message that includes input tensors, input names, and output names. A GraphPipe remote model accepts the request message and returns one tensor per requested output name. The remote model must also provide metadata about the types and shapes of the inputs and outputs that it supports.
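To make the shape of that exchange concrete, here is a minimal sketch in plain Python. It is not the GraphPipe library or its flatbuffer schema: the class names (`Tensor`, `InferRequest`, `IOMetadata`, `ToyRemoteModel`) and the toy "sum" and "mean" outputs are all hypothetical, chosen only to mirror the description above (named inputs in, one tensor back per requested output name, plus metadata about supported types and shapes).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tensor:
    """Hypothetical stand-in for a tensor: a shape plus flat float data."""
    shape: List[int]
    data: List[float]


@dataclass
class InferRequest:
    """Hypothetical request: input tensors, input names, and output names."""
    input_names: List[str]
    input_tensors: List[Tensor]
    output_names: List[str]


@dataclass
class IOMetadata:
    """Hypothetical description of one supported input or output."""
    name: str
    dtype: str
    shape: List[int]  # -1 marks a dimension of variable size


class ToyRemoteModel:
    """Toy 'remote model' with one input ('x') and two outputs."""

    def metadata(self) -> dict:
        # Describe the types and shapes this model accepts and produces.
        return {
            "inputs": [IOMetadata("x", "float32", [-1])],
            "outputs": [
                IOMetadata("sum", "float32", [1]),
                IOMetadata("mean", "float32", [1]),
            ],
        }

    def infer(self, request: InferRequest) -> List[Tensor]:
        # Pair each input name with its tensor.
        inputs = dict(zip(request.input_names, request.input_tensors))
        x = inputs["x"].data
        available = {
            "sum": lambda: Tensor([1], [sum(x)]),
            "mean": lambda: Tensor([1], [sum(x) / len(x)]),
        }
        # Return exactly one tensor per requested output name, in order.
        return [available[name]() for name in request.output_names]


model = ToyRemoteModel()
request = InferRequest(
    input_names=["x"],
    input_tensors=[Tensor([4], [1.0, 2.0, 3.0, 4.0])],
    output_names=["mean", "sum"],
)
results = model.infer(request)
print([t.data for t in results])  # [[2.5], [10.0]]
```

A real client would serialize the request with the flatbuffer definitions and send it over the network. The sketch only shows the contract between caller and model.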
First, we compare serialization and deserialization speed of float tensor data in Python using a custom ujson API, protocol buffers using a TensorFlow-serving predict request, and a GraphPipe remote request. The request consists of about 19 million floating-point values (consisting of 128 224x224x3 images) and the response is approximately 3.2 million floating-point values (consisting of 128 7x7x512 convolutional outputs). The units on the left are in seconds.
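The payload sizes quoted above follow directly from the tensor shapes. The short calculation below reproduces them. The 4-bytes-per-value figure is an assumption (32-bit floats); the text does not state the element width.

```python
from math import prod

BYTES_PER_VALUE = 4  # assumption: 32-bit floats; not stated in the text


def element_count(batch: int, *dims: int) -> int:
    """Number of scalar values in a batch of tensors with the given dims."""
    return batch * prod(dims)


request_values = element_count(128, 224, 224, 3)   # 128 images of 224x224x3
response_values = element_count(128, 7, 7, 512)    # 128 outputs of 7x7x512

print(request_values)                     # 19267584  (about 19 million)
print(response_values)                    # 3211264   (about 3.2 million)
print(request_values * BYTES_PER_VALUE)   # 77070336 bytes under the assumption
```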
GraphPipe is especially performant on the deserialize side, because flatbuffers provide access to underlying data without a memory copy.
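The difference between copy-free access and parse-then-copy can be illustrated with the Python standard library alone. This is an analogy, not flatbuffers itself: a `memoryview` cast over a received byte buffer reads floats in place, whereas JSON decoding has to parse text and build a new list of float objects. The variable names are illustrative.

```python
import json
from array import array

values = array("f", [0.5, 1.5, 2.5, 3.5])  # four 32-bit floats

# Binary path: the "wire" payload is just the raw bytes of the floats.
wire_bytes = values.tobytes()

# Reading it back needs no copy: the view interprets the same buffer in place.
view = memoryview(wire_bytes).cast("f")
print(view[2])                 # 2.5
print(view.obj is wire_bytes)  # True: the view is backed by the original bytes

# JSON path: the payload is text, and decoding allocates a brand-new list.
json_payload = json.dumps(list(values)).encode("utf-8")
decoded = json.loads(json_payload)
print(decoded[2])              # 2.5
```

With four values the cost is negligible, but for a request of millions of floats the parse-and-allocate step on the JSON path is where much of the deserialization time goes.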
Second, we compare end-to-end throughput using a Python-JSON TensorFlow model server, TensorFlow-serving, and the GraphPipe-go TensorFlow model server. In each case the backend model is the same. Large requests are made to the server using 1 thread and then again with 5 threads. The units on the left are rows calculated by the model per second.
Note that this test uses the recommended parameters for building TensorFlow-serving. Although the recommended build parameters for TensorFlow-serving do not perform well, we were ultimately able to discover compilation parameters that allow it to perform on par with our GraphPipe implementation. In other words, an optimized TensorFlow-serving performs similarly to GraphPipe, although building TensorFlow-serving to perform optimally is neither documented nor easy.
Where Do I Get it?
You can find plenty of documentation and examples at https://oracle.github.io/graphpipe. The GraphPipe flatbuffer spec can be found on Oracle's GitHub along with servers that implement the spec for Python and Go. We also provide clients for Python, Go, and Java (coming soon), as well as a plugin for TensorFlow that allows the inclusion of a remote model inside a local TensorFlow graph.