Making an In-Browser Object Detector with TensorFlow.js

May 6, 2018

What is TensorFlow?

TensorFlow is one of the most popular Machine Learning libraries in the world right now. It’s a data flow based framework developed by Google Brain.

There are three main crowds that use Machine Learning:

  1. Researchers
  2. Data Scientists
  3. Developers

Previously, these groups were not able to use the same tools to collaborate with each other on Machine Learning projects. TensorFlow was created to help solve this problem. The library was built to scale across multiple CPUs and GPUs, and it has wrappers in several languages.

TensorFlow.js

TensorFlow.js is a new JavaScript rendition of TensorFlow. Now, in the browser, you can train neural networks or run pre-trained models. This opens up several new possibilities, especially for training on real-time data streams directly from users. In this demo we’ll build a simple web application that can detect objects you show to your webcam.

Last year, the Google Brain team developed Deeplearn.js, which allowed developers to build ML models in the browser. But the code was deprecated and buggy, so TensorFlow.js was recently released as the upgrade to this project. It’s faster, has a cleaner syntax, and adds new low-level functions that let developers build very detailed models in the browser.

If we build an AI app in TensorFlow.js, all predictions and training occur entirely client side! It can use the GPU of whatever device the user accesses the app from, whether that’s a phone or a laptop.

Typically, to consume ML models (as a dev or user), we’d have to install packages or deal with technical issues. But because TensorFlow.js works in the browser, we don’t have to install any dependencies.

The library consists of two packages: a low-level Core API and a high-level Layers API for building and training models.

TensorFlow.js opens up the potential for using client side data to help train models. This would mean a whole new world of ML, where server sent events can be used to feed rich data into these models in real time, while operating in the browser.

We can create apps that continuously learn. And even if users only have a small amount of data to give, we can use the library’s ability to perform transfer learning to augment existing models’ capabilities with new data.

The data never leaves the client, so it’s privacy friendly too! We can train our models on data client side without ever seeing the actual data ourselves. There’s a lot of potential to use TensorFlow.js to create web apps that learn from their users.

TensorFlow.js Basics

Before we begin, let’s run through the core components of TensorFlow at a high level.

Deep learning models are basically graphs of computations. We have some input data (e.g. an image or piece of text). All of this data can be represented by a series of numbers. We feed that input data into a computing graph. At each step, we perform a series of mathematical operations on that input data, slowly transforming it.

We call the data that flows through the graph tensors. A tensor is a set of values in the form of a multidimensional array.

The tensors flow through the graph and eventually produce an output: the model’s prediction. In TensorFlow, tensors are our primary unit of data. Each tensor has a shape attribute that defines the array shape. We can create tensors of any size; we just have to define their shape (how many dimensions, and how long each one is) and the values that make up the tensor.

Here’s an example of a Tensor instance:

const shape = [3, 3]; // the shape of this tensor will be 3 rows x 3 columns

// this is the tensor constructor function
const tensorA = tf.tensor(
  [2.0, 3.0, 4.0, 5.0, 11.0, 7.0, 6.0, 15.0, 12.0],
  shape
);

// Here's what tensorA looks like:
// [[2.0, 3.0, 4.0],
//  [5.0, 11.0, 7.0],
//  [6.0, 15.0, 12.0]]
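
To make the link between the flat list of values and the shape more concrete, here is a small plain-JavaScript sketch that needs no TensorFlow.js at all. The helper name toRows is made up for this illustration and is not part of any library; it only mimics how a [rows, columns] shape slices a flat list into rows.

```javascript
// Plain-JavaScript sketch: how a flat list of values plus a [rows, cols]
// shape describes a 2-D array. `toRows` is a hypothetical helper, not a
// TensorFlow.js API.
function toRows(values, shape) {
  const [rows, cols] = shape;
  if (values.length !== rows * cols) {
    throw new Error(`Expected ${rows * cols} values, got ${values.length}`);
  }
  const result = [];
  for (let r = 0; r < rows; r++) {
    // Row r occupies the slice [r * cols, (r + 1) * cols) of the flat list.
    result.push(values.slice(r * cols, (r + 1) * cols));
  }
  return result;
}

const flatValues = [2.0, 3.0, 4.0, 5.0, 11.0, 7.0, 6.0, 15.0, 12.0];
const rows3x3 = toRows(flatValues, [3, 3]);
console.log(rows3x3);
```

Passing a shape whose dimensions don’t multiply out to the number of values (for example [3, 4] with nine values) makes the helper throw, which is the same kind of mismatch a tensor library has to reject.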

Tensors are immutable. Once created, we cannot change their values. Instead, we can perform operations on them, generating new tensors.

So where is the mutability when building a ML model?

That’s where variables come in. Variables are initialized with a tensor of values. Unlike tensors, their values are mutable. We can assign a new tensor to an existing variable using the assign method. They’re used to store and update values during model training. But remember, tensors are only half the equation here. We need to perform operations on them.

// usage of variables
const initialVals = tf.zeros([3]); // [0, 0, 0]
const biases = tf.variable(initialVals); // initialize biases -> [0, 0, 0]

const updatedVals = tf.tensor([0, 3.0, 4.0]);
biases.assign(updatedVals); // update biases --> [0, 3.0, 4.0]

What do we do with variables? We need to perform operations on them!

Operations allow us to manipulate the tensors. As touched on before, TensorFlow models are represented as data flow graphs. Operations are the nodes on these graphs which represent units of computation. They can be as simple as addition or subtraction and as complex as a multivariate equation. Each operation takes one or more tensors as input and outputs a tensor as well.

Example data flow graph with two inputs, an add operation, and an output.


The above diagram can be translated to the following code:

const a = tf.scalar(5);
const b = tf.scalar(3);
const c = a.add(b); // c is a new tensor holding 8

The TensorFlow library offers several operations to allow us to perform fast matrix math on tensors easily. It’s important to note that the API is chainable, so you can perform operations on the result of other operations, on the result of operations, on the result of… you get the point.
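
If the idea of a data flow graph still feels abstract, the following plain-JavaScript sketch may help. It is only a toy model of the concept, not how TensorFlow.js is implemented, and the names input, op, and evaluate are invented for this example. Inputs are leaf nodes, operations are nodes that combine the values of their inputs, and an operation can consume the result of another operation.

```javascript
// Toy data flow graph. `input`, `op`, and `evaluate` are hypothetical names
// for this illustration, not TensorFlow.js APIs.
const input = (value) => ({ kind: 'input', value });
const op = (fn, ...inputs) => ({ kind: 'op', fn, inputs });

function evaluate(node) {
  if (node.kind === 'input') {
    return node.value;
  }
  // Evaluate the upstream nodes first, then apply this node's operation.
  const args = node.inputs.map(evaluate);
  return node.fn(...args);
}

const add = (x, y) => x + y;
const multiply = (x, y) => x * y;

// Two inputs feeding an add node, as in the diagram.
const nodeA = input(5);
const nodeB = input(3);
const sum = op(add, nodeA, nodeB);

// An operation applied to the result of another operation.
const doubled = op(multiply, sum, input(2));

console.log(evaluate(sum));     // 8
console.log(evaluate(doubled)); // 16
```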

Installation

All you have to do is add a script tag to your HTML!

<html>
  <head>
    <!-- Load TensorFlow.js -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.10.0"> </script>
    <!-- ... -->
  </head>
</html>

Building a Model

We have two options for building a model in TensorFlow.js:

(1) Use operations directly to represent the work that the model does (the low-level part of the API). Here we define in detail every addition, subtraction, or multiplication that’s applied to our tensors, including the order and how often each one runs.

// Using operations directly

// Defining function
// example: y = a*x^2 + b*x + c
const predict = input => {
  return tf.tidy(() => {
    const x = tf.scalar(input);
    const axSq = a.mul(x.square());
    const bx = b.mul(x);
    const y = axSq.add(bx).add(c);

    return y;
  });
};

// Defining constants: a = 4, b = 3, c = 2
// y = 4*x^2 + 3*x + 2
const a = tf.scalar(4);
const b = tf.scalar(3);
const c = tf.scalar(2);

// Predict output for a given x input (e.g. of 5)
const solution = predict(5); // Output is 117

(2) To prototype quickly without much concern for the details, we can use the high-level Layers API (for example tf.sequential) to construct a model out of layers.

// Quickly prototyping a model
const model = tf.sequential();

model.add(
  tf.layers.simpleRNN({
    units: 20,
    recurrentInitializer: 'GlorotNormal',
    inputShape: [60, 5]
  })
);

const LEARNING_RATE = 0.01; // example value only; tune it for your own model
const optimizer = tf.train.sgd(LEARNING_RATE);
/* LEARNING_RATE defines how fast we want to update our weights.
If the learning rate is too big, our model might skip the optimal solution.
If it's too small, we might need too many iterations to converge on the best results. */

model.compile({ optimizer, loss: 'categoricalCrossentropy' });
// data and labels are tensors you supply; fit() takes them as separate arguments
model.fit(data, labels);
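
The learning-rate trade-off mentioned in the comment above is easier to see on a tiny problem. The sketch below is plain JavaScript, independent of TensorFlow.js, and the function minimize and the three rates are chosen just for this illustration. It runs gradient descent on f(w) = (w - 3)^2, whose minimum sits at w = 3.

```javascript
// Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
// `minimize` is a hypothetical helper written for this illustration.
function minimize(learningRate, steps, start = 0) {
  let w = start;
  for (let i = 0; i < steps; i++) {
    const gradient = 2 * (w - 3);    // derivative of (w - 3)^2
    w = w - learningRate * gradient; // the gradient descent update rule
  }
  return w;
}

const steps = 50;
const wellTuned = minimize(0.1, steps);  // ends up very close to 3
const tooSmall = minimize(0.001, steps); // barely moves from the start in 50 steps
const tooLarge = minimize(1.1, steps);   // overshoots further on every step

console.log({ wellTuned, tooSmall, tooLarge });
```

With the same number of steps, only the middle rate lands near the minimum. The tiny rate would get there eventually but needs far more iterations, and the large one moves away from the solution instead of toward it.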

Now it’s time for the fun part: having an AI predict what object is in front of our webcam.

We’ll use the popular Machine Learning real-time object detection model, YOLO (You Only Look Once). In a nutshell, here’s how it works:

(1) It divides an image into a grid of 13x13 cells. Each cell is responsible for predicting five bounding boxes. A bounding box describes the rectangle that encloses the object.

13x13 grid drawn on input image

(2) The model outputs a confidence score that tells us how certain it is that the predicted bounding box actually encloses some object. Once it does that, it looks something like this…

(3) For each bounding box, the cell predicts a class for what it thinks that object is, based on a probability distribution over all the possible classes. It does this after having been trained on a dataset that contains a set of labeled image classes.

(4) The confidence score for the bounding box and the class prediction are combined into one final score that tells us the probability that this bounding box contains a specific type of object.

(5) We keep only the boxes with a final score of at least 30%. This leaves us with a final prediction.

Predicted classes assigned to bounding boxes with confidence score above threshold

YOLO’s architecture uses a Convolutional Neural Network. Basically it has several layers repeated over and over again. We give the network a single input image, and in a single pass, it outputs a tensor that describes the bounding boxes for the grid cells. Then, all we need to do is compute the final scores for the bounding boxes and get rid of the ones that aren’t above our threshold.
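
Here is a rough plain-JavaScript sketch of that last step, run on made-up numbers. The field names (confidence, classScores), the three class names, and the sample boxes are all assumptions for illustration; they are not the output format of any particular library. Real implementations typically also apply non-max suppression to merge overlapping boxes, which is left out here.

```javascript
// Sketch of score computation and thresholding on invented data.
// Field names and sample values are assumptions, not a real library's format.
const CLASS_NAMES = ['person', 'dog', 'bicycle'];
const SCORE_THRESHOLD = 0.3;

// Turn raw class scores into a probability distribution over the classes.
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max)); // subtract max for numerical stability
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function scoreBoxes(boxes, threshold = SCORE_THRESHOLD) {
  return boxes
    .map((box) => {
      const classProbs = softmax(box.classScores);
      const best = classProbs.indexOf(Math.max(...classProbs));
      return {
        className: CLASS_NAMES[best],
        // Final score = box confidence x probability of the most likely class.
        score: box.confidence * classProbs[best],
      };
    })
    .filter((box) => box.score >= threshold);
}

const candidates = [
  { confidence: 0.9, classScores: [4.0, 0.5, 0.1] }, // confident box, clear class: kept
  { confidence: 0.2, classScores: [0.3, 2.0, 0.1] }, // low box confidence: dropped
  { confidence: 0.8, classScores: [1.0, 1.0, 1.0] }, // no clear class (0.8 * 1/3): dropped
];

console.log(scoreBoxes(candidates));
```

Only the first candidate survives: a box needs both a high confidence and a clear winning class to clear the threshold.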

For this demo we’re using Michael Shi’s lightweight version of this model, tfjs-yolo-tiny, which uses fewer layers but is less accurate than the full-size model. In the code block below we import this model and access our webcam to retrieve an image frame. Then we run each frame through the YOLO model in real time and draw the resulting bounding boxes on top of the image. The model is already trained, so it’s ready for use in the browser.

import yolo, { downloadModel } from 'tfjs-yolo-tiny';

const model = await downloadModel(); // async download function
const inputImg = webcam.capture();
const boxes = await yolo(inputImg, model);

// Display detected boxes
for (const box of boxes) {
  const { top, left, bottom, right, classProb, className } = box;
  drawRect(left, top, right - left, bottom - top, `${className} ${classProb}`);
}

Here’s what the object detection model looks like when it detects things:

Assigned classes and their confidence percentages

Conclusion

I’m really hopeful for what TensorFlow.js is going to do for JavaScript and the web. Currently my ML knowledge is pretty basic, but I can only imagine TensorFlow.js’s potential after digging a few layers deeper (pun intended).

And there we go! I know that was a lot to digest, especially for those without a Machine Learning background, but just note these three main takeaways:

  1. TensorFlow.js is the JavaScript version of Google’s TensorFlow, a popular Machine Learning library. It consists of (a) a low-level Core API and (b) a high-level Layers API for building and training models.
  2. It uses the user’s GPU for both training and inference, which opens up new opportunities for training on real-time data.
  3. We can use pre-built Machine Learning models in the TensorFlow.js framework and repurpose them for browser use with ease.

The code for this demo can be found here.

Resources: