### Building an ML System for Continual Learning

In setting out to build Salient we unwittingly assigned ourselves a rather challenging task: designing a system that allows non-technical users to use machine learning to automate simple business processes (related to documents) without requiring them to understand anything about machine learning or even programming. Our hypothesis:

Many (most?) tasks and challenges faced by business users are specific to their organization or their set of documents so they can’t be addressed by simple pre-built and pre-packaged machine learning models.

Rather we wanted to build a system that continuously learns from the users, their documents, and their tasks.  Even tasks that are amenable to using pre-trained models will generally benefit from being tweaked by end users: nothing is more infuriating than an automated system that can’t be fixed or corrected!

Of course “users” here could be regular business users or it could be IT engineers training and integrating Salient into their infrastructure. In either case, the goal is for Salient to provide flexible, organization-specific machine learning without requiring any intervention from us (its developers) or even data scientists.

The Challenge(s)

If you start thinking about how to do this you’ll realize the challenges involved are significant:

1. Normal users or even IT admins are not data scientists or ML experts so Salient has to be able to manage the plethora of choices and steps required in building a model without user intervention.
2. In particular users don’t want to spend a large amount of time labeling documents or generating training data so the process of training the system has to be fast, smooth and feel like a natural part of their workflow. Most importantly users have to quickly see good results otherwise they will not feel motivated to continue using the system.
3. We don’t know beforehand (or ever) what the users’ documents will look like so Salient has to be able to support a wide range of document formats, languages and also domain-specific terminology and vocabulary.
4. Each client’s task is unique and is not known to us beforehand. The primary use case for Salient is extracting information from or labeling documents but these tasks span a wide range of cases:
• extract specific words or phrases from a sentence based on the meaning of the text
• extract fields from a form based on the visual information (layout of the document)
• label selections of text based on the type of section it is (e.g. clauses in a contract)

In each case we want the system to support a large amount of variability in the input and learn very quickly to generalize to similar documents.

Because of regulatory or security concerns many clients want to run Salient on premise so all the machine learning and other components of the system have to run in a relatively hardware-constrained environment (private corporate clouds are usually expensive) and so be very efficient. In particular, we can’t depend on the presence of GPUs or easily accessible computing power. Our target is to have a single CPU core be able to provide real-time interactive training and inference for a user.

Solving the challenges above has been a very enlightening journey, fraught with painful lessons. In this article I’m not going to dive very deeply into any given problem but rather will focus on the overall architecture we developed to address these problems. Hopefully, I’ll have time later to dig into individual problems and explain some of the interesting things we learned in solving them.

Architecture

This is the architecture for the ML part of Salient (there are several other components involved with document and knowledge management that are not covered here):

Below I’ll explain the components of the architecture and how they help address the problems above. Hopefully, there’s something in here for you to take away in building your own ML system.

Interface

The elaborate architecture above is invisible to the user or even admin of the system. Instead the user interacts naturally with documents in a nice “smart reader” interface that lets them view, search and extract information from documents:

They select text from the document and assign it to fields in a “form” that constitutes the “structured” version of the document (note that both the fields and form are defined by the user — Salient can be taught to extract any kind of data the user wants to extract).

As they do this Salient trains itself automatically in the background, testing and selecting the best models to match the user’s task. When users open a new document Salient can auto-populate the form so the user can evaluate how well the extraction is working and improve it. This interactive process is called “online learning” because the system learns in real-time from user interactions.  The ability to quickly see and correct what Salient is learning greatly speeds up the workflow.

Document Ingestion & Processing

This process warrants a whole article on its own but its largely outside the scope of the current discussion. Suffice it to say that document processing can be expensive. Salient reads documents in a range of format, including scanned images that it OCRs, and then converts them to a standardized format that can be rendered in the interactive browser-based view shown above. Salient doesn’t just extract the text from the document but it also tracks layout information as this is an important feature for the ML extraction models. The text of the document is then enriched with the output of a full NLP pipeline (entities, parts-of-speech, etc..). This enriched version of the document is stored in a series of data stores (mongo db, minio, & elasticsearch) so that it can be indexed and used later on.

Model Infrastructure

The machine learning pipeline in Salient runs without user intervention (though users can tweak it if desired). Everything from building unsupervised models and generating features to training and selecting models runs in an automated, distributed pipeline driven by changes of state to the training data (generated by the user) and the document set. This presents some interesting challenges discussed below.

Feature Generation & Caching

Salient makes very heavy use of unsupervised learning. Since our goal is to let users train the system very quickly and efficiently, the more work we can do to help them up-front, the better. To that end, Salient builds a range of unsupervised models (variants of word2vec and our own specialized embedding models) to capture statistical properties of the text, the layout, the entities in the documents, etc.. These features, along with other data about the documents, change quite frequently: the unsupervised models are retrained automatically as more data is added or the data distribution changes and features for new documents are generated as documents are inserted or modified, etc.

All this is computationally expensive so we want to cache the results in a way that can be quickly leveraged when we start to do supervised learning. To that end, we have developed a smart “Corpus Features” data service, built on mongo and minio, that tracks the state of the document corpus, the features for each document and automatically manages rebuilding unsupervised models as needed. We use mongo because it is very flexible (supporting the dynamical needs of such a system) and fast to write and read from and we use minio as blob storage for the models themselves.

Distributing Models

Salient is designed around a distributed stateless worker-based architecture so systems like “Corpus Features” (which can be embedded in any worker) have to be able to intelligently update any worker if e.g. a model changes. To this end all models (supervised and unsupervised) in Salient are tagged with unique ids and a centralized caching system (redis) is used to track the newest version of a model. If one worker updates a model the other workers detect this change and update their local copy of the model to the newest version. This is particularly important for supervised models because a user interacting with the system will generally be hitting several workers (possibly on different servers) so they must have a synchronized state.

Training

Training in Salient can be triggered by a range of events. Models retrain while users are interacting with the system so that users get immediate feedback on what the models have learned. This happens in dedicated background workers so as to not interfere with the user’s workflow but of course any model changes have to be propagated back to the real-time ML inference workers. This is managed by tracking the latest version of each model in a distributed cache and having all workers reload any stale models automatically.

Model retraining can also be triggered by non-user events. For instance, importing a large amounts of new documents can trigger unsupervised models to retrain (to capture the new data distribution) and this will also trigger the supervised models that use them to retrain. This chain of dependencies is managed by generating unique identifiers (UUIDs) on each model training and storing this in any dependent model so we can easily track dependency chains.

Model Selection

For every kind of task (labeling a document, extracting specific text or table entries, etc) Salient has a range of models that are sensitive to different types of features (semantic, structural, visual, local, etc..) and that also have different predictive power.  When users have less training data simpler models are favored so they can be retrained faster and be less prone to overfitting.

The selection of features, architecture and training parameters for a model is made by testing many possibilities against user generated data (as part of the background training process) and then selecting the best model based on a “validation set”.  This is a (random) subset of the training data that is hidden from the models during training and then used after training to evaluate the performance of each model variant.  Not only does this let Salient select the best combination of model, features and parameters but it also provides some indication of how well the model will perform on unseen data.  Ultimately Salient selects the best few models for each task and averages their results in an “ensemble”.

All this information about how models have been trained and selected is stored in a training database so that admins can evaluate and understand which models are being used.

Some Technicalities

The discussion and the architecture diagram above cover the conceptual components  of the system but not the detailed service or technical architecture.  That might be subject a separate post but, in brief, Salient (and the diagram above) is implemented as a distributed job queue with both real-time and batch workers running various stateless mircro-services.  This architecture runs in docker-compose or kubernetes but has no other dependencies so it can be deployed in any environment (on the cloud, on-premise).  It is, by design, horizontally scalable and fault tolerant.

If you’re interested in learning more about Salient or have some thoughts or questions about the architecture above please don’t hesitate to get in touch!

# Motivation

This post is written to show an implementation of Convolutional Neural Networks (CNNs) using numpy. I have made a similar post earlier but that was more focused on explaining what convolution in general and CNNs in particular are whereas in this post the focus will also be more on implementing them efficiently in numpy by using vectorization.

It is worth pointing out that compared to packages like tensorflow, pytorch etc., using numpy does not offer any benefits in speed in general. In fact if one includes development time, it is most likely slower. The idea of the post is not to advertise the use of numpy for speed but to share the author’s experience of rebuilding the wheel. While the same is often derided, or demonized even, it is only with rebuilding a few wheels that one is able to to invent new machinery. Not to mention, sometime rebuilding the wheel is just plain fun!

We will focus solely on the CNN part of it and the other related aspects like activation function and maxpooling etc will not discussed here as they can be implemented as separate layers acting on the output of the CNN layer. However, we do discuss padding and stride which are inherent to the convolution process.

The code for this post can be found in the repository with most of the code contained in the file CNN.py

Since the goal of the post is to show how to make an efficient implementation of CNNs in numpy, I will not provide all the code in this post but instead explain the idea of what is being done with lots of visualization and only the relevant code. All the code can of course be found in the aforementioned repository.

# Convolutional Neural Networks

## Forward Pass

We will work on 2 dimensional images with multiple channels. For actual images the number of channels is either 1 for monochrome images or 3 (Red/Green/Blue or RGB in short) for colored images. However, a CNN could itself act on the output of a previous CNN and then the meaning of channels is the output of different filters of the previous layer and so the number of  channels can be any positive integer. Nevertheless, below we concretely talk about 3 channels while coloring them RGB to help visualize the process.

The convolution layer has the filter weights as learnable parameters and two hyperparameters – the stride and the padding. These are denoted below as
$$W_{ijkl},s,p$$
which are a 4 dimensional tensor and two scalars respectively.  The first two dimensions of the weights run over the row and column of the convolutional filter whereas the third one runs over the input channels and the fourth one over the output channels. Note that the the stride and padding can be different for each dimension of the image but for notational simplicity we keep them the same below (although in the actual code allows for different values).

We denote the input of a convolution layer by
$$l^0_{I,i,j,k}$$
where the indices are the minibatch, the row, the column and the channel respectively. The output is similarly denoted by
$$l^1_{I,i,j,l}$$
However, note that except for the minibatch index, the others may  have a different range than that of the input ones. This means that the output image sizes and number of channels may be different.

The notion of padding comes from embedding this image in a (possibly) bigger 0-padded array
$$\tilde l^0_{I,P+i,P+j,k}=l^0_{I,i,j,k}$$
where $$P$$ is the padding. The convolution operation is defined as
$$l^1_{I,i,j,l} = \sum_{\alpha,\beta,k} \tilde l^0_{I,si+\alpha,sj+\beta,k} W_{\alpha,\beta,k,l}$$
where $$s$$ is the stride.

In order to understand the above, we visualize this process by taking an “image” of size $$6 \times 8$$ with three channels where we fill in the values sequentially. So the red channel has its first row from 0 to 7, the second row from 8 to 15 and so on till the 6th row going up to 47. The green channel then starts from 48 and fills in the same way.

In the above we plot the channels along the depth direction.

We make a “filter” as having the same structure but a different size. Specifically we take $$5 \times 3$$ size filters. Note that while there are three filters here, one for each input channel, this set will produces just one channel of output. For each channel of the output we will need a separate copy of such three filters.

Now we  consider the effect of the red filter on the red channel.  Before doing that take another look at the expressions for convolutions from above and try to understand what is happening.

In the example in the image below we have taken a padding of 2 in the y-direction and 1 in the x-direction. This means the original image is embedded in the center of a 0-padded array of a bigger size by 4 in the y-direction and 2 in the x-direction. This ensures that the output of convolution is the same size as the input (in the absence of a stride) but we do not have to do that. Padding can be small or perversely bigger.

Now look at the effect of convolution of the red filter on the part of the (larger embedded) image in the box starting at (4,3) and of the same size as the filter that is drawn in solid blue. Each element of the window is multiplied by the corresponding element of the filter and the result is then summed. Note that this involves a row of zero-padded values. This sum is the (4,3) rd element of the convolution of the red channel. If we had not zero-paded we could not even have this entry and the output would be limited to those values that come from convolution filters fitting completely inside the original (non zero-padded) image.

Let’s also look at the stride. The effect of the stride can be seen from the theoretical expression. A stride of 1 is the default and its result is shown above. However, if we were to choose a stride of 2, for instance, then the windows used would be like the dashed-blue windows and we see that they are moved in steps of 2.

The final ouput of one filter of 3 channels is given by summing the convolution on the three channels as shown below.

## Backpropagation

During backpropagation, the layer receives as input the derivative of the loss with respect to the forward pass output $$\frac{\partial L}{\partial l^1}$$ and outputs the derivative of the loss with respect to the forward pass input $$\frac{\partial L}{\partial l^0}$$ . If the weights are learnable then during backpropagation the layer also adjusts the weights according to the weight updation rule. For the purpose of weight updation, one needs the derivative of the loss with respect to the weights $$\frac{\partial L}{\partial W}$$. (Note that we are implicitly considering only updates that depend on the first derivative of the loss but the discussion can be generalized in a straightforward way to update methods that use higher derivative to compute momentum etc.)

The derivative of the loss with respect to the weights is given by
$$\frac{\partial L}{\partial W_{\alpha,\beta,k,m}}=\sum_{I,i,j} \frac{\partial L}{\partial l^1_{I,i,j,m}} \tilde l^0_{I,si+\alpha,sj+\beta,k}$$
and the derivative of the loss with respect to the input is given by
$$\frac{\partial L}{\partial l^0_{I,p,q,l}} = \sum_{\alpha,\beta,m} \left(\frac{\partial L}{\partial l^1} \right)_{I,\frac{p+P-\alpha}{s} , \frac{q+P-\beta}{s},m} W_{\alpha,\beta,l,m} \\ = \sum_{\alpha,\beta,m} \left(\frac{\partial L}{\partial l^1} \right)_{I,\frac{p+P-K+1+\alpha}{s} , \frac{q+P-K+1+\beta}{s},m} \tilde W_{\alpha,\beta,m,l}$$
where
$$\tilde W_{\alpha,\beta,m,l} = W_{K-\alpha, K-\beta,l,m}$$

and $$K$$ is the filter size. All these can be computed using simple algebra and the chain rule of calculus.

We observe that  the derivative of the loss with respect to the input layer is just a convolution of the error with respect to $$\frac{\partial L}{\partial l^1}$$ with a filter that has the indices along the first two directions reversed.

This observation tells us that we can reuse our code for convolution but we still need to massage the data $$\frac{\partial L}{\partial l^1}$$ before we do that. We can first embed the error with respect to the output in an array
$$z_{I,si,sj,m}=\left(\frac{\partial L}{\partial l^1} \right)_{I,i,j,m}$$

Depending on the sign of $$P-K+1$$  either a part of z is embedded in a right zero-padded array y or z is embedded in a left and right zero-padded y. The reasoning for this is not easy to explain and what I had to do was carefully consider all possibilities of padding sizes and work out the right logic. To be sure a padding size of more than half the filter size seems perverse, but the logic has to be worked out to ensure the code doesn’t break if for nothing else. If the reader finds an easy explanation I would be happy to receive the same and update this post. The code for this part is given below.

def _prepare_back(self,der_y):
mb, n1, n2, _ = self.prev_layer.shape
ch = der_y.shape[3]

m1 = max(self.stride_1 * der_y.shape[1], n1)
m2 = max(self.stride_2 * der_y.shape[2], n2)

z = np.zeros(shape=(mb, m1, m2, ch))
y = np.zeros(shape=(
mb, m1 + self.filter_size_1 - 1, m2 + self.filter_size_2 - 1, ch))

z[:,
:self.stride_1 * der_y.shape[1]:self.stride_1,
:self.stride_2 * der_y.shape[2]:self.stride_2
] = der_y

p1_left = self.padding_1 + 1 - self.filter_size_1
p2_left = self.padding_2 + 1 - self.filter_size_2

# i1,i2 are the start positions in z and iy1,iy2 are the start
# positions in y
i1 = i2 = iy1 = iy2 = 0

if p1_left > 0:
i1 = p1_left
else:
iy1 = -p1_left

if p2_left > 0:
i2 = p2_left
else:
iy2 = -p2_left

# size of array taken from x
f1 = z.shape[1] - i1
f2 = z.shape[2] - i2

y[:,
iy1:iy1 + f1,
iy2:iy2 + f2
] = z[:, i1:, i2:, :]
return y

Finally, we have

$$\frac{\partial L}{\partial l^0_{I,p,q,l}}=\sum_{\alpha,\beta,m} y_{I,p+\alpha,q+\beta,m} \tilde W_{\alpha,\beta,m,l}$$
which is just a simple convolution.

# An efficient numpy implementation

This is all very well but implemented naively in the way described above, the process is very slow. This is because there are several loops:

(i) moving a channel specific filter all over a channel (the actual convolution),
(ii) looping over the input channels,
(iii) looping over the output channels.
(iv) looping over the minibatch

Implemented cleverly the code can use not only numpy vectorization but also use the Dense Layer one may have already written (you will find an implementation in the repository mentioned above). While thinking about how to do this I came across a post by Sylvain Glugger where he does something similar and my ideas are inspired by the same. However, there are two differences (i) for his analysis the channel index is the second one, right after minibatch, and the standard image format keeps it as the last one and that is how I have implemented my code and (ii) my implementation of backpropagation is different. To be honest I could not follow his implementation and I (obviously) think mine is cleaner.

Let’s look at the expression for convolution again
$$l^1_{I,i,j,l} = \sum_{\alpha,\beta,k} \tilde l^0_{I,si+\alpha,sj+\beta,k} W_{\alpha,\beta,k,l}$$
We see that we can collapse the indices $$\alpha,\beta,k$$ of the filter $$W$$ into one index $$\mu = \alpha * f_2 * c + \beta * c + k$$ where $$f_2$$ is the range of index $$\beta$$ and $$c$$ is the number of channels. In numpy this is accomplished simply by

W.reshape(-1,W.shape[-1])


We need to correspondingly build a new tensor with the elements that are to be multiplied by these elements of the reshaped $$W$$. To be sure this is an involved process and the output will have redundancies but it will allow speed up and what’s more, looking at the expression

l^1_{I,i,j,m} = \sum_\mu \bar l^0_{I,i,j,\mu} W_{\mu,m}
\label{eq:eff_conv}

we see that it is now simply a dense layer and we can use the same for the forward pass and the updating of weights during the backpropagation. The backpropagation to find the derivative of loss with respect to $$l^0$$ is something that we will address shortly.

The procedure to obtained $$\bar l^0$$ from $$\tilde l^0$$ is straightforward and is best explained by the code that is given below. However, let us first visualize what is happening.

In the image above the original image transformed consists of the elements of the zero padded image rearranged (with duplication) so that each panel (the panels are laid out in the depth direction) contains all the elements that are to be multiplied by on element of the convolution filter. The convolution filter itself is raveled out in the depth direction (and we refer to it as the flattened filter in the image above).  The product of these two are shown next.

The product of transformed image and the flattened filter is shown again and then we show the sum of all these panels. One can verify that the results are the same.

The code to get the aforementioned transformed image from the zero padded image is given below and this is the main point of this post. Again, the code is part of the CNN layer and hence the reference to ‘self’ but the the meaning should be clear.

def _take(self, y, stride=(1, 1)):

stride_1, stride_2 = stride
mb, en1, en2, ch = y.shape

# Till we discuss the minibatch index, all comments are for the first
# image

# Make a 2d array of indices of the top-left edges of the windows
# from which to take elements. These are to be the indices on the
# first channel. This makes the indices the top-left-back end of the
# cuboid to be taken
s1 = np.arange(0, en1 - self.filter_size_1 + 1, stride_1)
s2 = np.arange(0, en2 - self.filter_size_2 + 1, stride_2)
start_idx = (s1[:, None] * en2 * ch + s2[None, :] * ch)

# Make a grid of elements to be taken in the entire cuboid whose
# top-left-back indices we have taken above. This is done only for
# the first of the above cuboids in mind.
# Note the cuboid elements are flattened and will now be along the
# 4th direction of the output
g1 = np.arange(self.filter_size_1)
g2 = np.arange(self.filter_size_2)
g3 = np.arange(ch)
grid = (g1[:, None, None] * en2 * ch + g2[None, :, None] *
ch + g3[None, None, :]).ravel()

# Combine the above two to make a 3d array which corresponds to just
# the first image in a minibatch.
grid_to_take = start_idx[:, :, None] + grid[None, None, :]

# Make and index for the starting entry in every image in a minibatch
batch = np.array(range(0, mb)) * ch * en1 * en2

# This is the final result
res = y.take(batch[:, None, None, None] + grid_to_take[None, :, :, :])
return res

The same code can be used in on the prepared $$\frac{\partial L}{\partial l^1}$$ (see code above) to give a corresponding transformed version to perform the convolution for the backpropagation.

The remaining code is pretty straightforward and can be found in the repository mentioned at the beginning of the post.

### Convolutional Neural Networks using Numpy – Part 1

There are many powerful tools like Keras and Tensorflow out there to make convolutional neural networks (CNNs). However, unless I have opened the hood and peeked inside, I am not really satisfied that I know something. If you are like me read on to see how to build CNNs from scratch using Numpy (and Scipy).

The code for this post is available in my repository

There are many powerful tools like Keras and Tensorflow out there to make convolutional neural networks (CNNs). However, unless I have opened the hood and peeked inside, I am not really satisfied that I know something. If you are like me read on to see how to build CNNs from scratch using Numpy (and Scipy).

I am making this post a multi part post. As of now I have planned two parts. In the first I will only do a two class classification problem on Fashion MNIST restricted to two classes. The images are monochromatic and the CNN will have only one convolutional filter followed by a dense layer. This simplicity will allow us to focus on the details of convolution without worrying about incidental complexity. In the second post I will tackle the full Fashion MNIST dataset (10 classes) and multiple convolution filters. I will also look at the CIFAR-10 dataset which has colored images in 10 classes.

This is going to a technical post and certain familiarity with algebra and calculus is assumed. While I know that one can do a lot with tensorflow without knowing advanced math, I do not see how it is possible to peek under the hood without the same. Algebra is essential in the forward pass and algebra and calculus are useful to compute the derivative of the loss function with respect to the weights of the network which I derive in closed form. These are used to update the weights, something commonly known as back propagation. Readers unwilling to go through the derivation but able to follow the final expressions can still benefit by understanding the final results and implementing (or following my implementation) in numpy.

# Convolution

In this section we will discuss what exactly we mean by convolution in image processing and how it is related to the implementation in scipy. This is useful as scipy implementation is much faster than a naive numpy implementation. In the end we will consider an example where we compute the convolution by hand and by using scipy as a sanity check. For simplicity we will work in 1 dimension for this section but the arguments extend simply to any number of dimensions.

Mathematically a convolution of a function f by a function g is defined as

This definition is useful in physics and signal processing. For instance in electromagnetism g could be the charge density and f (called a Green’s function in that case) would help us compute the potential due to said charge distribution. (See books on electromagnetism like Jackson or particle physics like Peskin and Schroeder for more details).

In image processing this definition is a bit backwards in that for a convolution window g , of length , we would want the convolution at to be the weighted average of such that the value at −/2+ is weighted by (). Thus, for image processing purposes the correct definition is

Note that this will require value of f(x) for negative values of x and if this is put in as it is then numpy will interpret the negative numbers as indexing from the end of the array. Thus, while implementing this in numpy, we need to make sure that the original array is embedded in a bigger 0-padded one and negative indexes are understood appropriately.

In our implementation of CNNs, we will use scipy.convolve as it will be faster than a naive implementation in numpy. So it is useful to understand the way scipy.convolve is related to what we want. Scipy’s convolve is for signal processing so it resembles the conventional physics definition but because of numpy convention of starting an array location as 0, the center of the window of g is not at 0 but at K/2. So scipy.convolve uses the definition

Now, we if reverse the scipy convolution window we have y ->K-y and that makes the integral

Thus we will get the result we want by giving the reversed array of the convolution window to scipy.convolve.

As an example consider the signal and filter given below.

We then implement the convolution by hand and using scipy in the following command

convolve(sig,win[::-1],'same')

The results are plotted below. The reader is advised to convince herself that the results are the same either by eyeballing the graphs or implementing the two methods herself.

We are now ready to implement CNNs using numpy.

# Two-class classification

As our first example we implement a two class classification using numpy. I discuss this in full detail and for the other examples in latter posts I will expect that the reader has understood this example in detail and go a bit faster.

For the data we will used the Fashion MNIST. We get this from Keras as it is easily available but do note that this is the only use of Keras in this post.

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

We then restrict our dataset to just the category 1 and 3.

cond=np.any([y_train_full==1,y_train_full==3],0)X_train_full=X_train_full[cond]
y_train_full=y_train_full[cond]
X_test=X_test[np.any([y_test==1,y_test==3],0)]
y_test=y_test[np.any([y_test==1,y_test==3],0)]y_train_full=(y_train_full==3).astype(int)
y_test=(y_test==3).astype(int)

We then split our training set into a train and validation set

X_train, X_valid = X_train_full[:-1000], X_train_full[-1000:]
y_train, y_valid = y_train_full[:-1000], y_train_full[-1000:]



And we finally standardize the data as Neural Networks work best with data with vanishing mean and unit standard deviation.

X_mean = X_train.mean(axis=0, keepdims=True)
X_std = X_train.std(axis=0, keepdims=True) + 1e-7
X_train = (X_train - X_mean) / X_std
X_valid = (X_valid - X_mean) / X_std
X_test = (X_test - X_mean) / X_std

Our data are monochrome images that come as matrices of dimension L x L (here L=28). For simplicity we have only one convolution layer (we will relax this restriction in later posts) which is a matrix of size K x K (we will take K=3 in our examples) whose weights can be learnt. In addition we have a dense layer which is a matrix of size (L*L) x 1. Note that we have not kept bias terms and the reader can include them as an exercise once she has understood this example.

We now describe the forward pass, the error and the back propagation in some detail.

## Forward Pass

• We get the images and refer to it as the 0-th layer 0
• We embed the image centered into one of size (+,+) with zero padding
• Then we pass it through a convolutional layer and an activation function f1 (that we will take to be Relu). This is the first layer l1.
• Finally we make a dense layer wrapped in a function f2 (which in this case we take to be a sigmoid function). This is the second layer l2.

Note that even though the weights W2 are for a dense layer we have written the indices here without flattening as it has an intuitive interpretation in that it takes each pixel of the convoluted image and makes a weighted sum. Nevertheless, flattening both and doing a numpy dot is more performant and that is what we do in the code below.

For the loss function we take the usual log-loss

where y is the true outcome.

Backpropagation is just a fancy word for saying that all the learnable weights are corrected by the gradient of the loss function with respect to the weights that are being learned. It is straightforward to differentiate the loss function with respect to W1 and W2 using the chain rule.

• The derivative of the loss function with respect to the layer 2 is
• The derivative of the loss function with respect to the dense layer weights is

where we have used the fact that l2 is the output of the sigmoid function s(x) and that s’(x)=s(x)(1-s(x)).

• Similarly, the derivative of the loss function with respect to layer 1 is
• The derivative of the loss function with respect to the convolution filter is

Now we have closed form expressions for the derivative of the loss function with respect to the weights of the convolution filter as well as the final dense matrix so we can update the weights (with a hyperparameter the learning rate) as

That’s it.

That is all that is required to write the implementation of the CNN. Before doing that we first note the loss function and accuracy for initial random weights so as to have a benchmark.

Before training starts, the loss on average can be obtained analytically from the expression for the loss and is 1. The accuracy is 0.5. We will run our code over five epochs and see the loss and accuracy on the test and validation set.

First we write some custom functions for Relu, its derivative and the sigmoid function that we require for the main code and a function to just do the forward pass and compute loss and accuracy to do so on the validation set

def relu(x):
return np.where(x>0,x,0)

def relu_prime(x):
return np.where(x>0,1,0)

def sigmoid(x):
return 1./(1.+np.exp(-x))

def forward_pass(W1,W2,X,y):
l0=X
l0_conv=convolve(l0,W1[::-1,::-1],'same','direct')

l1=relu(l0_conv)
l2=sigmoid(np.dot(l1.reshape(-1,),W2))
l2=l2.clip(10**-16,1-10**-16)
loss=-(y*np.log(l2)+(1-y)*np.log(1-l2))
accuracy=int(y==np.where(l2>0.5,1,0))
return accuracy,loss

Now we write the main part of the code

# learning rate
eta=.001
for epoch in range(5):
# custom code to keep track of quantities to
# keep a running average. it is not shown for clarity.
# the reader can implement her own or ask me in the comments.    train_loss, train accuracy=averager(), averager()

for i in range(len(y_train)):

# Take a random sample from train set
k=np.random.randint(len(y_train))
X=X_train[k]
y=y_train[k]

##### FORWARD PASS ######

# First layer is just the input
l0=X

# Embed the image in a bigger image.
# It would be useful in computing corrections
# to the convolution filter
lt0=np.zeros((l0.shape[0]+K-1,l0.shape[1]+K-1))
lt0[K//2:-K//2+1,K//2:-K//2+1]=l0

# convolve with the filter
# Layer one is Relu applied on the convolution                l0_conv=convolve(l0,W1[::-1,::-1],'same','direct')
l1=relu(l0_conv)

# Compute layer 2
l2=sigmoid(np.dot(l1.reshape(-1,),W2))
l2=l2.clip(10**-16,1-10**-16)

####### LOSS AND ACCURACY #######
loss=-(y*np.log(l2)+(1-y)*np.log(1-l2))
accuracy=int(y==np.where(l2>0.5,1,0))

# Save the loss and accuracy to a running averager
train_loss.send(loss)
train_accuracy.send(accuracy)

##### BACKPROPAGATION #######

# Derivative of loss wrt the dense layer
dW2=(((1-y)*l2-y*(1-l2))*l1).reshape(-1,)

# Derivative of loss wrt the output of the first layer
dl1=(((1-y)*l2-y*(1-l2))*W2).reshape(28,28)

# Derivative of the loss wrt the convolution filter        f1p=relu_prime(l0_conv)
dl1_f1p=dl1*f1p
dW1=np.array([[
(lt0[alpha:+alpha+image_size,beta:beta+image_size]\
*dl1_f1p).sum() for beta in range(K)]\
for alpha in range(K)])

W2+=-eta*dW2
W1+=-eta*dW1
loss_averager_valid=averager()
accuracy_averager_valid=averager()

for X,y in zip(X_valid,y_valid):
accuracy,loss=forward_pass(W1,W2,X,y)
loss_averager_valid.send(loss)
accuracy_averager_valid.send(accuracy)

train_loss,train_accuracy,valid_loss,valid_accuracy\
=map(extract_averager_value,[train_loss,train_accuracy,
loss_averager_valid,accuracy_averager_valid])

# code to print losses and accuracies suppressed for clarity

On my trial run I got the following results. Your’s should be similar

We can see that even this simple model reduces the loss from about 1 to 0.1 and increases the accuracy from 0.5 to about 0.96 (on the validation set).

We can visualize the convolution by plotting a few (standardized) images and their convolution with the original random filter and the final trained one.

We see that the convolution filter seems to have learned to identify the right edges somewhat. It should be better when we train multiple filters but we have kept things simple by having just one filter. (This choice itself is a hyperparameter). We can do one more thing before finishing off part 1 of this post. We can make two more models: 1) We freeze the convolution layer weights and 2) We freeze the dense layer weights. This would help us understand if these layers contribute to the process. This is very easy. For the former we comment out the update line for W1

W2+=-eta*dW2
# W1+=-eta*dW1

and for the latter we comment out the update line for W2

# W2+=-eta*dW2
W1+=-eta*dW1

The resulting loss and accuracy are

and

respectively. It is clear that most of the work is done by the dense layer but we can still get a loss of .5 and an accuracy of .8 by just the convolution layer. Presumably this performance will increase with a larger number of filters. We will explore this in later posts.

### Enemy of the State

Nowadays, I spend a lot of time thinking about system design and architecture. I spoke last year at PyParis about the lessons we learned building Salient and how to get some control of a large and complex system.

Of course,  a system like Salient is always evolving,  usually as a function of requirements and customer feedback (someone once used the analogy of “changing the engine of an airplane in mid-flight”).  In this post however I’m going to focus on a different kind of insight or requirement, one that doesn’t come directly from the customer but rather, indirectly, by challenging our own understanding of software and how it should be designed and managed.

As a former theoretical physicist I’m very familiar with the idea that abstract thinking, and particularly the right abstractions, can have a huge impact in solving practical problems.  In my own career I witnessed how, by recasting and reformulating the same problem with new and better abstractions we were able to achieve hundred-fold performance and accuracy improvements more than once.

So I was very pleasantly surprised when I discovered a paper espousing a similar powerful set of abstractions in the much more general setting of software design.  I learned about “Out of the Tar Pit” in this very beautiful talk by Rich Hickey (which warrants an entire blog post on its own) and have since been slowly digesting it and working through its implications for our software.

State and Control

The central point of the paper is that much of the complexity of large software systems is potentially avoidable (at least it is not theoretically essential) and has its origins in appropriate handling of state and control.

While the problem of state is something many of us have thought a lot about, I was less familiar with the notion of control and particularly that it might not be essential.

Basically software is generally defined in terms of a series of operations that happen in a certain order to generate a result.   The examples they give are very illustrative:

a = b + 3
c = d + 2
e = f/4

While the above lines might appear in a body of code its very clear that we are enforcing an order on them unnaturally.  I’m so used to this that it did not occur to me that this is really an artifact of the implementation rather than an essential feature of the problem.

Programming paradigms like declarative programming challenge this.  When we write an SQL statement or specify a pod spec in Kubernetes we are not specifying a set of operations but rather the desired business logic.  This cleanly separates out the logical constraints (e.g. relationships in a data model) or system specification from the flow of control required to achieve this (which is is generally not at all unique).  Think of the distinction between simply declaring that a field is a Foreign Key versus the code required to enforce this.  When thought of in these terms its rather shocking that we systematically confound these two aspects of system design in standard software development.

Accidental and Essential

The second dimension used in their deconstruction of the software development process is the distinction between essential and accidental complexity.  Essential complexity is, in their language, complexity that is innate to the specification of the problem.  This is to be distinguished from accidental complexity which emerges by thinking about the problem in terms of a specific implementation or paradigm.

In some sense this leads us to a very user-centric view where only user-specified aspects of the problem become truly essential.  For instance nothing about the choice of programming language, database, operating system, hardware, caching mechanism, etc. can be characterized as essential.  All these are subject to change and critical review.

By imagining a fictitious ideal world with no resource constraints they arrive at the following perhaps surprising observations:

1. The only essential state in any system is the state provided as inputs by the users (and only if the requirements imply that this state must be available later on).
2. There is no essential control in the system.  All control is a source of accidental complexity.  This is essentially tantamount to the statement that the entire system can be specified in declarative terms (users don’t care how you enforce the business rules).

Reflecting on the above, we are forced into an embarrassing realization.  Very little of the state we usually associate with a system (the database, the caches, status of various parts of the system, etc) are user-specified.  They are almost all implementation details and hence a source of “accidental” rather than “essential” complexity.  You might object that without those things the system can not reasonably be imagined to function but the important thing here is not to ignore the presence of such accidental complexity but rather to carefully distinguish essential from accidental complexity and to make this distinction the overriding principle in designing the system.

So far I’ve only attempted to heavily paraphrase some parts of the problem statement of the paper.  The ideas in this paper, and their implication for our own (as well as others’) software design can easily fuel several blog posts.  But before stopping I want to spend a moment making the above discussion more concrete, lest the reader be left with the impression that this is a purely academic exercise.

In a complex system like Salient we have a large amount of both state and control.  Documents are uploaded into the system, users annotate and interact with them, machine learning systems analyze them extracting entities, labeling types of sentences, users train the machine learning systems both directly and indirectly, etc.  These interactions are all orchestrated by processes invoked directly by users or by background tasks or even chains of tasks.  The flow of data and control logic spans multiple software components and generally runs in a distributed system composed of several physical servers.

At first glance this all sounds like an essential aspect of a system of this scale and complexity (and part of what makes it non-trivial to manage).  But by stepping back and asking ourselves some simple questions we can unravel a lot of this complexity (with the goal of eventually putting it back together in a much smarter way).

1. What is the absolute minimum amount of data required to rebuild the system from scratch?  Thinking about this question helps a lot: it turns out that the answer to this question corresponds to a very small subset of the data actually stored and used in the system.  The rest is really “accidental state”.
2. What drives the flow of information and processes in the system and is it canonical (i.e. the only way to do it) or just accidental?  Can we move away from user or agent driven processes and towards a more declarative structure, where the software robustly tries to achieve a certain state derived from the above minimal state and a set of business rules?

Thinking about Salient in these terms has helped drive some very nice design decisions on our part.

For instance an object-oriented way of thinking (something criticized heavily in the paper) would encourage us to encapsulate all data related to a document in one place, both the data uploaded by the user (the original document and annotations about it) as well as a huge amount of derived data (various processing and machine learning outputs, etc).  This tends to lead to a rather complex notion of state where essential state information about a document might be distributed over many data stores (for performance and other reasons) leading to a complex system of synchronization.  This is a good example of accidental complexity that feeds not only into the software but also into infrastructure and even system management processes.  Rethinking this with an eye towards accidental versus essential state lets us design a much cleaner version of the system.

Even in aspects of the system that have been built with very careful separations of concerns we are finding that using the above language helps us clarify and refine the design, mitigating potential flaws down the road.

In follow up blog posts I hope to dig more into this paper (and related ideas that we’ve been exploring at Lore) and also more specific examples of design patterns it has inspired.

### Building a high-performance, scalable ML & NLP platform with Python

I just had the opportunity to share some of our experiences building our machine learning and NLP stack at PyParis 2017 today.  I feel like we’ve come a long way and often this involved learning things the hard way so I tried to share the lessons we learned in this talk.  I hope there’s something in here that might be useful for small startups trying to balance complex and changing business requirements with interesting and not entirely standard technical problems.  If you have any thoughts, suggestions or improvements please drop us a line!

Once the recording of the talk is available from PyParis I’ll post a link here.

PyParis2017_talk

### Machine Learning and Knowledge Management

Last Friday, I had the opportunity to present to the DC Chapter of the Knowledge Management Association. I went over different ways artificial intelligence can be applied to help address challenges facing knowledge management. It was a great experience and I learned a lot from the questions and interactions with our local KM experts. Here are the slides: KMA Presentation

### Introducing Xynn

Xynn is our proprietary technology platform, which incorporates our own algorithms as well as cutting-edge artificial intelligence research from academia. It powers all the products on our website and also allows us to quickly generate bespoke solutions.  In this post, I’d like to share a little about Xynn and how it provides a very powerful basis for building natural language solutions.

### Creating Centaurs

The promise of artificial intelligence is to teach computers to understand the world so they become better at helping us interact with it.

In 1997, the artificially intelligent machine Deep Blue defeated the reigning chess champion, Garry Kasparov. Its victory didn’t spell the end of the human endeavour of chess; rather, it heralded in a new era of human-machine synergy. Today’s best chess players are “Centaurs” – human chess experts enhanced and backed by the power of artificial intelligence. Continue reading “Creating Centaurs”