This article briefly presents Theano, a machine learning library, and introduces it with a small regression problem.

It describes how to use Theano to classify a single word as French or English, using a classifier over only 4 features :

  • The size of the word divided by 30 (let’s assume there is no word exceeding 30 characters)
  • The vowel ratio
  • Maximum consecutive vowels substring length (divided by the word size)
  • Maximum consecutive consonants length (divided by the word size)
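
As an illustration, the four features above could be extracted with a few lines of plain Python. This is only a sketch: the vowel set, the definition of the vowel ratio as "number of vowels divided by the word size", and the names `VOWELS`, `longest_run` and `extract_features` are assumptions, since the article does not show its extraction code.

```python
# Assumption: the article does not say which letters count as vowels.
# Accented letters are treated as consonants in this sketch.
VOWELS = set("aeiouy")


def longest_run(word, want_vowels):
    """Length of the longest run of consecutive vowels (or consonants)."""
    best = current = 0
    for ch in word:
        if ch.isalpha() and (ch in VOWELS) == want_vowels:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


def extract_features(word):
    """Return the 4 features listed above for a non-empty word."""
    word = word.lower()
    size = len(word)
    vowel_count = sum(1 for ch in word if ch in VOWELS)
    return [
        size / 30.0,                      # word size, assuming at most 30 characters
        vowel_count / size,               # vowel ratio
        longest_run(word, True) / size,   # longest vowel run, normalized
        longest_run(word, False) / size,  # longest consonant run, normalized
    ]


print(extract_features("beautiful"))
```

Each word then becomes one row of 4 numbers, which is the shape of input the classifier works on.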

After training on a set of 6,000 French/English words, the classifier labelled a validation set of 230,000 words with a 66.8% success rate.

This is a very simple problem suggested by a friend to test Theano. I think it would be possible to get better results by extracting more features from each word and using a better classifier, but I chose to keep things simple.

A machine learning library :

Theano is a Python library defined as a « linear algebra compiler ». It facilitates the definition, compilation and evaluation of multidimensional mathematical expressions. It also provides a way to compute mathematical expressions with transparent use of a GPU.

Let’s study a small code example to get an overview of what is possible with Theano. The goal of this example is to define a matrix-addition function.

As we would do in math we first define the type of our variables :

from theano import function, tensor
a = tensor.dmatrix('a')
b = tensor.dmatrix('b')

This means that a and b are two matrices (the constructor parameter is a name used by the pretty-printer, for example when an exception is raised).

Then, we define the function which takes two matrices as parameters and returns their sum.

# We first define the input of the function as :
X = [a, b]
# then the output :
Y = a + b
# Then we ask theano to compile the function f: X -> Y
f = function(X, Y)

At this step, we can use f to add matrices, and Theano can run the evaluation on a GPU if it is configured to do so ! Now, let’s look at a simple regression problem.

What is regression ?

Regression problems are common machine learning problems.

The solution of a regression problem is a function (note that in most cases d’ is equal to 1) :

$$f : \mathbb{R}^{d} \rightarrow \mathbb{R}^{d'}$$

To find this function, we only know « examples » of the output of f for a finite set of inputs. This set of examples is called the « training-set ». A regression algorithm learns a function that fits the training-set, in the hope that it generalizes to data outside the training-set.

There are many methods to find this function, such as :

  • Regression trees
  • Neural networks
  • Linear regression

Each of these methods uses a different model. A model is an a priori (also called bias) on the set of functions that may fit the one we are looking for. We only train the model to fit the training-set.

It is impossible to design a machine learning algorithm without this a priori. Without it, you would not be able to generalize your function to unseen examples (for more information, see : bias-variance dilemma).

What is linear regression ?

With linear regression you approximate a function using a linear transformation of your input vector. The model represents your function as a line (in 2 dimensions), a plane (in 3 dimensions) or a hyperplane (in higher dimensions).

f, the model which is your approximation of the target function F, is defined as :

$$f(X) = W \cdot X = \sum_{j=1}^{d} w_j x_j$$

W is a weight vector (initially random) :

$$W = (w_1, w_2, \dots, w_d)$$

Your function is thus a dot product between your weight vector and the input. You will try to learn the weight vector W that gives the best linear approximation f of F on the training-set.

Training-set definition :

Your function will learn from several examples. This set of examples is called the training-set (Ξ). Because each input is a d-dimensional vector and you have n examples, your training-set is an n×d matrix.

For each training example Xi (the ith row of Ξ) you have an output Oi = f(Xi) and an expected output Ei = F(Xi).

Computing the error made :

As you can see, for each example you can compute Oi, the dot product between the ith example and the weight vector. The squared error (positive and differentiable) made on the training-set is defined as :

$$Err(W) = \sum_{i=1}^{n} (E_i - O_i)^2$$
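
To make the two definitions concrete, here is a minimal plain-Python sketch that computes the outputs Oi of a linear model and the squared error summed over a tiny training-set. The data and the function names (`dot`, `outputs`, `squared_error`) are made up for the illustration and are not part of the article's Theano code.

```python
def dot(x, w):
    """Dot product between one example x and the weight vector w."""
    return sum(xj * wj for xj, wj in zip(x, w))


def outputs(training_set, weights):
    """O_i = f(X_i) for every row X_i of the training-set."""
    return [dot(x, weights) for x in training_set]


def squared_error(training_set, expected, weights):
    """Sum over the examples of (E_i - O_i)^2."""
    return sum((e - o) ** 2
               for e, o in zip(expected, outputs(training_set, weights)))


# Hypothetical training-set: n = 3 examples, d = 2 dimensions,
# generated by the target function F(x) = 1*x1 + 2*x2.
training_set = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
expected = [5.0, 2.0, 3.0]

print(squared_error(training_set, expected, [1.0, 1.0]))  # a poor guess for W
print(squared_error(training_set, expected, [1.0, 2.0]))  # the weights used by F
```

The error is zero only when the weights reproduce every expected output, which is what training tries to approach.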

The idea of linear regression, trained by gradient descent, is to :

  1. compute the gradient vector of the error made on the training-set
  2. move the weight vector one step in the direction opposite to the gradient
  3. recompute the gradient vector at the new position
  4. repeat steps 2 and 3 until k iterations are done

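The error gradient vector is defined as :

$$\nabla Err(W) = \left( \frac{\partial Err}{\partial w_1}, \dots, \frac{\partial Err}{\partial w_d} \right)$$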

This vector tells, for each weight, how the error changes when that weight varies. If we follow the opposite direction of this vector, the error therefore decreases.

You can easily calculate the partial derivative for each component of the weight vector by hand, but Theano already does the job for you !
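
For the purely linear model (before any sigmoid is added), the partial derivatives can be written by hand: dErr/dw_j = -2 * sum_i (E_i - O_i) * x_ij. The sketch below, in plain Python with made-up data and hypothetical function names, computes this gradient and compares it with a finite-difference estimate, which is a handy sanity check when no automatic differentiation is available.

```python
def dot(x, w):
    return sum(xj * wj for xj, wj in zip(x, w))


def squared_error(training_set, expected, weights):
    return sum((e - dot(x, weights)) ** 2
               for x, e in zip(training_set, expected))


def error_gradient(training_set, expected, weights):
    """Analytic gradient: dErr/dw_j = -2 * sum_i (E_i - O_i) * x_ij."""
    grad = [0.0] * len(weights)
    for x, e in zip(training_set, expected):
        residual = e - dot(x, weights)
        for j, xj in enumerate(x):
            grad[j] -= 2.0 * residual * xj
    return grad


def numeric_gradient(training_set, expected, weights, h=1e-6):
    """Central finite differences, used only to check the formula above."""
    grad = []
    for j in range(len(weights)):
        up, down = list(weights), list(weights)
        up[j] += h
        down[j] -= h
        grad.append((squared_error(training_set, expected, up)
                     - squared_error(training_set, expected, down)) / (2 * h))
    return grad


# Hypothetical data: 3 examples in 2 dimensions.
training_set = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
expected = [5.0, 2.0, 3.0]
weights = [1.0, 1.0]

print(error_gradient(training_set, expected, weights))
print(numeric_gradient(training_set, expected, weights))
```

Both components are negative here, so stepping against the gradient increases both weights, which moves them toward the values that generated the data.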

Understand the gradient descent

To understand the gradient descent algorithm, I like to visualize the problem in a 2-dimensional instance.

You want to find a 2-dimensional weight vector W that minimizes the error. You know the expression of the error (which depends on w1 and w2) and its gradient.

The algorithm first picks a random position, then follows the direction of steepest slope downhill for as long as possible. It will thus find a local minimum of the error, which is potentially a good approximation of the mysterious function F seen through the examples.

You can imagine that you throw a ball onto the graph and let physics do its work while the ball rolls down. If you’re not satisfied, just throw it again !
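
The ball analogy can be sketched in plain Python. The error surface below is hypothetical (it is not the word-classification error): it was chosen because it has two minima, at (-1, 0) and (1, 0), so different throws can land in different places. The step size and iteration count are also arbitrary choices for this illustration.

```python
import random


def error(w1, w2):
    # Hypothetical error surface with two minima, at (-1, 0) and (1, 0).
    return w1 ** 4 - 2 * w1 ** 2 + w2 ** 2


def gradient(w1, w2):
    # Partial derivatives of the error above with respect to w1 and w2.
    return (4 * w1 ** 3 - 4 * w1, 2 * w2)


def throw_ball(rng, step=0.05, iterations=500):
    """Start at a random position and follow the slope downhill."""
    w1, w2 = rng.uniform(-2, 2), rng.uniform(-2, 2)
    for _ in range(iterations):
        g1, g2 = gradient(w1, w2)
        w1 -= step * g1
        w2 -= step * g2
    return w1, w2


rng = random.Random(0)  # fixed seed so that the throws are reproducible
landings = [throw_ball(rng) for _ in range(5)]
for w1, w2 in landings:
    print("landed at (%.3f, %.3f), error %.3f" % (w1, w2, error(w1, w2)))
```

Which minimum a throw reaches depends only on where it starts, which is why throwing again can give a different result.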

Play with Theano

In our case, we need to approximate a function which gives the probability that a word is French given its features : P(french | X). The probability that the word is English is then 1 – P(french | X).

The result of this function therefore lies in [0, 1]. The result of a linear transformation, however, is not bounded. We can apply a sigmoid function to the result of the linear transformation to turn it into a probability :

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

This function is bounded between 0 and 1, and its value at the origin is 0.5. It is also differentiable, so we will still be able to compute the error gradient.
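
Here is a small plain-Python sketch of the logistic sigmoid and its derivative, to check the properties mentioned above. The branch on the sign of x is an implementation choice of this sketch: it avoids an overflow in `math.exp` for very negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_derivative(x):
    """The derivative can be expressed with the sigmoid itself: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


print(sigmoid(0.0))             # value at the origin
print(sigmoid(-5.0), sigmoid(5.0))
print(sigmoid_derivative(0.0))  # slope at the origin
```

Because the derivative exists everywhere, the squared error built on top of the sigmoid stays differentiable with respect to the weights.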

Define the data :

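from theano import function, tensor
# Symbolic matrices: training set (n x 4), weights (4 x 1), expected outputs (n x 1)
TS = tensor.matrix('training-set')
W = tensor.matrix('weights')
E = tensor.matrix('expected')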

Define the model and its error:

# O is the matrix (column) containing the output for each example
O = tensor.nnet.sigmoid(tensor.dot(TS, W))
# def_err is the sum of the squared errors over all examples
def_err = ((E - O) ** 2).sum()

Compile the error function and its gradient :

# Theano is magic
err = function([W, TS, E], def_err)
grad_err = function([W, TS, E], tensor.grad(def_err, W)) # ...and this is very dark magic

Regression algorithm :

import numpy

# Learning rate: size of each step taken along the gradient
precision = 0.001
# Load the dataset containing the features for each word of the training set
# (load_dataset and load_expected_outputs are helpers not shown in this article)
dataset = load_dataset()
# Load the expected output vector E containing 1.0 for french, 0.0 for english for each word of the TS
expected_output = load_expected_outputs()
# Initial random weight vector (one weight per feature)
weights = numpy.random.standard_normal((4, 1))

for i in range(100):
    err_val = err(weights, dataset, expected_output)
    err_grad_val = grad_err(weights, dataset, expected_output)
    # Step against the gradient to reduce the error
    weights -= precision * err_grad_val
    print("Iteration " + str(i) + ", squared error : " + str(err_val))

print("-------- Training result -------")
print("Final squared error : " + str(err(weights, dataset, expected_output)))
print("Computed weight vector :")
print(weights)
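
Once a weight vector has been learned, it can be used to label words and to measure a success rate on a validation set. The plain-Python sketch below shows one way to do it. The 0.5 decision threshold, the weight values and the three example rows are all assumptions made up for the illustration: the article does not show how its success rate was computed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_french(features, weights):
    """P(french | X): sigmoid of the dot product between features and weights."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)))


def success_rate(examples, labels, weights, threshold=0.5):
    """Fraction of examples whose predicted label matches the expected one."""
    hits = 0
    for features, label in zip(examples, labels):
        predicted = 1.0 if p_french(features, weights) >= threshold else 0.0
        if predicted == label:
            hits += 1
    return hits / len(examples)


# Made-up weights and validation rows (4 features each); 1.0 = french, 0.0 = english.
weights = [0.5, 2.0, -1.0, -3.0]
examples = [
    [0.30, 0.5, 0.2, 0.1],
    [0.20, 0.3, 0.1, 0.5],
    [0.25, 0.6, 0.3, 0.2],
]
labels = [1.0, 0.0, 0.0]

print(success_rate(examples, labels, weights))
```

With these toy numbers two of the three rows are labelled correctly; on real data the same function would be applied to the feature rows of the validation words.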

Learn more

You will find great documentation and many tutorials about this library on