The Transformer architecture revolutionized the world of Natural Language Processing. Hybridization with Quantum Computing may give it an additional boost on the long road toward the Holy Grail of understanding.

In recent years, a novel neural network architecture called the Transformer, first introduced in the ground-breaking paper Attention is all you need [arXiv:1706.03762], has revolutionized the analysis of sequential data, with particular focus on Natural Language Processing tasks such as Machine Translation or the generation of text from a human prompt. Generalizations of the Transformer architecture have led researchers to applications in various other fields, such as image generation (an image is a sequence of pixels after all).

However, the main issue with this marvellous kind of neural network is the appalling number of parameters (on the order of hundreds of billions…


Using the PennyLane Quantum Machine Learning library it’s easy to create a quantum/digital-hybrid equivalent of the glorious LSTM layer.

In recent years, the Toronto-based startup Xanadu has introduced a Python framework called PennyLane, which allows users to create hybrid Quantum Machine Learning (QML) models. While it’s still too early to claim that quantum computing has taken over, there are some areas where it can give an advantage, for example drug discovery or finance. One field that so far has been poorly explored in QML is Natural Language Processing (NLP), the sub-field of Artificial Intelligence that gives computers the ability to read, write and to some extent comprehend written text.

As documents are usually presented as sequences of…


Thoughts and Theory

Quantum Annealers are a class of quantum computers that can help solve NP-hard and NP-complete problems. Here’s an example with practical implications for social networks, recommendation systems and more.

A graph is a data structure composed of a set of nodes connected by edges. Graphs are everywhere: they can represent a network of friendships, the connections between factories and stores, airports, and so on. Among the many operations that one can apply to a graph to extract useful information (in itself a giant rabbit hole), probably the most obvious one is partitioning, i.e. the separation of N nodes into K groups based on some similarity or distance criterion. …


The reality of Unidentified Aerial Phenomena seems beyond question, but their true nature is up for debate. For once, let’s put the cart before the horse: we may still learn something about how gravity works.


What if Unidentified Aerial Phenomena (UAPs) are not only real, but indeed less than ordinary objects, yet not necessarily artifacts such as spacecraft, or anything requiring some other otherworldly explanation? Can we make any sense of those sightings? As is often the case in the news, a good starting point is to frame the conversation using what former intelligence officer Luis Elizondo calls the “five observables”:

  1. Anti-gravity lift
  2. Sudden and instantaneous acceleration
  3. Hypersonic velocities without signatures (e.g. vapour trails and sonic booms)
  4. Low observability, or cloaking
  5. Trans-medium travel (e.g. from air to water)

Although this subject is very poorly studied in academic settings…


Hands-on Tutorials

Classification with multiple categories is a common problem in Machine Learning. Let’s dive into the definition of the most commonly used loss function.

One of the most prominent tasks at which Machine Learning has historically been very good is classifying items (e.g. images, documents, sounds) into different categories. Especially in recent years, advancements in the hardware capable of executing mathematical models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs, LSTMs) made it possible to perform a quantum leap (sometimes literally) in performance. However, defining a model is only half of the story. …


A simple and robust method to compare the effectiveness of classification models should not be made complicated by too many different definitions.

One of the most common applications of Machine Learning is to classify entities into two distinct, non-overlapping categories. Over the years, several methods have been devised, ranging from the very simple to the more complex to near black boxes. One common question that comes up when dealing with almost any kind of model is how to compare the performance of different methods, or of different settings of their parameters. Luckily, in the case of binary classifiers, there is a simple metric that captures the essence of the problem: it’s the Area Under the Curve (i.e.


Matrices full of zeroes are very common, especially in Machine Learning. Here are a few things to remember to make calculations more efficient.


A matrix is a table of numbers. A number in a matrix is usually called an element and is indexed by two integers giving its row and its column. Matrices whose elements are mostly 0’s are called sparse; otherwise they are called dense. It turns out that most large matrices are sparse: e.g., those appearing in language processing (NLP) or recommendation systems, because they represent information about the interaction between entities that are rarely in contact with each other. In most applications, matrices are not just used to hold information but are also processed to extract meaningful results. If most of…
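The difference between the dense and the sparse view of a matrix can be illustrated with a minimal sketch. It uses plain Python only and a dictionary keyed by (row, column); the names `to_sparse` and `sparse_matvec` are hypothetical helpers made up for this illustration, not functions from any library, and real projects would normally rely on a dedicated sparse-matrix package instead.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: only non-zero elements are stored, keyed by (row, column).
SparseMatrix = Dict[Tuple[int, int], float]


def to_sparse(dense: List[List[float]]) -> SparseMatrix:
    """Keep only the non-zero elements of a dense matrix."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }


def sparse_matvec(matrix: SparseMatrix, vector: List[float], n_rows: int) -> List[float]:
    """Multiply a sparse matrix by a dense vector, visiting only the stored elements."""
    result = [0.0] * n_rows
    for (i, j), value in matrix.items():
        result[i] += value * vector[j]
    return result


dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
sparse = to_sparse(dense)  # 3 stored elements instead of 12
product = sparse_matvec(sparse, [1, 1, 1, 1], n_rows=len(dense))
```

The work done by `sparse_matvec` grows with the number of non-zero elements rather than with the full size of the table, which is where the efficiency gain for sparse matrices comes from.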


Finding the optimal solution by breaking it down into smaller problems doesn’t have to be hard

Dynamic programming is a general method to solve optimization problems (“find the maximum/minimum/shortest/longest…”) by breaking them down into smaller ones (“divide and conquer”) and keeping track of their solutions (“memoization”) to make more efficient use of information and resources. Memoization usually goes hand in hand with recursion, since each recursive call can reuse a stored result instead of recomputing it. If you find dynamic programming confusing, it’s probably because you don’t understand why it’s called “dynamic” in the first place. It turns out that the name is deliberately misleading for a historical reason.
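As a rough illustration of the idea, here is a minimal sketch of a “find the minimum” problem solved with recursion plus memoization. The coin-change problem and the function name `min_coins` are assumptions chosen for this example and are not taken from the article; the cache comes from the standard-library decorator `functools.lru_cache`.

```python
from functools import lru_cache
from typing import Tuple


def min_coins(amount: int, coins: Tuple[int, ...]) -> int:
    """Smallest number of coins that add up to `amount`, or -1 if impossible."""

    @lru_cache(maxsize=None)  # memoization: each sub-amount is solved only once
    def solve(remaining: int) -> float:
        if remaining == 0:
            return 0
        best = float("inf")
        for coin in coins:
            if coin <= remaining:
                # Smaller sub-problem: the same question for the amount left over.
                best = min(best, 1 + solve(remaining - coin))
        return best

    answer = solve(amount)
    return -1 if answer == float("inf") else int(answer)


fewest = min_coins(8, (1, 4, 5))   # 4 + 4 uses two coins
impossible = min_coins(3, (2,))    # no combination of 2s adds up to 3
```

Without the cache, the same sub-amounts would be solved over and over again; with it, each one is computed a single time and then looked up.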

A Simple Example

Let’s say you want to calculate 6 x 2. …


Analyzing graphs to find optimal paths doesn’t have to be hard

A graph is a data structure composed of a set of objects (nodes) equipped with connections (edges) among them. Graphs can be directed if the connections are oriented from one node to another (e.g. Alice owes money to Bob), or undirected if the orientation is irrelevant and the connections just represent relationships (e.g. Alice and Bob are friends). A graph is said to be complete if all nodes are connected to each other. A directed graph with no cycles is said to be acyclic. A tree is an undirected graph in which any two nodes are connected by exactly one…
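The definitions of complete and acyclic graphs above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is stored as a dictionary mapping each node to the set of its neighbours, every neighbour must also appear as a key, and the helper names `is_complete` and `is_acyclic` are hypothetical.

```python
from typing import Dict, Set

# Assumed representation: node -> set of neighbours (every neighbour is also a key).
Graph = Dict[str, Set[str]]


def is_complete(graph: Graph) -> bool:
    """True if every node is connected to every other node."""
    nodes = set(graph)
    return all(graph[node] >= nodes - {node} for node in nodes)


def is_acyclic(graph: Graph) -> bool:
    """True if a directed graph has no cycles (depth-first search)."""
    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}

    def visit(node: str) -> bool:
        state[node] = in_progress
        for neighbour in graph[node]:
            if state[neighbour] == in_progress:
                return False  # we came back to a node on the current path: a cycle
            if state[neighbour] == unvisited and not visit(neighbour):
                return False
        state[node] = done
        return True

    return all(state[node] != unvisited or visit(node) for node in graph)


# Undirected friendships are stored in both directions.
friends: Graph = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice", "Carol"},
    "Carol": {"Alice", "Bob"},
}
# Directed debts: Alice owes Bob, Bob owes Carol.
debts: Graph = {"Alice": {"Bob"}, "Bob": {"Carol"}, "Carol": set()}
# Alice owes Bob and Bob owes Alice: a cycle.
circular: Graph = {"Alice": {"Bob"}, "Bob": {"Alice"}}
```

Here `is_complete(friends)` holds because everyone is friends with everyone else, `is_acyclic(debts)` holds because the chain of debts never returns to its starting point, and `is_acyclic(circular)` does not.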


How a statistical test used to uncover tax frauds may help us find an explanation to a long-standing puzzle in particle physics

Humans like looking for patterns, be it shapes in the clouds or relationships among numbers. We probably evolved to do so for survival, and just can’t help it. In science, pattern recognition helps researchers to tell the shape of galaxies or to identify the decay of short-lived fundamental particles such as top quarks. Especially in physics, scientists have been scratching their heads for years trying to come up with an explanation of the apparently random distribution of the masses of fundamental particles, or at least those that belong to the Standard Model of Particle Physics. Is there…

Riccardo Di Sipio

NLP Machine Learning engineer at Ceridian. Previously smashing protons at the CERN LHC. Views are my own.
