Expediting drug discovery with Machine Learning
We know that developing new drugs is a time-consuming and costly process. To ensure both patient safety and drug effectiveness, prospective drugs must undergo a comprehensive and lengthy approval procedure, which significantly increases costs. Researchers test millions of chemical compounds, yet only a handful progress to preclinical or clinical testing. The abundance of massive data sets and advanced algorithms has led to increased use of computational tools in the early stages of drug development in recent decades. Machine learning (ML) approaches have been of special interest, since they can be applied at several steps of the drug discovery process, such as predicting target structure, predicting the biological activity of new ligands through model construction, discovering or optimizing hits, and building models that predict the pharmacokinetic and toxicological (ADMET) profile of compounds. We are convinced of the efficiency of ML techniques combined with traditional approaches to medicinal chemistry problems. ML techniques commonly used in drug design include support vector machines, random forests, decision trees, and artificial neural networks. Artificial intelligence techniques can not only reduce the time it takes to conduct a trial but also speed up approval, meaning a drug can reach the market as quickly as possible. This can result in cost savings, more treatment options, and more affordable therapies for those who need access to the medicine in question.
Our vision: The early stages of deploying AI and ML in life sciences are very promising. We want to support the pharmaceutical innovation process. We can take over the data processing and data science activities from pharma and cosmetics companies and shorten the time between a product's concept, development, and final approval. We want to prove that without exploiting the full potential of ML and AI, the life science industry will not be able to meet society's expectations.
How we can help: We can build and train ML models and provide software infrastructure capable of processing thousands of variables.
Support vector machine
Support vector machines (SVMs) are a family of machine learning algorithms that can be used for classification and regression tasks. They extend basic linear classifiers by finding the decision boundary with the maximum margin between classes, which makes them effective at a variety of tasks.
Their main applications include anomaly detection, handwriting recognition, and text classification. Their key characteristics are flexibility, high performance, computational efficiency, and the ability to learn from fewer training examples.
Support vector machines are supervised machine learning algorithms, which means they need to be trained on labeled data.
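As a concrete illustration, here is a minimal sketch of training an SVM classifier with scikit-learn. The synthetic dataset merely stands in for real compound descriptors, and the parameter values are illustrative, not tuned.

```python
# Minimal SVM classification sketch with scikit-learn.
# The synthetic data stands in for real compound descriptors.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 500 labeled examples with 20 numeric features each
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)   # radial basis function kernel, illustrative settings
clf.fit(X_train, y_train)        # learn from the labeled training examples
accuracy = clf.score(X_test, y_test)
```

The same `fit`/`score` pattern applies whatever the feature set, which is part of what makes SVMs easy to slot into a larger screening pipeline.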
Decision trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
Some advantages of decision trees are:
- Simple to understand and to interpret. Trees can be visualized.
- Require little data preparation. Other techniques often require data normalization, creation of dummy variables, and removal of blank values.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
- Able to handle both numerical and categorical data.
- Able to handle multi-output problems.
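The interpretability mentioned above can be seen directly: the rules a fitted tree has learned can be printed as plain text. A minimal sketch with scikit-learn, using the classic iris dataset as a stand-in for chemical data (the depth limit is an illustrative choice):

```python
# Fit a small decision tree and print its learned rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # depth capped for readability
tree.fit(X, y)

# The learned decision rules are human-readable if/else splits
print(export_text(tree))
```

Each line of the printed output is a threshold test on one feature, which is exactly the "simple decision rules" the method learns.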
Random forests
Random forests are machine learning ensembles composed of multiple decision trees (hence the name "forest"). The main problem with a single decision tree is that it does not create smooth boundaries between classes unless it is grown into very many branches, in which case it becomes prone to "overfitting": performing very well on training data but poorly on novel examples from the real world. Random forests overcome this by training many trees on random subsets of the data and features and aggregating their predictions, which ensures the model does not get caught up in the specific confines of any single decision tree.
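A minimal sketch comparing a single decision tree with a random forest on the same synthetic data (dataset and parameters are illustrative; cross-validation is used so both models are scored on held-out examples):

```python
# Compare a single decision tree with a random forest via cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a handful of informative features
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 trees in the ensemble

# Mean accuracy over 5 held-out folds for each model
tree_acc = cross_val_score(single_tree, X, y, cv=5).mean()
forest_acc = cross_val_score(forest, X, y, cv=5).mean()
```

On held-out data the averaged ensemble typically generalizes better than any one of its trees, which is the overfitting reduction described above.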
Artificial neural networks
The core component of artificial neural networks (ANNs) is the artificial neuron. Each neuron receives inputs from several other neurons, multiplies them by assigned weights, adds them up, and passes the sum on to one or more neurons. Some neurons apply an activation function to the sum before passing it on to the next layer.
Artificial neural networks are composed of an input layer, which receives data from outside sources (data files, images, hardware sensors, microphones…), one or more hidden layers that process the data, and an output layer that provides one or more data points based on the function of the network. For instance, a neural network that detects people, cars, and animals will have an output layer with three nodes, while a network that classifies bank transactions as safe or fraudulent will have a single output.
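The weighted-sum-and-activation behavior described above can be sketched in a few lines of plain Python. The weights and biases below are arbitrary illustrative values, not learned ones; a real network would adjust them during training.

```python
# A hand-rolled forward pass: 3 inputs -> 2 hidden neurons -> 1 output.
import math

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, then a sigmoid activation
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    # One neuron per row of weights; every neuron sees all inputs
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Arbitrary illustrative weights and biases (an untrained network)
x = [0.5, -1.2, 3.0]                                        # input layer
hidden = layer(x, [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]],
               [0.0, 0.1])                                  # hidden layer
output = layer(hidden, [[0.7, -0.6]], [0.2])[0]             # single output node
```

Because of the sigmoid, the single output lands between 0 and 1, which is why one output node suffices for a binary decision such as safe versus fraudulent.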