AI projects: How Strongbytes defines the suitable approach

According to a 2018 Deloitte report that surveyed early adopters of cognitive technologies, 76% believed that AI tools would “substantially transform” their companies by 2020. Artificial intelligence has the potential to reshape every business, from automating business processes to enhancing decision-making and providing smarter products and services.

However, not all AI projects succeed. Data from Gartner shows that 60% of big data projects fail to achieve their goals, and Gartner analyst Nick Heudecker has suggested the real figure may be closer to 85%. Why? Because companies fail to plan: they neglect to understand the business and to clearly define objectives and success metrics.

When working on AI-based projects, the standard software development lifecycle is no longer enough. Introducing different algorithmic approaches, new data sources or other cognitive technologies into the equation brings new challenges to the table.

Strongbytes has experience in building and training predictive models. Our core strength, however, lies in making these models operational by creating scalable, secure and performant solutions around them. We manage to do that by following an approach that Strongbytes has put together based on the Team Data Science Process (TDSP).

A well-known statement about AI systems says that predictions are only as good as the data used to train the model. Therefore, paying close attention to understanding, preparing and cleaning the data is a key step.


Identify data sources and define SMART objectives

Before diving into the project, we take the time to analyze the situation and the data we have available or might need to gather, and come up with a strategy that sets the implementation up for success.

The first and most important step is to sit down with our clients and identify their business needs and the current challenges they want to tackle with the help of machine learning. Based on this information, we can define the model targets and the metrics linked to them, which we use to check whether the project is going in the right direction. For example, income or expense forecasts can serve as model targets.

During this step, we also ask a series of questions to understand what data we have available and pinpoint the key data issues. Some of the main points we go through include:

  • Determining whether we have the right data available to fulfill the AI priorities.
  • Determining whether we have enough data and, if not, deciding where we could acquire more, ideally in a more strategic way.
  • Understanding if we need to look for external data sources or update the existing systems to collect new data.
  • Understanding the implications of using certain data and how this might impact users/clients/employees.

Once the key business variables that the analysis needs to predict (model targets) have been identified, we move on to defining the project objectives. Depending on the business goal we want to achieve, there are five types of questions machine learning can help us answer:

  • How much or how many?  In this case, the goal would be to use regression models to predict a continuous value (e.g. predicting prices of a house given features such as size, average area cost, etc. ).
  • Which category?  In this case, we identify which of a set of categories a new observation belongs to. In machine learning and statistics, this is called classification. An example would be predicting the gender of a person by their handwriting style.
  • Which group? In machine learning, similar examples are usually grouped together as a first step to understand a subject (data set) in a system. This is called clustering. For instance, common applications for clustering include market segmentation, social network analysis and search result grouping.
  • Is this weird? Anomaly detection is used to identify cases that are unusual within data that is seemingly homogeneous (e.g. unusual shopping patterns can help companies identify fraudulent transactions).
  • Which option should be taken? A recommendation system is able to predict the user responses to the options available. Recommender systems are one of the most successful and widespread applications of machine learning technologies in business.
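As a small illustration of the “Is this weird?” question, here is a minimal anomaly-detection sketch using a simple z-score rule on purchase amounts. The data, function name and threshold are all hypothetical; real fraud detection would use far richer features and models.

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag values whose z-score (distance from the mean in standard
    deviations) exceeds the given threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical -- nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical purchase amounts with one unusual transaction.
purchases = [20, 22, 19, 21, 23, 20, 500]
print(zscore_anomalies(purchases, threshold=2.0))
```

The same question-driven framing helps pick the model family early: regression for “how much?”, classification for “which category?”, clustering for “which group?”, and so on.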

Understand the data

The next logical step is to dig deeper into the data at hand. Before we actually start training the model, we use data summarization and visualization techniques to audit the quality of the data. This involves identifying missing values and fixing discrepancies. Clean data helps us make a better decision about the predictive model we are going to use for our target. According to Cathy O’Neil, who wrote Weapons of Math Destruction, we often don’t understand the data we feed into our systems, and data bias is becoming a massive problem. Therefore, data understanding is a mandatory step when working on AI projects.
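A data audit of this kind can be sketched in a few lines of pandas. The data set and column names below are hypothetical; the point is surfacing missing values and discrepancies before any modeling happens.

```python
import pandas as pd

# Hypothetical raw data with the kinds of issues a data audit should surface:
# missing values and inconsistent casing in a categorical column.
df = pd.DataFrame({
    "monthly_income": [3200.0, None, 4100.0, 3900.0, None],
    "region": ["north", "north", "south", "SOUTH", "east"],
})

# Summarize completeness before modeling.
missing = df.isna().sum()
print(missing["monthly_income"])  # number of income values to impute or drop

# Fix an obvious discrepancy: normalize casing so "south" and "SOUTH"
# are treated as the same category.
df["region"] = df["region"].str.lower()
print(sorted(df["region"].unique()))
```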

Build the machine learning model

The principle of the Minimum Viable Product (MVP) also applies when working with machine learning models. The main goals for this step are to refine the data features for building the ML model, create an MVP ML model that predicts the target most accurately, further develop the MVP into a product that is suitable for production and then deploy it.

Determining the features

The raw variables identified in the previous step are now aggregated and transformed, resulting in the features used in the analysis. For this, we need to understand how the features relate to each other and how the machine learning algorithms will use them.
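A typical aggregation looks like the sketch below: raw transaction rows are rolled up into per-customer features. The data and feature names are hypothetical, chosen only to illustrate the raw-variables-to-features step.

```python
import pandas as pd

# Hypothetical raw transaction log; in practice these rows come from the
# data sources identified earlier in the process.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [30.0, 50.0, 10.0, 20.0, 15.0],
})

# Aggregate raw rows into per-customer features for the model.
features = transactions.groupby("customer_id")["amount"].agg(
    total_spent="sum",
    avg_order="mean",
    n_orders="count",
).reset_index()
print(features)
```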

Training the model

The process for model training usually includes the following steps:

  • Split the input data randomly for modeling into a training data set and a test data set.
  • Build the models by using the training data set.
  • Evaluate the models on the training and test data sets. Use a series of competing machine learning algorithms, along with their associated tuning parameters (known as a parameter sweep), geared toward answering the question of interest with the current data.
  • Determine the suitable solution to answer the question by comparing the success metrics between alternative methods.
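The steps above can be sketched with scikit-learn. The synthetic data set, candidate models and parameter grids are illustrative assumptions; a real project would sweep many more algorithms and parameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Step 1: split the input data randomly into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Steps 2-3: fit competing algorithms, each with its own parameter sweep.
candidates = {
    "logistic": GridSearchCV(
        LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}
    ),
    "tree": GridSearchCV(
        DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}
    ),
}

# Step 4: compare the success metric (here, test-set accuracy)
# between the alternative methods and keep the best performer.
scores = {}
for name, search in candidates.items():
    search.fit(X_train, y_train)
    scores[name] = search.score(X_test, y_test)

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```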

Deploy the ML model

This is the step where data infrastructure becomes a core component of the process. Here, we analyze whether changes to the ML model are needed and make them carefully, to avoid affecting performance. A substantial amount of testing prior to live deployment is something we definitely keep in mind when handling a live ML product. Also, since deploying models means exposing them through an open API, we consider how to scale the infrastructure up to support the ML product.
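At its simplest, the API contract around a deployed model looks like the sketch below. The web framework and the trained model are deliberately omitted (both are assumptions), leaving only the JSON request/response handling an endpoint would perform; the house-price example echoes the regression target mentioned earlier.

```python
import json

def predict_price(size_sqm: float, avg_area_cost: float) -> float:
    # Placeholder for the trained model's inference call (hypothetical).
    return size_sqm * avg_area_cost

def handle_request(body: str) -> str:
    """Parse a JSON request body and return a JSON response, the way an
    API endpoint wrapping the model would."""
    payload = json.loads(body)
    price = predict_price(payload["size_sqm"], payload["avg_area_cost"])
    return json.dumps({"predicted_price": price})

print(handle_request('{"size_sqm": 100, "avg_area_cost": 50}'))
```

In production this handler would sit behind a web framework, with input validation, authentication and load balancing around it.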

The final step of the process is validating with clients that the model meets their needs and expectations. Additional documentation is prepared and the project is handed off to the entity responsible for operations.

Building an AI-based project closely resembles the existing software development process. However, since the technology is new and the implications are sometimes much higher, it is key to take the time to think things through. What is normally perceived as obvious or easy in standard software development can turn out to be a lot more difficult when machine learning models are involved. Following a sound methodology and investing in understanding the business goal and the available data should be the main focus.