22 January 2019, Gabriel Lachmann

How to build your own gold mine

After more than a year of collaboration, the customer has a functional department that processes business data from around the world.

After several years on various IT projects, I have noticed that, alongside certain unchanging rules, a kind of magic still exists: the miraculous re-naming of something old but well polished, which captures the attention of suppliers and customers alike. The carrier wave of these trends comes from the marketing departments of large companies, or is driven by ambitious academics. With the currently hot buzzwords of Machine Learning (ML) and Artificial Intelligence (AI), it is probably much the same. There is an opportunity. There is computing performance like never before. Easy-to-provision and "cheap" datacenters are finally available, and most importantly, there is a lot of data.

By ML and AI we usually mean implementations of algorithms that have been known for several decades. University students have years of experience implementing them in environments such as R, MATLAB, etc., and many applications such as neural networks and fuzzy matching have long been part of the devices, programs and apps we use every day. But something has triggered today's ML and AI rocket boom, and big companies are, in this case, at the tail end of the trend.

On the one hand, real Big Data now arises from a variety of data sources; on the other hand, it becomes more and more accessible. Datacenters bring enormous processing power and, in principle, unlimited storage capacity within reach of a home laptop. That alone would not be enough, however, without a radical shift in the actual implementation of the known algorithms and in the platforms that let them run in a parallelized environment. Historically there have been several major technology vendors in this field, but the real game changer is somewhere else.

In my opinion, the key role here has been played by the open-source community and by the big players who have embraced this wave, new as it is from the corporate point of view. Look under the hood of the many commercial platforms and products, proven over years, that have begun to offer miraculous math and statistics in the last few months, and you will find the same open-source libraries of algorithms and the same computing environments. There are players on the market who have been developing their own implementations of ML and AI algorithms for years, but the dominance of projects from the Apache family is obvious.

This blog aims to go deeper under the surface of the topic and closer to the real experience of implementing customer needs on corporate big data. The story began quite uneventfully. The procurement department of a major global corporation found information about our competence in open-source platforms on the EEA website and sent us a query about the potential implementation of statistical algorithms using open-source platforms.

The client wants something, but you know he really wants something else

After the first exchange of opinions and an explanation of the customer's expectations, we found ourselves in a situation familiar to every vendor's salesman and consultant: the client wants something, but you know he really wants something else.
The client "only" wants to try out a specific family of statistical algorithms and to confirm, on a small sample of data, whether the results open up space for development and investment in predicting new business opportunities, or whether it is a blind alley. Calculations may take days; what matters is the result. The delivered code must be open-source, and if finished components are used, only open-source ones are acceptable.

Experience from many projects shows that "only" is in fact the customer's way of saying that it should not be expensive and should not take long. For the contractor it means that, in terms of research and know-how, he must provide at least 90% of what he would invest in a major project (a thousand days of work).
The second requirement, that processing time does not matter, means that the customer has no robust infrastructure and imagines the solution running on ordinary computers. For the supplier it signals that the project has no infrastructure budget.
The requirement for open source is standard nowadays. Companies have learned that open-source does not mean free; above all it means minimizing the risk of vendor lock-in https://en.wikipedia.org/wiki/Vendor_lock-in – the dependency on a single vendor. Open-source also minimizes the legal risks lurking in the shadowy corners of copyrights and patents.

Looking for a solution

But back to the merchant's dilemma. The situation could be approached in different ways. This is how I evaluated the options:

  1. I could strictly comply with the customer's requirements and try to be the cheapest among the potential suppliers. The estimated budget would not be clearly defined, and all project resources would focus on programming the algorithm and attempting to handle the data sample in a reasonable time. I personally call this option nose-diving (to the bottom of price and quality), and it generally benefits neither the contractor nor the customer. Only the state administration in Slovakia can afford such a waste of potential, see https://slovensko.digital/. Corporations have the advantage of being able to use the contractor's potential during procurement itself, and price need not be the only criterion. We abandoned this alternative after an internal verification of the technological options. We tested the required sample of data in the R environment https://www.r-project.org/ with one of the required algorithms, Apriori https://en.wikipedia.org/wiki/Apriori_algorithm, and the application crashed during the first experiment due to lack of memory. The customer's input was real Big Data, and Apriori is not an algorithm that "forgets" previous iterations during its run; it needs a lot of resources (see the back-of-envelope sketch after this list). Deciding to build a better implementation of a known algorithm within our internal EEA development, one that would outperform the academic R project, could have been fun for a while – we would have happily programmed a bit – but in terms of business thinking it would have been a road to hell. On the other hand, we knew that most of our competitors would blindly follow the original requirements. We decided to teach the customer something new.
  2. The second option was to change the customer's perception of the problem and try to change the requirements themselves. With a large customer, however, it is hard to stop a running train. We had to propose a functional solution within the existing framework.
  3. The only viable option for us was to use existing solutions with already implemented algorithms built on parallel computing https://en.wikipedia.org/wiki/Parallel_computing. The open-source requirement narrowed our platform selection to the Apache and R ecosystems, and the need to process real Big Data ruled out R. The strategy was born: we would use Apache Spark https://spark.apache.org/ and the existing implementations of algorithms in MLlib https://spark.apache.org/mllib/ (a minimal sketch of such a job follows this list). We would try to convince the customer that a more robust platform is a good investment in the future development of the project. We would not be the cheapest supplier, but the most promising one.
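To make the memory problem from option 1 concrete, here is a back-of-envelope illustration. This is my own toy sketch, not the R package we tested, and the 10,000-item catalogue is an invented figure: Apriori must hold every candidate k-itemset in memory at each level while it re-scans the transactions, and in the worst case those counts grow combinatorially.

```python
# A toy illustration (not the R implementation we tested) of why Apriori
# strains memory: at level k it must hold every candidate k-itemset while
# it re-scans the transactions, and candidate counts grow combinatorially.
from math import comb

def apriori_worst_case(num_items: int, max_k: int) -> None:
    """Print the worst-case number of candidate itemsets per Apriori level."""
    for k in range(1, max_k + 1):
        print(f"level {k}: up to {comb(num_items, k):,} candidates")

# With, say, 10,000 distinct items (an assumed figure), the counts leave
# laptop RAM behind after only a few levels:
apriori_worst_case(10_000, 3)
# level 1: up to 10,000 candidates
# level 2: up to 49,995,000 candidates
# level 3: up to 166,616,670,000 candidates
```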
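And this is roughly what the option 3 strategy looks like in practice. Below is a minimal sketch with synthetic baskets and illustrative thresholds, not the delivered code: Spark MLlib's parallel FP-Growth covers the same market-basket use case as Apriori without keeping candidate itemsets in memory.

```python
# A minimal sketch of a Spark MLlib frequent-pattern job (synthetic data,
# illustrative thresholds). FP-Growth mines the same frequent itemsets and
# association rules as Apriori, but in a distributed, candidate-free way.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("frequent-itemsets-demo").getOrCreate()

# Each row is one transaction: a basket of item identifiers.
transactions = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["b", "c"]), (3, ["a", "c"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(transactions)

model.freqItemsets.show()       # frequent itemsets with their frequencies
model.associationRules.show()   # derived association rules
```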

Quick calculations?

I guess I would not be writing this story if we had not won the bid. So, to make a long story short: the customer received a demo in the form of a Docker container into which the source datasets could be mapped, and from the command line it was possible to run an algorithm that wrote its results into an output dataset. Very interesting results indeed – the calculation did not take days, it took just a few minutes. The customer was offered a solution providing a virtualized computing environment that would include, in addition to the existing algorithms, several interfaces for implementing custom algorithms, especially for preprocessing input datasets and mapping outputs back to the source data – that is, applying statistical findings to the customer's reality (more on that in future blogs).
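For illustration, here is a hypothetical sketch of what such a command-line entry point inside the container might have looked like; the argument names, file formats and the FP-Growth choice are my assumptions, not the actual delivered code.

```python
# run_rules.py - a hypothetical sketch of a command-line entry point for a
# containerized Spark job (paths, argument names and the algorithm choice
# are illustrative assumptions, not the delivered code).
import argparse
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

def main() -> None:
    parser = argparse.ArgumentParser(description="Mine association rules")
    parser.add_argument("--input", required=True, help="mapped source dataset")
    parser.add_argument("--output", required=True, help="resulting dataset")
    parser.add_argument("--min-support", type=float, default=0.01)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("demo-run").getOrCreate()
    baskets = spark.read.parquet(args.input)  # expects an 'items' array column
    model = FPGrowth(itemsCol="items", minSupport=args.min_support).fit(baskets)
    model.associationRules.write.mode("overwrite").parquet(args.output)

if __name__ == "__main__":
    main()
```

A run inside the container would then be a single command along the lines of `spark-submit run_rules.py --input /data/in --output /data/out`.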

The offer also included the Zeppelin web environment, https://zeppelin.apache.org/, which provides access to the platform through a web browser and makes the solution available to a number of data analysts at the same time. Zeppelin can unify several platforms on a single screen, and it is a very promising tool for frontend integration of systems even beyond the Spark computing platform.
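As an illustration of how analysts work with it, a Zeppelin paragraph might look like this; the dataset path and column names are invented.

```python
%pyspark
# A Zeppelin notebook paragraph (illustrative path and columns): query the
# computed rules and render them as an interactive table in the browser
# via Zeppelin's built-in display (z is the ZeppelinContext).
rules = spark.read.parquet("/data/out/rules")
z.show(rules.orderBy(rules.confidence.desc()).limit(20))
```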

How did it turn out?

After more than a year of collaboration, the customer has a functional department that processes business data from around the world, ranging from information about millions of transactions across different markets and segments down to individual product transactions with specific customers.

Using ML algorithms helps the company's business identify new customers. The system currently tells the sales team which customers are worth offering a product to and how likely they are to buy it. At the beginning, the success rate of business recommendations was estimated to be below 50%, with a target of at least 40%. Today the success rate ranges from 80 to 98%, which has surpassed all expectations. But ML is not a new brain that will replace existing salespeople; it can only learn and repeat.
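The customer's actual models are confidential, but the general pattern of scoring purchase likelihood can be sketched with MLlib in a few lines; the features and data below are invented for illustration.

```python
# Not the customer's model: a minimal sketch of the general pattern of
# scoring how likely each customer is to buy (features and data invented).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("purchase-propensity").getOrCreate()

# Historical examples: customer features plus whether they bought (label).
history = spark.createDataFrame(
    [(12, 3400.0, 1.0), (2, 150.0, 0.0), (8, 2100.0, 1.0), (1, 90.0, 0.0)],
    ["orders_last_year", "revenue", "label"],
)
assembler = VectorAssembler(
    inputCols=["orders_last_year", "revenue"], outputCol="features"
)
model = LogisticRegression().fit(assembler.transform(history))

# The 'probability' column holds [P(no buy), P(buy)] - the "how likely"
# figure the sales team sees for each customer.
model.transform(assembler.transform(history)).select("probability").show()
```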

To use the algorithms more or less in real time, it is essential to choose appropriate input parameters. If the parameters are set too "low", the algorithm yields no results; if they are set too "high", the number of outputs and combinations of results can approach infinity, and the computing device gets stuck, is overwhelmed, or at least fails to provide reasonably interpretable answers. Here lies a huge space for AI. Sometimes a difference of one thousandth in a parameter's value determines whether the algorithm "works". In this area EEA has several unique concepts and solutions: we know how to "guard" the available algorithms with artificial intelligence that continuously learns from the ongoing calculations on the customer's data (a simplified sketch of the idea follows below). The effectiveness of such an optimized and guarded computing device may be the main argument for whether it is possible to use the algorithms at all with the given data and requirements. In this, the EEA portfolio is original and currently distinguishes us from our competitors.
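EEA's actual tuning logic is proprietary, but the basic idea of guarding an algorithm can be sketched as a simple feedback loop; here, hypothetically, for FP-Growth's minSupport parameter, with invented bounds.

```python
# A deliberately simplified sketch of the guarding idea (EEA's real tuning
# logic is proprietary): walk a sensitive parameter such as minSupport from
# a safe setting toward riskier ones, stopping before results explode.
from pyspark.ml.fpm import FPGrowth

def guarded_fpgrowth(baskets, start=0.5, floor=0.001, max_itemsets=100_000):
    """Lower minSupport step by step, stopping before the output explodes."""
    min_support, usable = start, None
    while min_support >= floor:
        model = FPGrowth(itemsCol="items", minSupport=min_support).fit(baskets)
        n = model.freqItemsets.count()
        if n > max_itemsets:
            break              # one step further would drown us in combinations
        if n > 0:
            usable = model     # remember the last run that produced output
        min_support /= 2       # relax the threshold and try again
    return usable              # None: even the floor setting yielded nothing
```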

Common sense and Machine learning

As I suggested at the beginning of this blog, ML is a nice buzzword. Underneath, it is just common statistics wrapped in properly implemented algorithms. Without a reasonable requirement and appropriate data it has no value for users; with a reasonable customer requirement and available data, it has the potential of gold ore. ML can be the gold mine that moves the customer's business to a new level. Whether a customer is in a situation where ML can help him reach his goals is a question for experts in the field – it is about understanding statistics as such and its potential in the customer's domain.

To summarize, ML and AI are nothing new.

If they are to be implemented effectively, it takes nothing more than good old-fashioned work in consulting, informatics and mathematics. Without common sense and knowledge of the domain they are just empty concepts. Under the right conditions, however, they are a huge opportunity, and today you can distinguish yourself from the competition by only a few shades. Implementing ML and AI may be exactly the shade that moves you forward.

If you are considering building your own gold mine in this area, do not hesitate to contact us. We have built several and are already operating them.
