Artificial Intelligence is all the rage. Established organizations are amassing data to support data-driven decisions, and they need AI to process this data quickly and efficiently. At the core of the hype are data scientists, whose task it is to collect, structure, explore, aggregate, and ultimately learn from and optimize the data.
These scientists ensure that AI models can do their work and that the data used is both accurate and useful. To do so, they preprocess the data. Without this step, it’s impossible to make accurate predictions or well-informed decisions.
However, the vast amount of time spent on data preprocessing and finding the right data can be overwhelming. Even seasoned data scientists spend too much time on data preprocessing, leaving little time to spare on actually generating insights.
Disconnected data silos slow down data utilization, and scientists need better data models that bring the pieces of data closer together. There’s friction between where the data is and where it’s supposed to be. This makes insight-generating processes even more tedious than they already are.
It’s no wonder, therefore, that many data scientists leave companies due to unmet expectations on both sides of the table: on one hand, the job ends up being mainly data engineering rather than data science; on the other hand, the business does not get its insights or its return on investment.
Surveys and lessons learned from a wide range of data science projects suggest, almost as a rule of thumb, that roughly 80% of a data science project is data engineering. Changing this ratio, so that less time is spent on engineering and more on generating insights, is critical to meeting everyone’s expectations.
Mathematically, for every task T there are two forces at play: friction F and work W. The bigger the friction, the more work is needed to generate the same insights. Not all tasks are created equal, however: some have higher potential than others. For simplicity, let’s denote that potential as P, which acts as a multiplier for the work. The outcome O is therefore the work, scaled by the potential, minus the friction.
O = P * W - F
It’s this friction F that eats into the outcome of every single task T the data scientist needs to perform.
Great data platforms reduce the friction F, sometimes considerably. This has two consequences: (1) the impact of the outcome increases and (2) generating insights becomes easier for any colleague in the organization, regardless of their technical proficiency.
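To make the formula concrete, here is a minimal sketch of how the outcome changes when only the friction changes. The numbers are hypothetical and in arbitrary units; they are chosen for illustration and do not come from any measurement.

```python
def outcome(potential: float, work: float, friction: float) -> float:
    """Outcome O = P * W - F for a single task."""
    return potential * work - friction


# Hypothetical values in arbitrary units, for illustration only.
baseline = outcome(potential=2.0, work=10.0, friction=16.0)
improved = outcome(potential=2.0, work=10.0, friction=4.0)

print(baseline)  # 4.0
print(improved)  # 16.0
```

With the same potential and the same amount of work, cutting the friction from 16 to 4 raises the outcome from 4 to 16 in this toy example.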
So friction is bad. It’s a resource sink. But it doesn’t have to be that way. Below are five areas that companies should evaluate to reduce friction and to ensure that data-driven decisions can be made throughout the entire organization.
Data is organized according to data models, which in some business domains, such as healthcare, can be hugely complex. Even when the data is stored in structured form in a database management system (DBMS), it can be very time-consuming to find what you’re after.
Data catalogs and metadata layers, added on top of the DBMS, make it faster to find the required data.
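As a rough illustration of the idea, a catalog can be thought of as searchable descriptions of tables kept alongside the data. The sketch below is a toy in-memory version; `CatalogEntry`, `search`, and all table names are hypothetical and do not refer to any particular catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Metadata describing one table (all fields are illustrative)."""
    database: str
    table: str
    description: str
    tags: tuple = ()


def search(catalog, keyword):
    """Return entries whose metadata mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    hits = []
    for entry in catalog:
        haystack = " ".join(
            [entry.database, entry.table, entry.description, *entry.tags]
        ).lower()
        if needle in haystack:
            hits.append(entry)
    return hits


# Hypothetical entries for a healthcare-style data model.
catalog = [
    CatalogEntry("ehr", "lab_results", "Laboratory measurements per patient visit", ("clinical", "lab")),
    CatalogEntry("ehr", "patients", "Patient master data", ("clinical",)),
    CatalogEntry("billing", "invoices", "Invoices sent to insurers", ("finance",)),
]

print([entry.table for entry in search(catalog, "clinical")])
# ['lab_results', 'patients']
```

Real catalogs add ownership, lineage, and column-level descriptions, but the principle is the same: search the metadata first, then query the data.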
Step 1 helps reduce the friction within a single DBMS, but most companies use multiple databases. Together they contain and describe the company’s data assets. Unfortunately, these assets are usually stored in separate systems that are hard to combine. Having a way to integrate data across these different sources is very important in the fight against friction.
With big data platforms and data lakes, it’s finally economically feasible to store huge amounts of data in a single place. This enables the much-needed merging of data from separate systems. Although the structure of the data remains largely the same, having it in a single place reduces the burden of writing the ETL (extract, transform, load) jobs used to manipulate it, resulting in less friction, less preprocessing work, and happier data scientists.
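The benefit of co-located data can be sketched on a very small scale. In the example below, two SQLite in-memory databases stand in for two separate source systems that have been made reachable from one connection; this is an assumption made for illustration, not how a data lake is actually built. The table names and rows are hypothetical.

```python
import sqlite3

# One connection; a second in-memory database is attached as "billing".
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

# Hypothetical tables standing in for a CRM system and a billing system.
conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 50.0)],
)

# With both sources reachable in one place, the merge is a single query.
rows = conn.execute(
    """
    SELECT c.name, SUM(i.amount)
    FROM main.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

print(rows)  # [('Acme', 200.0), ('Globex', 50.0)]
```

Without a shared location, the same result would typically need an export from each system plus a custom job to stitch the two together.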
Even when data engineers have access to all the data, ETL jobs often still take too long to execute. This limits the speed at which data analytics can be run and delays insights. Because data processing is trial-and-error by nature, every slow iteration adds up.
This situation can be improved in several ways: 1) restructuring the data so that it is faster to query, which requires work on the underlying data model, 2) adding a cache layer whose sole purpose is to serve high-performance engineering and analytics requirements, or 3) leveraging distributed computing infrastructure to parallelize individual data processing tasks.
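The second and third options above can be sketched with the Python standard library alone. The example below is a single-machine stand-in, not a distributed setup: `functools.lru_cache` plays the role of the cache layer and a thread pool plays the role of parallel workers. `load_partition` is a hypothetical placeholder for a slow query against a source system.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=128)
def load_partition(partition_id: int) -> tuple:
    """Hypothetical stand-in for a slow query; results are cached per partition."""
    start = partition_id * 3
    return tuple(range(start, start + 3))


def partition_total(partition_id: int) -> int:
    """Process one partition independently of the others."""
    return sum(load_partition(partition_id))


def total(partition_ids) -> int:
    """Process partitions in parallel and combine the partial results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(partition_total, partition_ids))


first = total([0, 1, 2, 3])   # every partition is loaded from the "source"
second = total([0, 1, 2, 3])  # every partition is served from the cache

print(first, second)  # 66 66
print(load_partition.cache_info().hits)  # 4
```

Threads suit I/O-bound work such as waiting on database queries. For CPU-heavy transformations, separate processes or a distributed computing framework would be the more likely choice.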
To enable data-driven decisions, the relevant data has to be visible to every department, from the C-suite and product development to finance, HR, and marketing. Everyone should have access to data, and be able to generate insights from it, without having to write complex SQL queries.
It’s non-technical tools such as self-service dashboards and reporting tools that transform raw data records into insights without the need for engineers. This is what truly empowers the organization to act based on data. The challenge to solve here depends heavily on the data models an organization uses.
It’s also worth noting that, with proper data models, it’s easier to build APIs that bridge internal and external data, as part of the big picture can reside in the external data.
Increasing visibility and visualizing the data through dashboards, albeit a good start, may not be enough to get everyone excited. Sometimes the data needs to be transformed into a more suitable form, a set of KPIs, risk model outcomes, segments, and so forth, so as to be meaningful for the user. Most people aren’t trained in interpreting data after all, despite its increased importance.
Proper ETL jobs that enrich the raw data, whether through simple heuristics and statistics or full-blown AI systems, all aim to make the data easier to interpret. For data to be used widely across the organization, ease of interpretation is at least as important as access.
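A simple heuristic of this kind might look like the sketch below, which turns raw order records into a per-customer total and a readable segment label. The thresholds, segment names, and sample orders are hypothetical and would need to be chosen per business.

```python
from collections import defaultdict


def spend_per_customer(orders):
    """Aggregate raw (customer, amount) records into total spend per customer."""
    totals = defaultdict(float)
    for customer, amount in orders:
        totals[customer] += amount
    return dict(totals)


def segment(total_spend, high=1000.0, low=100.0):
    """Map a total to a label; the thresholds are illustrative assumptions."""
    if total_spend >= high:
        return "high value"
    if total_spend >= low:
        return "regular"
    return "occasional"


# Hypothetical raw records.
orders = [("alice", 700.0), ("bob", 40.0), ("alice", 450.0), ("carol", 250.0)]

segments = {
    customer: segment(total_spend)
    for customer, total_spend in spend_per_customer(orders).items()
}

print(segments)
# {'alice': 'high value', 'bob': 'occasional', 'carol': 'regular'}
```

A label such as "high value" is something a marketing or finance colleague can act on directly, whereas the underlying order rows are not.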
Data science and AI endeavors, and really any kind of work related to data, are heavily impacted by their underlying source systems: what the data looks like, how fast the data can be queried, and so forth. And all of these systems create friction.
When the friction isn’t minimized, it becomes a huge resource sink that eats away at the effort available for generating insights. It’s therefore absolutely crucial that any data-driven company, or any company that strives to become one, takes data preprocessing seriously.
Veracell has a long track record of working with and enabling data-driven software development. We know from experience that it’s not always apparent at first glance whether an organization is suffering from data friction. With careful analysis of the underlying data infrastructure, it’s possible to (a) understand a company’s current state and capability, and (b) identify how to improve that capability. Get in touch for a second opinion on whether you’re getting the most out of your data.