The Rise of the Declarative

There is a growing trend to make analytics application development more declarative. Prominent examples, focused on accessing and transforming data, are dplyr for R and Ibis for Python.

This is a move forward from the imperative way of developing analytics applications. The idea is to make the code more composable, reusable, and easier to write.

However, these libraries do not gel with the imperative nature of the underlying language — the developers need to switch to declarative thinking when using these libraries, and switch back to imperative thinking for other parts of the code. This is not ideal.

Rather than graft declarative constructs onto an imperative language, Sclera extends SQL.

SQL has been used for declarative data wrangling and transformation for decades, and is familiar to almost everybody who has ever worked in data management or business intelligence. A recent survey shows that SQL continues to be tremendously popular among developers.

Sclera’s scripting language, ScleraSQL, provides SQL extensions that help express complex analytics tasks in tens of lines instead of hundreds.

ScleraSQL includes extensions for streaming data access, data transformation, data cleaning, machine learning, and pattern matching, as well as “Grammar of Graphics” constructs for declarative visualization, similar to R’s ggplot2.

Data analytics is an extension of business intelligence. It makes sense, therefore, for your analytics language of choice to be an extension of SQL.

Hiring Data Scientists – Why Compromise?

The following is a popular definition of a data scientist, commonly attributed to Josh Wills:

“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

According to this definition, a data scientist is not necessarily the best statistician, and not necessarily the best software engineer.

Hiring such data scientists is clearly a compromise. The reason behind the compromise is that a data scientist needs to juggle both software engineering and statistics – so being great at one of these but not at the other might not work out for the best.

It does not have to be that way.

If I were building a data science team, I would rather hire the best statisticians and the best software engineers, and have them work together.

How can they work together?

One way is to separate the analytics logic (the “what”) and the engineering aspects (the “how”). The statisticians can then work on the analytics logic, while the engineers work on the engineering.

In Sclera, this is facilitated by high-level building blocks for data access, data transformation, data cleaning, machine learning, pattern matching, and visualization.

Statisticians specify the analytics logic by building a pipeline of these building blocks, while the engineers provide implementations of these building blocks.

From the statistician’s point of view, this results in greater productivity. The high-level analytics specification is only a few lines of ScleraSQL code – easy to write, and easy to modify for iterative experimentation. Sclera optimizes the code automatically, ensuring the best performance on the available resources.

From the engineer’s point of view, the problem is well-defined – build the most efficient implementation of a building block. The semantics are clear, so there are no distractions from ever-changing specifications, and the code is reused in a structured manner across multiple applications.

Sclera comes with a number of pre-packaged building blocks, and an SDK which can be used to write additional building blocks.

Sclera helps you get the best out of your statisticians, and the best out of your engineers. So why compromise on the hiring?

Building Usable Analytics Platforms: Untangling Data Science from Engineering

A recent survey finds lack of skills and lack of understanding of technology as the primary barriers to analytics. Over 52% of the respondents cited lack of skills, while 33% cited technology as the challenge.

This is not a new finding — and highlights the usability gap that the available analytics platforms have not been able to bridge, even years after advanced analytics came into the limelight.

In this post, we take a step back, and analyze what needs to be done. Our approach will be targeted towards users who know the rudiments of business analytics, and would rather focus on the analytics tasks than care for what lies under the hood.

Simplification by Separation of Concerns

What exactly are we looking to simplify? Using advanced analytics platforms today needs technical skills in the following areas:

  • Machine Learning: Expertise in the statistical, mathematical and algorithmic aspects of analytics. This is a deep science, and requires years of mathematical training to build the insights.
  • Software and Systems: Expertise in the systems and operational aspects of analytics; when dealing with large amounts of data, this includes expertise in the “big-data stack”. This is deep engineering, and requires mastery over building systems that work correctly, reliably, and as efficiently as possible.

As we can expect, individuals with high competence in either of these skills are not easy to find — and those with high competence in both are far rarer.

Moreover, it is not as if expertise in just one of the skills, say machine learning, makes things much easier. The following post illustrates that software engineering is more than basic coding skills, just as machine learning is more than basic arithmetic.

I am a data scientist/analyst, and my day to day is entirely in python/scikit-learn/pandas, data munging and running models. Right now my code is several hundred lines of data processing steps, filtering, lots and lots of joins and sql queries, pickle dumps and loads, print array.shape. […] Long story short, I have a physics background and was never taught how to properly structure my workflow for this type of coding. (elliott34, Hacker News, 3 Dec 2014)

Our approach in this post is to enable separation of concerns — that is, divide the role of a “data scientist” into an analyst and a systems engineer, and provide a framework that enables them to work together. In doing so, we reduce the technical requirements for the analyst to the extent possible.

Sounds impractical? Actually, this has been done in the past, and with great success, in the context of data management. To understand how, let us quickly review the evolution of database management systems.

Where have we seen that before?

At its inception in the 1960s, data management was the domain of systems engineers — out of reach of its intended users: the business analysts.

Several database products did indeed exist at that time; however, they were without exception ad hoc, cumbersome, and difficult to use—they could really only be used by people having highly specialized technical skills—and they rested on no solid theoretical foundation. (E. F. Codd’s biography, by C. J. Date)

Clearly, data management was then in the same state of affairs that advanced analytics is in now.

The need to simplify these systems motivated several efforts, culminating in E. F. Codd’s proposal of the relational model.

The relational model enabled queries over the data to be structured as a dataflow, using the relational algebra. The brilliance of relational algebra was in identifying a small number of primitives (relational operators — SELECT, PROJECT, JOIN, etc.) such that the majority of queries could be expressed as a composition of these primitives applied on the input data.
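For example (a generic SQL sketch with hypothetical table names, not specific to any product), a query listing high-value orders is just such a composition: a JOIN combines two tables, a SELECT (the WHERE clause) filters rows, and a PROJECT (the output column list) keeps only the columns of interest.

select c.name, o.total                    -- PROJECT: keep only these columns
from customers c
join orders o on (o.custid = c.id)        -- JOIN: combine the two input tables
where o.total > 1000;                     -- SELECT: filter rows by a predicate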

This simplified ad-hoc data access dramatically, as it separated the “logical” specification from the “physical” evaluation.

  • The focus of the “logical” specification was to capture the user requirement in terms of well-formed queries. As mentioned earlier, these queries were expressed in terms of the “high-level” relational operators, and were separated from the concerns of evaluation. Soon enough, accessible languages (prominently, SQL) and friendly graphical interfaces were developed to further simplify the creation of these queries. These interfaces were readily picked up by business analysts.
  • The focus of the “physical” evaluation was to evaluate the logical specification as efficiently as possible. The systems engineers contributed to this layer. They provided efficient implementation of the relational operators in the database system. Also, in their role as “database administrators”, they ensured that the system worked efficiently and reliably, and also tuned the data layout for efficient evaluation.

The value of this logical-physical separation is apparent in the success relational databases have enjoyed over the years.

Making logical-physical separation work for analytics

So, how can we achieve a similar separation of concerns in advanced analytics? By building an algebra for advanced analytics, and using it to separate the “logical” analytics specification from the “physical” analytics evaluation, including computation on the big-data stack.

Initially, let us assume that the data to be analyzed is stored in a relational database (we will relax this constraint later). Then, it makes sense to develop the analytics algebra as an extension of the relational algebra. This implies that the analytics operators should take tables as input, and emit tables as output — just like relational operators.

Logical Specification

We start by incorporating analytics constructs, such as classifiers, as first class objects — at par with tables and views — and provide an interface to create such objects.

Recall that when creating a table in a relational database, the user does not think much about whether the table will be stored as a “B+ Tree” or a “Heap File”. The user simply states the table’s properties — the set of attributes, primary/foreign keys and other constraints, and optionally provides the query which will be used to populate the table.
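For instance, a standard SQL table definition (hypothetical names, not Sclera-specific) states only these logical properties; nothing in it mentions how the rows will be laid out on disk.

create table customers(
    id       int primary key,             -- key constraint: a logical property
    location varchar(50),
    salary   int
);

-- optionally, populate a new table from a query instead of listing columns
create table highearners as
select id, location, salary from customers where salary > 100000;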

Likewise, for the purpose of the specification, the analytics objects are abstract; the details of the underlying structures and statistical models are implementation details, and should not be the user’s concern. The user should only need to provide the configuration parameters and the query that emits the training data; the system should then train the object’s model on that data.

For instance, when creating a classifier, the user provides the query for the training data, and parameters identifying the target and feature columns in the output of the query.

Next, we define analytics operators that “apply” these objects on new data points. For instance, after a classifier is built, it is used to assign class labels to new data points (input rows) — this can be captured as a relational operator that augments the input rows with a new column containing the assigned class label.

Similarly, we can incorporate “clusterer” objects, and define operators that assign clusters to input rows. For text analytics, we can have “entity extractors” as objects, and define operators that extract entities from text-valued columns in the input rows. And so on.
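For illustration, and mirroring the classified with clause shown below, a cluster-assignment operator might appear in a query as follows; this is a sketch only, and the clustered with clause and the mysegments object are assumptions made for the example, not documented ScleraSQL syntax.

-- sketch only (assumed syntax, patterned on "classified with" below):
-- assign each customer row a cluster id, emitted in the new column segment
select custid, segment
from (customers clustered with mysegments(segment));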

As with database systems, the users need not use the proposed analytics interface directly. We can extend standard SQL with statements that translate to the analytics tasks (e.g. creating a classifier), and clauses that translate to the operators (e.g. assigning class labels using the classifiers).

For example, in Sclera, the following statement trains a classifier myclassifier for identifying prospects using a survey on customers:

create classifier myclassifier(isinterested) using
select survey.isinterested, customers.location, customers.salary
from survey join customers on (survey.custid = customers.id);

Here, isinterested is specified as the classifier’s target column, and the remaining columns, location and salary, become the features. The following query then uses myclassifier to identify prospects among target customers, putting the prediction in the column isprospect:

select email, name, isprospect
from (targets classified with myclassifier(isprospect));

Alternatively, as with database systems, we can build graphical user interfaces to create the tasks and queries interactively.

Physical Evaluation

The physical evaluation provides concrete implementations for the specification abstractions — that is, the analytics objects and operators.

Let us consider the analytics objects first. How do we choose an implementation for, say, a classifier? There are a number of alternatives: decision trees, naive Bayes, and so on.

Ideally, given the configuration parameters and available data descriptions, the system should automatically identify which analytics implementation to use — but since this is a tough call, a more pragmatic approach is to have a default implementation, and provide interface parameters that enable the user to override this default.
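As a sketch of what such an override could look like (the syntax here is hypothetical, for illustration only, and not Sclera’s documented interface), the creation statement could simply name the desired algorithm alongside the other parameters.

-- hypothetical override syntax, for illustration only:
-- request a naive bayes model instead of the default decision tree
create classifier myclassifier(isinterested) using "naivebayes"
select survey.isinterested, customers.location, customers.salary
from survey join customers on (survey.custid = customers.id);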

Continuing our classifier example, the following figure shows the creation and training of the classifier using Sclera. Since the specification does not include an override, the classifier implementation is taken to be a decision tree.
[Figure: creating and training the classifier myclassifier in Sclera]

The analytics object implementations can be built from scratch, or taken from an off-the-shelf analytics library such as Weka or Apache Spark / MLlib, or even wrap over a cloud service such as Google Prediction API.

The analytics operations are evaluated using the methods provided by the object implementations. The evaluation may involve transforming the input data to the structure required by these methods, and transforming the result so that it can be included in the operator’s output.

The figure below illustrates the application of the trained classifier myclassifier in our example. The “classify” operator gets translated to myclassifier’s decision tree-based classification function. For each row of the input table targets containing target customers, this function uses the values in feature columns salary and location to compute a value denoting the customer’s potential interest, which is put in the new column isprospect.
[Figure: applying the trained classifier myclassifier to the targets table]

Virtualization

Separating analytics specification from evaluation gives us the choice to arbitrarily use one or more analytics engines — developed in-house, off-the-shelf, a cloud-based web-service, or a combination thereof — as implementations of the analytics operators. As long as the specification is not changed, the evaluation can also switch across the backends without affecting the application. We call this analytics virtualization.

The same applies to relational queries and statements. As long as the relational interface is maintained, the evaluation can be pushed across to one or more data platforms, relational or non-relational, without affecting the application. This is data virtualization, and is implemented by building drivers that expose the underlying data as tables and evaluate the relational operators and statements on these data platforms.

With data virtualization support, we can now remove our assumption that the data being analyzed is stored in a relational database. Since the data can be accessed using a virtual relational interface, it can actually reside across relational databases (e.g. MySQL, PostgreSQL), non-SQL databases (e.g. MongoDB, Apache HBase), HDFS, the local file system, or even a web service.
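For instance, once a PostgreSQL table and a MongoDB collection have been registered with the virtual relational layer (the registration itself is configuration, not shown here, and the table names below are hypothetical), a query joining the two reads like any other SQL join.

-- hypothetical setup: "orders" resides in PostgreSQL, "clicks" in MongoDB;
-- both are exposed as plain relational tables by the virtualization layer
select o.custid, count(*) as clickcount
from orders o join clicks c on (o.custid = c.custid)
group by o.custid;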

Conclusion

In this post, we outlined a principled, declarative approach towards building easier-to-use analytics platforms. The idea is to enable logical-physical separation, which has worked very well in the past in the context of database systems.

Several vendors have attempted to extend SQL with analytics capabilities in the past, including Oracle and, more recently, Metanautix. In Oracle’s solution, model creation is not integrated with SQL, and needs to be done in a PL/SQL routine. Metanautix’s approach is primarily imperative, in contrast to the declarative approach presented above.

The advantages of the declarative approach over these imperative alternatives will be the topic of another post.

Meanwhile, comments are welcome!