Originally posted as a guest column on siliconANGLE
By Dave Duggal
The recent news about generative artificial intelligence is creating a lot of justifiable excitement around the new possibilities for human-computer interaction. However, as with past AI and machine learning hype cycles, realizing material benefits will be elusive if generative AI is treated as a magic wand, particularly in the context of enterprise-grade automation.
Although generative AI has already spawned much experimentation, most demonstrations focus on the technology’s ability to generate outputs — text, music, art and even code templates — as guided by human prompts. These outputs, while impressive in their own right, are generally understood to be first drafts, often with varying degrees of errors and omissions, such as an entry-level assistant might provide.
However, early generative AI automation prototypes tend to demonstrate simple if-this-then-that-style capabilities without reference to security, governance and compliance concerns. Complex enterprise and industrial systems must meet a far higher standard to automate processes safely: intricate use-case requirements, organizational policies and industry regulations that demand deep domain-specific knowledge and rules.
While AI and automation will continue to converge for increasingly intelligent solutions, the need for both unmodeled, probabilistic analytics and modeled, deterministic transactional capabilities will endure.
Generative AI and domain models: better together
Although some of the early hype from the deep learning space suggested that modeling is old-fashioned and now unnecessary, the limitations of generative AI and its brute-force Bayesian statistical analysis are already apparent. Leading AI expert Andrew Ng, founder and chief executive of Landing AI Inc., recently acknowledged this in an article promoting better data over more data for generative AI.
What is “better data”? It’s data that comes from a modeled domain — “labeled” or “prepared.” These two worlds are not at odds. It makes sense that good facts make for informed guessing. Likewise, the modeling world can leverage the power of generative AI to “bootstrap” domain modeling, which was already a semi-automated effort. It’s a sign of maturity that we are getting past the false binary stage and are now thinking of how old and new approaches are better together.
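To make that “bootstrapping” point concrete, here is a minimal sketch (not from the original column) of asking a generative model to propose a draft ontology from a few sample records, using the OpenAI Python client. The model name, prompt wording and record format are assumptions for illustration, and the output is only a first draft for domain experts to validate.

```python
# Hedged sketch: using a generative model to draft a starting ontology
# from sample records. The model choice, prompt wording and record format
# are illustrative assumptions; a human domain expert still reviews the draft.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

sample_records = [
    {"id": 101, "cust": "Acme Corp", "sku": "FX-220", "qty": 4, "region": "EMEA"},
    {"id": 102, "cust": "Globex", "sku": "FX-310", "qty": 1, "region": "APAC"},
]

prompt = (
    "Given these sample records, propose a draft domain ontology: "
    "candidate classes, attributes with types, and relationships. "
    f"Records: {sample_records}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)

# The result is a first draft, like the "entry-level assistant" output described
# above: a starting point for semi-automated domain modeling, not a finished model.
print(response.choices[0].message.content)
```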
Realizing better data
The DIKW pyramid (Data, Information, Knowledge, Wisdom) is known to any data analyst worth their salt. In effect, it’s a maturity model for data modeling. Each step up the pyramid supports higher-level reasoning about data and its use:
- A domain defined with classes and entities raises data to information – the formalization and validation of data.
- Higher-level domain knowledge expressed in the form of graph concepts, types and policies (i.e., ontology) raises information to knowledge – the relationships support analysis and insights.
- When knowledge is applied to optimize action, we achieve wisdom, the top of the DIKW pyramid — understanding activity in its domain context.
The effort to model domains rigorously creates clear, consistent, coherent domain understanding not only for humans, but also for machines. This is why “better data” improves generative AI results, as well as AI/ML and analytics.
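As a rough illustration of those DIKW steps, the following plain-Python sketch walks a toy order record up the pyramid. The domain class, graph edges and policy rule are all made-up assumptions for illustration.

```python
# Hedged sketch of the DIKW steps over a toy, made-up order domain.
# The class, fields, graph edges and policy rule are illustrative assumptions.
from dataclasses import dataclass

# Data: a raw, untyped record as it might arrive from an upstream system.
raw = {"order_id": "9042", "customer": "acme", "amount": "1200", "currency": "USD"}

# Information: the record is formalized and validated against a domain class.
@dataclass
class Order:
    order_id: int
    customer: str
    amount: float
    currency: str

order = Order(int(raw["order_id"]), raw["customer"], float(raw["amount"]), raw["currency"])

# Knowledge: relationships expressed as simple graph edges (a toy ontology).
graph = {
    ("acme", "located_in", "EU"),
    ("EU", "regulated_by", "GDPR"),
}

# Wisdom: domain knowledge applied to optimize action via a policy.
def requires_review(o: Order, edges: set) -> bool:
    in_regulated_region = (o.customer, "located_in", "EU") in edges
    return in_regulated_region and o.amount > 1000

print(requires_review(order, graph))  # True: route this order for compliance review
```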
Context is expensive
It makes sense that the closer data is to a specific application domain, the more precise its meaning in that domain’s vernacular and the higher its utility. Conversely, when data is aggregated outside its domain context, it loses meaning, making it less useful. To application developers, this is the conceptual foundation of domain-driven design, the microservice “database per service” pattern, and data meshes of federated domains.
Data architects dismiss these app-centric approaches as siloed, which is true, but the AppDev rationale is sound. Developers absolutely want data, but they need fast, efficient, secure access to relevant data that they can apply in the context of a specific application. They work tactically within technical constraints and latency budgets (i.e., how much I/O, data processing, analysis and transformation they can practically perform within a single human interaction of one to two seconds, or a single system interaction, generally subsecond). There is also a real financial cost to I/O- and compute-intensive applications in terms of network bandwidth and resource consumption.
So context is expensive. It’s worth noting that, historically, the centralized solutions promoted by data architects, such as master data management, data marts, data warehouses, big data and even data lakes, haven’t exactly been unqualified successes. Generally, central data is historic (not operational), the processing is batch (not real-time), and extracting relevant insights is difficult and time-consuming (inefficient). From a practical perspective, developers generally throttle data processing to stay within latency budgets and focus on local, application-specific parameters instead.
The tension between the perspectives of data architects and software developers is clear, but data and code are two sides of the same coin. The barrier to more data-driven applications is technical, not ideological. The focus should be on reducing the “cost” of real-time, contextualized data for application developers and making it easy for them to rapidly compose real-time, data-driven applications.
Beyond the data lake
The combination of generative AI, knowledge graphs and analytics can help data lakes avoid becoming data swamps. Together, they can provide useful abstractions that make it easier for developers to semantically navigate and leverage centralized data lakes. In effect, the knowledge graph brings domain knowledge (“better data”) to the large language model, allowing more precise analysis of the data lake.
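A minimal sketch of that pattern is below, with entirely hypothetical helpers (fetch_subgraph, generate_sql, run_on_lake) standing in for real product APIs: the knowledge graph supplies domain context before the LLM drafts a query over the lake.

```python
# Hedged sketch: a knowledge graph supplies domain context ("better data")
# before an LLM is asked to draft a query over the data lake.
# fetch_subgraph, generate_sql and run_on_lake are hypothetical helpers,
# not part of any specific product's API.

def fetch_subgraph(graph: dict, concept: str) -> dict:
    """Return the slice of the knowledge graph relevant to a business concept."""
    return {k: v for k, v in graph.items() if concept in k}

def generate_sql(question: str, domain_context: dict) -> str:
    """Placeholder for an LLM call that drafts SQL, grounded in domain context."""
    # In practice this would call a model with the question plus the ontology
    # terms, table mappings and policies pulled from the knowledge graph.
    return f"-- SQL drafted for: {question} using context {list(domain_context)}"

def run_on_lake(sql: str) -> list:
    """Placeholder for executing the drafted query against the data lake."""
    return []

knowledge_graph = {
    ("churn", "defined_as"): "no purchase in 90 days",
    ("churn", "source_table"): "lake.sales.orders",
}

context = fetch_subgraph(knowledge_graph, "churn")
sql = generate_sql("Which enterprise accounts are at risk of churn?", context)
results = run_on_lake(sql)
```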
One company addressing the technical challenge of the data value chain is Snowflake Inc., which provides a data lake and is now layering on value-added capabilities, including LLMs, knowledge graphs and assorted analytics tools, as discrete offerings exposed in its marketplace. In addition, Snowflake’s Snowpark developer environment supports DataFrame-style programming, allowing developers to add their own user-defined functions in the language of their choice.
As Dave Vellante and George Gilbert reported during “Data Week” last week, this will be a boon to developers, who can use Snowflake’s Snowpark to containerize precompiled functions and algorithms that run locally over historic data and streaming operational data aggregated in Snowflake. These “data apps” benefit from reasoning directly over the data lake, with common security and governance. In essence, Snowflake is reviving old database techniques (e.g., stored procedures; user-defined functions; and create, read, update and delete, or CRUD, events) for its data cloud to drive local processing and increase its utility as a data platform.
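For a flavor of that DataFrame-plus-UDF style, here is a minimal sketch using Snowpark for Python. The connection parameters, the ORDERS table and the scoring logic are illustrative assumptions rather than a reference implementation; the point is simply that the function is registered and executed next to the data.

```python
# Hedged sketch of DataFrame-style programming with a user-defined function
# in Snowpark for Python. Connection parameters, the ORDERS table and the
# scoring logic are illustrative assumptions, not a reference implementation.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import FloatType

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<db>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# A hypothetical scoring UDF that runs next to the data instead of pulling it out.
@udf(name="risk_score", return_type=FloatType(), input_types=[FloatType()], replace=True)
def risk_score(amount: float) -> float:
    return min(amount / 10000.0, 1.0)

orders = session.table("ORDERS")  # assumed table in the data cloud
scored = orders.with_column("RISK", risk_score(col("AMOUNT")))
scored.filter(col("RISK") > 0.8).show()
```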
Although all this cloud-based data processing won’t come cheap, it does represent a leap in the scope, ease and performance of traditional online analytical processing, allowing deep insights to be extracted in real time, which can then drive recommendations and dashboards, as well as trigger notifications and actions to external people and systems.
The industry has come full circle. Enterprise data, once centralized in the mainframe and then fragmented by distributed systems, the internet and now the edge, is being aggregated and disambiguated in the cloud, with near-infinite compute and storage capacity and high-speed networking. It presents a powerful new data backplane for all applications.
Snowflake, along with its competitors, is now looking up to the application layer. The application layer is the strategic high ground because it’s where data can optimize business automation and user experiences.
To the application layer and beyond!
It’s still up to developers to exploit business intelligence in web and mobile applications, as well as in their more complex business applications and operational processes. Snowflake’s containers and data app store are interesting, but they’re not a programming model or development platform for software engineers.
Rather than having to discover and manually integrate purpose-specific data apps from an app store, it would be ideal if developers could programmatically configure a data lake provider’s native data services (e.g., LLMs, AI/ML, graph, time-series, and cost- and location-based analytics) directly from their application domain so they can precisely tune queries, functions, algorithms and analytics for their use cases.
This is the best of both worlds! It’s convenient developer access to a one-stop shop of aggregated enterprise data — historic, real-time, batch, streaming, structured, semistructured and unstructured — that reduces the “cost” of context so developers can efficiently exploit data for more intelligent solutions.
This proposal brings applications closer to data, while maintaining the separation of concerns and avoiding tight coupling. In this approach, a knowledge graph, representing relevant domain knowledge, provides an abstraction for composing data and behavior for real-time, data-driven applications.
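A purely hypothetical sketch of what that could look like in practice follows; DataLakeProvider and its configure_service method are invented for illustration and do not refer to any real product API.

```python
# Hedged, purely hypothetical sketch of the proposal above: an application's
# domain model (a knowledge graph) programmatically configuring a data lake
# provider's native services. DataLakeProvider and configure_service are
# invented names that only illustrate the shape of the idea.

class DataLakeProvider:
    """Stand-in for a provider exposing native LLM, graph and analytics services."""
    def configure_service(self, kind: str, **settings) -> str:
        return f"configured {kind} with {settings}"

domain_graph = {
    "concept": "network_fault",
    "sources": ["lake.telemetry.alarms", "lake.inventory.devices"],
    "policy": "mask customer identifiers",
}

provider = DataLakeProvider()

# The application tunes each native service from its own domain terms,
# instead of hand-integrating purpose-specific data apps from a marketplace.
print(provider.configure_service("llm", grounding=domain_graph["concept"],
                                 guardrail=domain_graph["policy"]))
print(provider.configure_service("time_series", tables=domain_graph["sources"],
                                 window="5m"))
```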
Dave Duggal is founder and CEO of EnterpriseWeb LLC. The company offers a no-code integration and automation platform that’s built around a graph knowledge base. The graph provides shared domain semantics, metadata and state information to support discovery, composition, integration, orchestration, configuration and workflow automation. Duggal wrote this article for SiliconANGLE.