Stability of (G)AI needs a mathematical underpinning for data engineering!
The math for hygienic data management already exists, though database theory is yet to catch up. AI, let alone GAI, cannot scale without a stable foundation of mathematical certainty in data quality.
Summary: Category Theory (CT), a branch of mathematics for information structuring, and its successful application to data management have existed for some time in scientific and academic circles. These new foundations lend mathematical properties (rigour, composability, identity, etc.) to data management, useful for managing and transforming rapidly changing data, metadata, contextual relationships and lineage, bringing mathematical formality and certainty to moving data hygienically with assured quality. This paves the way for disruptive models such as data model abstraction, mathematically guaranteed unique denormalization paths, zero-ETL, and real-time transactional data integration, merging and transformation. It is an opportunity to rewrite data management approaches in ways that save time and effort and improve data quality and integrity for customers, while bridging legacy and modern (IoT, CV) enterprise data sources. Current data management approaches lack this formality because they have no underlying math, and they do not offer the flexibility, composability and reusability this model enables.
Extremely Brief Intro to Category Theory:
Category Theory (CT), a branch of mathematics, is a theory of structures and systems of structures. CT is effective at structure representation, preservation, transformation and porting, and at knowledge modelling and context (knowledge of knowledge) capture; it has been used successfully in scientific and mathematical fields for decades. CT is based on abstractions: structures are represented and processed in abstract terms (placeholders, as opposed to the values held by the placeholders). This lends Category Theory interesting properties, such as structure transformation with mathematical guarantees for relationships, integrity of relationships and lineage across transformations, and limitless abstraction of abstractions. These properties are used to map and translate across seemingly disparate scientific and mathematical concepts, to discover hidden relationships, or simply to model highly complex, large-volume scientific relationships, as in research result synthesis and drug discovery, and even to unify shapes (graphs) with equations and free text. CT is the mathematical language of abstractions.
A category comprises 1/ constituents (a set of objects and the relationships among them, called paths, with path equivalences) and 2/ rules that the constituents must satisfy. This simple definition can span infinite levels of abstraction – categories of categories, etc.
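To make the definition concrete, here is a minimal, illustrative sketch in Python. It is not a library API, and the names (Employee, Department, worksIn) are hypothetical: it shows objects, arrows between them, one identity arrow per object, and a composition rule defined only when endpoints match.

```python
# A minimal sketch of the definition above: a finite category has objects,
# arrows (paths) between them, identity arrows, and a composition rule.

class FiniteCategory:
    def __init__(self):
        self.objects = set()
        self.arrows = {}  # arrow name -> (source object, target object)

    def add_object(self, obj):
        self.objects.add(obj)
        self.arrows[f"id_{obj}"] = (obj, obj)  # identity arrow, one per object

    def add_arrow(self, name, source, target):
        assert source in self.objects and target in self.objects
        self.arrows[name] = (source, target)

    def compose(self, f, g):
        """Compose path f then g; defined only when target(f) == source(g)."""
        fs, ft = self.arrows[f]
        gs, gt = self.arrows[g]
        assert ft == gs, "arrows do not compose"
        return (fs, gt)  # the composite is itself an arrow fs -> gt

# Example: a two-object category with one non-identity arrow.
C = FiniteCategory()
C.add_object("Employee")
C.add_object("Department")
C.add_arrow("worksIn", "Employee", "Department")
print(C.compose("id_Employee", "worksIn"))  # ('Employee', 'Department')
```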
Category Theory for Data Management:
The category-theoretic approach to data was developed at the MIT Department of Mathematics (categorical algebra), and the superiority of this approach over prevalent methods is well documented.
Database schemas are categories (e.g., with relational entities as objects), irrespective of the database type – relational, graph, document, columnar, etc. When represented as categories, they inherit the properties of CT, becoming interchangeable, mappable, transformable and even composable across disparate data model types.
Instances (the data in a table following a schema) are also categorical: each instance is a structure-preserving mapping (a functor) from the schema category to concrete data. Since the relationships among the entities (fields) are encoded in the schema category, the relationship between the schema and each instance of data is also an encoded relationship, making it possible to enforce constraints at rest, at ingress and at egress.
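A hedged sketch of that idea, with hypothetical names: the schema is a category (objects plus one arrow), and an instance maps each object to a set of rows and each arrow to a function between them. Checking that every arrow is a total function into its target set is exactly foreign-key integrity.

```python
# Schema as a category: two objects and one arrow between them.
schema = {
    "objects": ["Employee", "Department"],
    "arrows": {"worksIn": ("Employee", "Department")},
}

# Instance as a functor: objects -> row sets, arrows -> functions on rows.
instance = {
    "Employee": {"e1", "e2"},
    "Department": {"d1"},
    "worksIn": {"e1": "d1", "e2": "d1"},  # the functor's action on the arrow
}

def check_instance(schema, instance):
    """An instance is valid only if each arrow is a total function
    from its source rows into its target rows (referential integrity)."""
    for arrow, (src, tgt) in schema["arrows"].items():
        fn = instance[arrow]
        for row in instance[src]:
            assert row in fn, f"{arrow} undefined on {row}"          # totality
            assert fn[row] in instance[tgt], f"dangling ref in {row}"  # integrity
    return True

print(check_instance(schema, instance))  # True: constraints hold at rest
```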
CT treats any transformation mathematically: any ETL, de/normalization, conversion, mapping, or even query is a transformation from the resting form, and therefore comes with mathematical certainty.
This yields interesting properties: structure-preserving operations on data with a 100% mathematical guarantee of data quality, provenance reliability and constraint integrity; interchangeability across data model types; and the ability for even IoT sensor data, ML/CV inferences or free text to inherit CT properties, including data model abstraction, composability and extensibility, among other, even wilder properties (details in the FAQ) that are cumbersome to achieve with traditional methods even at small scale.
Application of CT-based Data Management: The above are just the enabling features of CT. It is the application of this approach that supercharges the business impact, saving roughly 50% of data engineering time and effort.
1. Unifying (disparate) data sources: Automatically arrive at a mathematically guaranteed, unique merge of schemas (of the same or different database types) at the schema level, verified at the data (row, cell) level. This is effective for unifying complex, large-scale, deeply nested database schemas and for keeping track of rapidly changing data sources, versions (of schemas and related data changes) and lineage. Related use cases include translation and porting of schemas across database types, and dynamic, evolving integration scenarios – e.g., causality discovered by an ML model can be mapped to existing data models on the fly. A minimal sketch of the merge idea follows.
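In CT terms this merge is computed as a pushout: two schemas are glued along a shared overlap, so the result is determined by the overlap mapping rather than hand-written reconciliation rules. The toy sketch below (hypothetical CRM/ERP names; only object names are glued, not arrows or constraints) mimics that gluing:

```python
# Hedged sketch of schema merge along a shared overlap (a pushout, in CT terms).

def merge_schemas(left, right, overlap):
    """overlap maps names in `left` to their equivalents in `right`;
    the merge keeps one copy of each shared object plus all the rest."""
    merged = set(right)                    # right's names survive as-is
    for obj in left:
        merged.add(overlap.get(obj, obj))  # shared objects collapse to one
    return merged

crm = {"Customer", "Order"}
erp = {"Client", "Invoice"}
shared = {"Customer": "Client"}  # assumption: Customer and Client coincide

print(sorted(merge_schemas(crm, erp, shared)))
# ['Client', 'Invoice', 'Order'] -- one merged model, no duplicated entity
```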
2. ETL: ETL definitions become reusable, composable and referenceable, like math functions. This makes the data engineer's task easier: for ETL definition (e.g., discover possible transformation paths before coding), for reuse and extension (compose existing ETLs into new ETLs like math functions), for porting (from proprietary ETL formats, e.g., Informatica to Glue), and for data quality assurance (ETL definitions validated programmatically against source data).
It is also possible to track data source changes and re-map the ETLs accordingly, making ETLs 'alive' and responsive to source data/schema changes.
It also makes in-transit Extract, Transform and direct Load possible. This paves the way for transactional ETL – live transformation of transactional data as operations happen, without a need to store data intermediately, even with streaming data.
Automation is also possible: e.g., given a target state, arrive at the best way to map multiple sources automatically.
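A minimal, hypothetical sketch of the "ETLs compose like math functions" point: once steps are pure mappings, existing definitions combine into new pipelines instead of being re-coded.

```python
# Hedged sketch: ETL steps as pure functions that compose and get reused.

def extract_names(rows):           # reusable step 1
    return [r["name"] for r in rows]

def normalize(names):              # reusable step 2
    return [n.strip().title() for n in names]

def compose(*steps):
    """Compose ETL steps left-to-right into a new, reusable ETL."""
    def pipeline(data):
        for step in steps:
            data = step(data)
        return data
    return pipeline

clean_names = compose(extract_names, normalize)    # a new ETL, built by composition
print(clean_names([{"name": "  ada lovelace "}]))  # ['Ada Lovelace']
```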
3. Database / DW Migration to Cloud: Migration challenges (null values, data that doesn't comply with constraints, etc.) are handled elegantly with the CT-based approach, since the entire activity is done at the abstracted (cell, not value) level. Data quality – constraint compliance and data integrity – can be checked automatically before and after migration, ensuring all entities (e.g., customer, vendor, contract) remain reliable after migration. The guarantee of data quality, with provenance at the data, schema and metadata levels, can be logged – e.g., for bank regulators. This makes DB/DW migration to any cloud faster and 100% error-free.
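One way to picture the "cell, not value" abstraction for nulls is a labeled null: a placeholder cell with a stable identity, so two references to the same unknown stay equal through migration. A hedged, hypothetical sketch:

```python
# Hedged sketch: unknown cells as labeled nulls (placeholders with identity),
# so migration can move and later resolve them without breaking integrity.

class LabeledNull:
    """A placeholder cell: value unknown, identity stable."""
    _count = 0
    def __init__(self):
        LabeledNull._count += 1
        self.label = f"?{LabeledNull._count}"
    def __repr__(self):
        return self.label

dept = LabeledNull()                  # one unknown department, one identity
rows = [{"emp": "e1", "dept": dept},
        {"emp": "e2", "dept": dept}]  # the same unknown in both rows

# After moving the rows, both cells still refer to one placeholder:
print(rows[0]["dept"] is rows[1]["dept"])  # True -- integrity preserved
```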
4. Data Lake – automated hidden-relationship discovery and mapping: Creating accurate, mathematically guaranteed, lossless de-normalization/normalization across a large number of datasets automatically becomes possible. Given existing schemas and path equivalences, this approach can arrive at a unique mapping across all data models in a data lake automatically and discover hidden relationships. Data lakes, notorious for being dumps of data without usable mappings, can instantly be made usable by automatically mapping all (or a subset of) datasets, while automatically tracking changes to sources, ETLs and versions.
5. Creating data frames for ML: Given feature definitions (in terms of data fields) as inputs, automatically create data frames mapping the available source data, schemas and constraints with mathematical certainty. Proactively see which features can or cannot be materialized from the available sources, and why, with guaranteed data quality and fidelity at the data level (row and cell), not just the schema level. Due to built-in provenance, one can even trace which dataset/row/cell did or did not work in ML training datasets, without manual tracing. The CT-based approach also provides the ability to switch between data model, data and provenance inherently.
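A hedged sketch of the materialization check, with hypothetical source and feature names: given features declared in terms of source fields, report which can be built and trace each blocked feature back to exactly what is missing.

```python
# Hedged sketch: plan feature materialization against available sources,
# with per-feature provenance of what is missing and why.

available = {"orders": {"customer_id", "amount"},
             "visits": {"customer_id", "page"}}

features = {"avg_spend":  {"orders.amount"},
            "churn_risk": {"orders.amount", "crm.tenure"}}  # crm is absent

def plan_features(features, available):
    report = {}
    for feat, needs in features.items():
        missing = []
        for ref in needs:
            table, field = ref.split(".")
            if field not in available.get(table, set()):
                missing.append(ref)
        report[feat] = ("ok", sorted(needs)) if not missing else ("blocked", sorted(missing))
    return report

for feat, (status, detail) in plan_features(features, available).items():
    print(feat, status, detail)
# avg_spend ok ['orders.amount']
# churn_risk blocked ['crm.tenure']  -- and we know exactly why
```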
6. Business Metadata Abstraction: Beyond simple cataloguing, layers of metadata defined independently by business/functional users can create and continuously evolve the business' own version of data definitions, in business language, attached to but independent of the underlying technical data models – and these can be composed, queried and shared with other users or organizations, subject to, e.g., data-zone enforcement. This will democratize data usage by business users.
7. Context Fabric / Continuous Context Capture: Business context – data sources, relationships, internal/external factors influencing business outcomes, etc. – changes rapidly and unpredictably in the real world, especially in business environments with external exposure, such as supply chains. Current models (e.g., Business Process) are rigid and grossly inadequate for capturing this reality. The category-theoretic approach enables continuous context capture with infinite, ad hoc data model composability, extensibility and flexibility. This opens up a powerful way of context modelling, with a foundation for capturing rapidly changing industry context – e.g., IoT streaming data mapped to real-life external events, or supply chain and manufacturing events, with continuous context capture from varied data sources including human experts, edge devices, IoT sensors and the results of ML/CV inferences.
This provides a foundation for knowledge modelling for rapidly changing industry contexts with distributed participation – e.g., sustainability and circularity, which cannot be managed with data from one enterprise but only from a large ecosystem or industry. It opens up an opportunity to create a data context platform for each industry, with a continuously, organically evolving data model involving industry players, regulators, statutory bodies, governments, etc.
This is the most impactful application for business outcomes.
8. Data Model Abstraction: Data models need not be locked inside databases. The CT-based approach gives a data model an independent existence, abstracting the underlying source schema and data and remaining independent of the technical implementation. This opens up a universal and elegant abstraction of any data model, unleashing several possibilities. The approach lays the foundation for extensions like data privacy, residency, regulatory compliance, reliability, lineage and data sanctity (constraint, metadata and context sanctity), which can be extended infinitely without logging into the databases and without depending on technical resources – e.g., a lawyer extending the legal metadata of a complex case on his/her own, or a country's data residency rules being enforced.
This enforces that 'structure-preserving' operations (e.g., CRUD, integration, migration, transformation) are performed off the 'data model' instead of directly on the 'data', ensuring such operations always result in data-model-compliant data at rest, during ingress and at egress. Using a mathematical base makes this more powerful, by assuring mathematical guarantees and accuracy for these operations. A minimal sketch follows.
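A hedged, hypothetical sketch of "operations off the model": every write passes through the abstracted data model, which validates before storing, so non-compliant data can never land.

```python
# Hedged sketch: CRUD goes through the model layer, never straight to storage,
# so every operation is structure-preserving by construction.

class DataModel:
    def __init__(self, required_fields):
        self.required = required_fields
        self.rows = []

    def insert(self, row):
        """Validate against the model first, store second."""
        missing = self.required - row.keys()
        if missing:
            raise ValueError(f"rejected: missing {sorted(missing)}")
        self.rows.append(row)

customers = DataModel(required_fields={"id", "country"})
customers.insert({"id": 1, "country": "DE"})   # compliant: accepted
try:
    customers.insert({"id": 2})                # non-compliant: rejected
except ValueError as e:
    print(e)  # rejected: missing ['country']
```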
9. Federated Querying: The 'join-less' property of CT can query tables/databases without joining them while keeping the context of cross-entity relationships. This enables query federation with awareness of relationships, making query results readily usable – unlike plain query federation, where results must be joined later for the overall relationship context – and creating insights without moving data, even with complex, nested relationships across distributed data sources.
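One reading of 'join-less' querying, sketched with hypothetical names: because the schema records the worksIn arrow, a query navigates employee-to-department with a dot-style path instead of an explicit SQL join, even when the target rows live in another source.

```python
# Hedged sketch: navigate schema arrows with a dot-path instead of a join.

employees = {"e1": {"name": "Ada", "worksIn": "d1"}}
departments = {"d1": {"name": "Research"}}   # could live in another database

def follow(entity, path, resolve):
    """Walk a dot-path like 'worksIn.name'; each arrow hop is a lookup,
    not a join."""
    for step in path.split("."):
        value = entity[step]
        entity = resolve.get(value, value)  # hop to the target entity if keyed
    return entity

print(follow(employees["e1"], "worksIn.name", departments))  # Research
```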
10. CT-based Database: There is an opportunity to create a new type of database built entirely on the CT approach, which would make not just data processing more efficient (as above) but also disrupt everything that touches a database: ORM, programming, parameter declarations while coding in any programming language, and standardized, unified, reusable declarations across development teams working on different projects within one organization, making the declared data objects usable globally.
Overall Benefits: ~50% faster, 100% accurate data transfer and transformation operations.
A recommended approach to developing Data Model Abstraction using CT is detailed here. This forms the basis for developing other applications.
FAQ:
How is this different from the open-source tool?
https://www.categoricaldata.net/ is an open-source tool for preliminary learning and exploration of CT-based data management. It cannot be scaled for real-life applications and cannot address the impactful use cases described above.
Is there a commercial version?
Yes. Conexus has delivered several successful commercial solutions and projects. Disclaimer: I briefly advised the Conexus founding team (Eric Daimler and Ryan Wisnesky) on product strategy.
What are the unique properties of the CT-based data management approach? More details here.
CT-based data management has these unique properties, unachievable in traditional DB approaches:
1. Data can coexist in multiple forms – e.g., instantly change from table to graph and back.
2. Abstraction – from cell, to cell placeholder, to table, to data model – with transparent switching between actual values and placeholders.
3. Null values – null values can exist and pose no challenge, thanks to the above abstraction.
4. Join-less operations (the dot operator) – query without using joins, making even deeply nested operations easy.
5. A rich language for constraints and relationships – hence codified context (e.g., cm-to-inches conversions) – extractable and composable, unlike stored procedures/triggers. Any programming language can be used.
6. Stored procedures/triggers can be extracted, rewritten, exported, reused and composed.
7. Data model translation (any data model to any other, relational-to-graph (RD-G) translation, existing maps, composable, externally referenceable).
8. Query – the result is not a table; it can be used directly in code, composed, versioned, ORM'ed, or used to declare variables in code directly.
9. Validation (source, transformation and destination constraints)
10. Inversions – mapping (transformation) inversions
11. Import / export CSV, SQL, JSON, RDF, XML, etc.
12. Schema mappings and queries are checked at compile time, not after materializing data.
13. Know in advance that landed data will satisfy all target constraints.
14. Incorporates any function from across JVM languages (JavaScript, Java, Python, etc.).
15. Arbitrary user-defined functions – e.g., edit distance between strings – can be specified and reasoned about, e.g., for database transformations.
16. Check, clean or repair data against rich data integrity constraints.
17. Repair using the Chase algorithm on existential Horn clauses (EDs) – see the sketch below.
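A hedged sketch of one chase step, with hypothetical tables: the ED "every employee's department must exist in Department" is violated, so repair invents the required row, filling its unknown attribute with a fresh labeled null.

```python
# Hedged sketch of one chase step firing an existential Horn clause (ED).

employee = [{"id": "e1", "dept": "d1"}, {"id": "e2", "dept": "d9"}]
department = {"d1": {"name": "Research"}}

def chase_step(employee, department):
    """For each violating row, add the required target row,
    filling unknown fields with a fresh labeled null."""
    fresh = 0
    for row in employee:
        if row["dept"] not in department:
            fresh += 1
            department[row["dept"]] = {"name": f"?n{fresh}"}  # labeled null
    return department

print(chase_step(employee, department))
# {'d1': {'name': 'Research'}, 'd9': {'name': '?n1'}} -- constraint now holds
```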
What are the benefits of this approach?
Data management, transformation and migration:
1. in a fraction of the time (~50%)
2. with zero errors
3. with 100% assured data sanctity – constraint compliance and integrity
4. with 100% traceability and provenance
5. with reuse of prior effort
What is the solution approach? Detailed explanation here.
Existing data models will be converted into CT-based data models. Open-source tools exist for this conversion. A software module will be developed to abstract these data models, their inter-relationships and constraints, along with a UI. Developers can continue to use the data models without knowledge of Category Theory, yet still use its properties.
Converting data models into category-theoretic data models bestows CT properties – structure preservation, structure transformation, knowledge modelling and continuous context capture – with mathematical guarantees for data quality and lineage, and 100% constraint compliance at rest, during ingress and at egress. A minimal sketch of the conversion step follows.
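A hedged, hypothetical sketch of that conversion (the input format below is assumed for illustration, not the actual format of any existing tool): read an existing relational model (tables plus foreign keys) and emit the corresponding schema category (objects plus arrows), after which the CT properties above apply.

```python
# Hedged sketch: derive a schema category from a relational model.

relational_model = {
    "tables": ["Employee", "Department"],
    "foreign_keys": [("Employee", "dept_id", "Department")],
}

def to_schema_category(model):
    """Tables become objects; each foreign key becomes an arrow."""
    return {
        "objects": list(model["tables"]),
        "arrows": {fk_col: (src, tgt)
                   for (src, fk_col, tgt) in model["foreign_keys"]},
    }

print(to_schema_category(relational_model))
# {'objects': ['Employee', 'Department'],
#  'arrows': {'dept_id': ('Employee', 'Department')}}
```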
How is it different from Set Theory or Graph Theory?
The CT-based approach can span infinite levels of abstraction (categories of categories, etc.), which distinguishes it from set theory and graph theory in scale, in the ability to process data in the abstract, and in the ability to use the strengths of CT programmatically in a database.
How is it different from a Graph Data Model? As above.
User Experience Sample:
Note: There are several groundbreaking applications based on the CT concepts described above that are yet to be created as ready-to-use tools for data engineers and data scientists, without forcing them to learn CT. The above concepts, applications and user experience examples are protected by intellectual property rights owned by Surendra Kancherla, barring those protected, owned or open-sourced by Conexus, CQL and their personnel.