The companies succeeding in the digital age are often the ones that have improved their data integration, going beyond simply collecting and mining data. These enterprises integrate data from various, isolated silos into business intelligence that can drive vital decision making and improve internal processes.
Data integration isn’t easy, though, especially the larger your enterprise is and the more software systems it relies on. Then consider the reality of most enterprise architecture: it is rarely an intentional design and much more often a patchwork of apps and ecosystems, from legacy systems to brand-new tools. Some are built and owned in-house, while others rely on third-party vendors.
More and more, companies need to share data across all of these systems. The problem is how difficult sharing becomes when each system has its own language, requirements, and protocols – and when you consider the many permutations of one system talking to another.
What does TileDB bring to the table?
Unlike most database CEOs, TileDB founder Stavros Papadopoulos comes from the scientific community, not the technology industry. What eventually became TileDB originated in yet another Michael Stonebraker MIT project, SciDB, a database engine suited to research scientists because of its array structure, now commercialized as Paradigm4. Because the data is not force-fit into columns and rows, an array can represent almost any kind of data structure — and commercially it has been used to build multi-dimensional arrays that bear some resemblance to the early generation of denormalized MOLAP databases.
But Papadopoulos identified one key drawback to SciDB: it could not handle data sparsity very well. Sparsity arises when many cells are empty or null, a scenario that is quite common for genomic data sets focused on how species or individuals differ from one another; across the human genome, the typical deviation between people is barely 0.1%. In theory you could store all the redundant data, but that would be a huge waste of resources, so most genomic data sets end up highly sparse.
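To make the waste concrete, a sparse layout stores only the non-empty cells as coordinate–value pairs rather than materializing every cell. A minimal sketch of the idea in Python (an illustration only, not TileDB's actual on-disk format):

```python
# Sketch of coordinate-format sparse storage: keep only non-empty cells.
# Illustration only; TileDB's real format is far more sophisticated.

dense = [
    [0, 0, 0, 7],
    [0, 0, 0, 0],
    [0, 5, 0, 0],
]

# Dense storage materializes every cell, including the empty ones.
dense_cells = sum(len(row) for row in dense)  # 12 cells

# Sparse storage keeps only (row, col) -> value for non-empty cells.
sparse = {(r, c): v
          for r, row in enumerate(dense)
          for c, v in enumerate(row) if v != 0}

print(len(sparse))  # 2 cells instead of 12
```

With a real genomic data set, where well over 99% of cells would be redundant, the savings dominate the storage cost.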
TileDB has the following features:
- Cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage)
- Tiling (i.e., chunking) for fast slicing
- Multiple compression, encryption and checksum filters
- Fully multi-threaded implementation
- Parallel IO
- Data versioning (rapid updates, time traveling)
- Array metadata
- Array groups
- Embeddable C++ library
- Numerous integrations (Spark, Dask, MariaDB, GDAL, etc.)
Canonical data models aim to present data entities and relationships in the simplest possible form in order to integrate processes across various systems and databases. More often than not, the data exchanged across those systems relies on different languages, syntax, and protocols. A canonical data model (CDM) is also known as a common data model.
How does a canonical data model work?
The purpose of a CDM is to let an enterprise create and distribute a common definition of all of its data entities. This allows smoother integration between systems, which can improve processes and also makes data mining easier.
Importantly, a canonical data model is not a merge of all existing data models. Instead, it is a new model, distinct from the connected systems, that must be able to contain and translate their data. For instance, when one system needs to send data to another system, it first translates its data into the standard syntax (a canonical, or common, format), which need not match the syntax or protocol of the receiving system. When the second system receives the data, it translates the canonical format into its own data format.
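That translation step can be sketched in a few lines of Python. The system names, field names, and the shape of the canonical record below are all hypothetical, invented purely for illustration:

```python
# Hedged sketch: two systems with different customer formats exchange
# data through a hypothetical canonical model, so neither needs to know
# the other's syntax. All names here are invented for illustration.
from dataclasses import dataclass


@dataclass
class CanonicalCustomer:
    """The canonical (common) format every system translates to/from."""
    customer_id: str
    full_name: str
    email: str


def crm_to_canonical(record: dict) -> CanonicalCustomer:
    """Translate the (hypothetical) CRM's format into the canonical model."""
    return CanonicalCustomer(
        customer_id=record["custId"],
        full_name=f"{record['firstName']} {record['lastName']}",
        email=record["emailAddr"],
    )


def canonical_to_billing(c: CanonicalCustomer) -> dict:
    """Translate the canonical model into the (hypothetical) billing format."""
    return {"id": c.customer_id, "name": c.full_name, "contact_email": c.email}


crm_record = {"custId": "C-42", "firstName": "Ada", "lastName": "Lovelace",
              "emailAddr": "ada@example.com"}
billing_record = canonical_to_billing(crm_to_canonical(crm_record))
print(billing_record)
# {'id': 'C-42', 'name': 'Ada Lovelace', 'contact_email': 'ada@example.com'}
```

Note the payoff: with N systems, each needs only one translator pair to and from the canonical model, rather than a bespoke translator for every other system.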
A CDM approach can and should encompass any technology the enterprise uses, including ESB (enterprise service bus) and BPM (business process management) platforms, other SOA (service-oriented architecture) components, and any range of more specific tools and applications. In its most extreme form, a canonical approach would mean having a single definition of person, customer, order, product, and so on, each with a set of IDs, attributes, and associations that the entire enterprise can agree upon.
By employing a CDM, you are taking a canonical approach in which every application translates its data into a single, common model that all other applications also understand. This standardization is good for everyone in the company, including non-technical staff: the time it would otherwise take to translate data between systems is time better spent on other projects.