One of the first things we think of when looking at business requirements is “what data are we going to operate on and what tables/attributes do we need to create?”
Understanding our data requirements is necessary and forms the core of Domain Driven Design. The “issue” is the way we leap to the conclusion that we need to create tables/attributes to store our data.
We have been told for decades that we should use techniques like XML to pass data around our enterprise solutions. So, what if we also stored the data in XML format to make our Data at Rest services data agnostic?
Consider our previous example of an Insurance Policy service. Instead of creating separate tables for each of our concepts, we could create one table to store all our instance data agnostically in XML format. This table would contain a single XML data field to store the instance data. It would also have a set of meta-data fields that describe what that data is and where it resides with respect to other data records. Using this approach, we are still able to persist our data, but we have significantly reduced the number of stored procedures and business object functions, since we now only have a single table to maintain. By centralising data access, we can also extend the CRUD functionality into the CRUD+ realm by adding version control, access control, and auditing.
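As a rough sketch of this idea (table and column names here are illustrative assumptions, and SQLite stands in for the real database server), a single Instances table with an XML data field plus a generic save routine might look like:

```python
# Hypothetical sketch of a single data-agnostic Instances table.
# All names are illustrative, not a prescribed schema.
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Instances (
        InstanceGUID TEXT PRIMARY KEY,  -- unique identifier for this record
        ConceptGUID  TEXT NOT NULL,     -- what kind of record this is (e.g. Policy)
        ParentGUID   TEXT,              -- parent record, if any
        XmlData      TEXT NOT NULL      -- the instance data itself, as XML
    )
""")

def save_instance(concept_guid, xml_data, parent_guid=None):
    """Stand-in for a generic Save Instance stored procedure."""
    instance_guid = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO Instances VALUES (?, ?, ?, ?)",
        (instance_guid, concept_guid, parent_guid, xml_data),
    )
    return instance_guid

POLICY_CONCEPT = str(uuid.uuid4())  # illustrative Concept GUID for "Policy"
policy_guid = save_instance(POLICY_CONCEPT, "<Policy><Number>P-001</Number></Policy>")
row = conn.execute(
    "SELECT XmlData FROM Instances WHERE InstanceGUID = ?", (policy_guid,)
).fetchone()
```

The same save routine serves every concept in the solution, which is where the reduction in stored procedures and business object functions comes from.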
One of the major benefits of the data agnostic approach to building database repositories is our ability to go way beyond basic CRUD functionality to CRUD+ with very little software.
Now we should consider some of the key structural relationships between records within our databases. One of the big ones is the relationship between a parent record and its children. This is normally represented by a one-to-many relationship between two entity types or tables in a traditional database design. The “child” record will include a foreign key which will be the unique identifier for the parent record.
In our Insurance Policy example, we can consider the Policy to be a parent record for child record types such as Coverages.
We could add an extra field called Parent GUID to our metadata that tracks the relationships between the child record and its parent. We could then simply ask for all Coverages for a specified Policy by providing a Policy Instance GUID as the Parent GUID, and the Concept GUID of a Coverage.
An obvious database performance optimization is then to create a secondary index on Parent GUID and Concept GUID. When you look at how most users navigate a data model through their user experience, most of the requests for data are for either a record using a known Instance GUID, or for a group of child records when you have the Parent GUID and Concept GUID. This makes basic navigation of the records very efficient.
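To illustrate (schema and names are assumptions, with SQLite standing in for the real server), the secondary index serves the "all children of this type for this parent" request directly:

```python
# Illustrative sketch: a secondary index on (ParentGUID, ConceptGUID)
# makes parent/concept navigation queries efficient.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Instances (
        InstanceGUID TEXT PRIMARY KEY,
        ConceptGUID  TEXT NOT NULL,
        ParentGUID   TEXT,
        XmlData      TEXT NOT NULL
    )
""")
# The secondary index that serves parent/concept navigation requests.
conn.execute("CREATE INDEX IX_Parent_Concept ON Instances (ParentGUID, ConceptGUID)")

rows = [
    ("P1", "Policy",   None, "<Policy/>"),
    ("C1", "Coverage", "P1", "<Coverage/>"),
    ("C2", "Coverage", "P1", "<Coverage/>"),
    ("X1", "Claim",    "P1", "<Claim/>"),
]
conn.executemany("INSERT INTO Instances VALUES (?, ?, ?, ?)", rows)

def children_of(parent_guid, concept_guid):
    """All child records of one concept type for a given parent."""
    cur = conn.execute(
        "SELECT InstanceGUID FROM Instances "
        "WHERE ParentGUID = ? AND ConceptGUID = ? ORDER BY InstanceGUID",
        (parent_guid, concept_guid),
    )
    return [r[0] for r in cur]

coverages = children_of("P1", "Coverage")
```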
We can extend our metadata to include a Hierarchy field, which is a “/” separated list of Instance GUIDs showing where a record exists in the total data model. For example, let us consider a set of related records with Instance GUIDs abbreviated as A, B, C, D, E. If A is the root of the data, then its hierarchy value is “/”. If B and C are children of A, then their hierarchy values would be “/A/”. If D were a child of C, then its hierarchy would be “/A/C/”, and if E were the child of D, its hierarchy value would be “/A/C/D/”. This opens up a more efficient way of searching the data set: we can ask for all children of A in a single query, or for all children of A of a specific type.
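The hierarchy idea can be sketched in a few lines. In SQL this would be a single `WHERE Hierarchy LIKE '/A/%'` predicate; here is an equivalent prefix match in Python, using the single-letter GUID abbreviations from the example above:

```python
# Sketch of the Hierarchy field: each record stores the "/" separated
# path of its ancestors' Instance GUIDs, so "all descendants of X"
# becomes a single prefix match.
records = {
    # InstanceGUID: Hierarchy (path of ancestors)
    "A": "/",
    "B": "/A/",
    "C": "/A/",
    "D": "/A/C/",
    "E": "/A/C/D/",
}

def descendants_of(instance_guid):
    """All records whose hierarchy path passes through instance_guid."""
    # In SQL: WHERE Hierarchy LIKE <parent hierarchy> || <guid> || '/%'
    prefix = records[instance_guid] + instance_guid + "/"
    return sorted(g for g, h in records.items() if h.startswith(prefix))
```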
This ability to extract hierarchical XML data sets for a specified instance in one simple query enables us to construct an arbitrarily complex document using XSLT.
We could also export an XML data set from the Data at Rest service, version it and then import it into another application.
There are lots of other search criteria that will ultimately need to be provided; we discuss those in the Command Query Responsibility Segregation (CQRS) Model section below.
Versioning and Audit Trail
Increasingly, we are being asked to add versioning of our data and audit trails. In a traditional approach, we would end up creating a history table for each of our database tables. So, for the Policy table there would be a Policy History table, for the Coverage table we would have a Coverage History table, and so on...
Using our data agnostic Instances table as a base we can now extend our model to include an Instance History table.
Now whenever we write a record to the Instances table using our Save Instance stored procedure, we also write a copy of the data to the Instance History table.
The basic idea is that we only keep the latest version of the record in the Instances table, and all versions (including the current version) are available in the Instance History table.
Experience with adding audit trails in real systems has led us to expand the metadata to include the following fields:
- Data Version – A sequential numeric field for the version number starting at 1
- Created By/On – a timestamp for when the first version of the record was created and who created it
- Inserted By/On – a timestamp for when this version of the record was added and who added it
- Last Updated – a timestamp for when the record or any of its descendants was last updated
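A minimal in-memory sketch of the Instances / Instance History pattern, with field names following the list above and everything else assumed:

```python
# Hedged sketch: the latest version lives in `instances`, and every
# version (including the current one) is appended to `history`.
from datetime import datetime, timezone

instances = {}   # InstanceGUID -> latest record only
history = []     # every version ever written

def save_instance(instance_guid, xml_data, user):
    """Stand-in for a Save Instance stored procedure with versioning."""
    now = datetime.now(timezone.utc).isoformat()
    previous = instances.get(instance_guid)
    record = {
        "InstanceGUID": instance_guid,
        "XmlData": xml_data,
        # Data Version starts at 1 and increments on every save.
        "DataVersion": 1 if previous is None else previous["DataVersion"] + 1,
        # Created By/On is fixed at the first version.
        "CreatedBy": user if previous is None else previous["CreatedBy"],
        "CreatedOn": now if previous is None else previous["CreatedOn"],
        # Inserted By/On records who wrote this particular version.
        "InsertedBy": user,
        "InsertedOn": now,
    }
    instances[instance_guid] = record   # only the latest version here
    history.append(dict(record))        # all versions kept here
    return record["DataVersion"]

save_instance("P1", "<Policy><Status>Draft</Status></Policy>", "alice")
save_instance("P1", "<Policy><Status>Active</Status></Policy>", "bob")
```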
Security and Privacy
The General Data Protection Regulation (GDPR) came into effect in May 2018 and applies to ANY EU Citizen whose data is being collected or processed in ANY part of the world. GDPR is seen as a gold standard for privacy and is being adopted to some degree in many countries that wish to do business with the EU.
A key feature of the GDPR is that it imposes new constraints on applications and technology to ensure the privacy of personal data. Going forward, our business solutions need to be designed to restrict who has access to personal data, and audit access to that data.
This requires new business solutions to enforce our corporate privacy policies and audit the fact that they have. An essential first step in doing this is to have row level security built in, so that visibility of records can be limited to specific users or groups of users.
With our traditional approach, we would need to add some form of mapping table for each concept to be secured, along with associated stored procedures, business objects and user experience to maintain it.
With our data agnostic approach, we can simply add an Instance Access table to centralise row level access control. There are two key concepts to consider here:
- User – user records will represent a person who will have access to the data
- Group – group records will define a set of users that all have the same access to the data
The Data at Rest zone will have services that enable the creation of Users and Groups. These services will also manage which users are assigned to each group.
The Instance Access table now just needs to map each Instance GUID to a User Group GUID that can represent either a User GUID (to grant a specified User access) or a Group GUID (to grant access to a group of users).
We would make the combination of Instance GUID and User Group GUID the primary key of the Instance Access table; immediately, we can secure any record in the database to any set of users and groups using a centralised management service.
Since users can only Create, Read, Update and Delete instance data through the provided stored procedures, we can now add a simple validation check to ensure that the requesting user has access to the record. People attempting to access records without a valid authorisation row in the Instance Access table would be immediately rejected by the Data at Rest services.
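A minimal sketch of that validation check, with all table structures and GUIDs assumed: an instance is visible to a user if the Instance Access table maps it either to the user's own GUID or to a group the user belongs to.

```python
# Hypothetical sketch of row-level security via an Instance Access table.
instance_access = {
    # InstanceGUID -> set of User Group GUIDs (a User GUID or a Group GUID)
    "P1": {"group-claims"},
    "P2": {"user-alice"},
}
group_members = {
    # GroupGUID -> set of member User GUIDs
    "group-claims": {"user-alice", "user-bob"},
}

def can_access(user_guid, instance_guid):
    """Validation check run inside every CRUD stored procedure."""
    granted = instance_access.get(instance_guid, set())
    if user_guid in granted:   # direct user grant
        return True
    # Otherwise, grant applies if the user is in any granted group.
    return any(user_guid in group_members.get(g, set()) for g in granted)
```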
Virtual Deletion of Instance Data
The CRUD+ approach lets us implement a virtual record deletion mechanism: records that we wish to delete are simply moved into a parallel set of “Deleted” tables.
An interesting feature of this approach is how it deals with parent-child relationships as outlined above. If we again consider the A, B, C, D, E example, then descendants of A cannot exist if we delete A. This implies that if I want to delete A then I also must delete B, C, D and E. No problem there – I can use the hierarchy field to immediately identify all descendants of A for deletion. But what if the user isn’t allowed to see D, or isn’t allowed to delete D given its current state? Then we should block the deletion of A!
That’s why we want to introduce a “Delete Request Validation” step that will evaluate the complete set of records to be deleted and then assess the user’s authorisation and the individual Instance Status to determine whether the user can delete each of the records. If any record cannot be deleted by the user at this time, the deletion will be blocked and the user will be informed.
Assuming the records can be “deleted” they are simply moved to the corresponding “Deleted” tables. This keeps them out of the way when we want to process live data, but maintains the audit trails, access rights and instance data of records that have been logically deleted. Such records can also be restored later by simply moving them back into the live tables.
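The whole flow, identifying descendants via the hierarchy field, validating the request, then moving or blocking, can be sketched as follows (structures and field names are assumptions):

```python
# Sketch of virtual deletion with Delete Request Validation: the user must
# be allowed to delete the root record AND every descendant, otherwise the
# whole request is blocked.
instances = {
    "A": {"Hierarchy": "/",     "DeletableBy": {"alice", "bob"}},
    "C": {"Hierarchy": "/A/",   "DeletableBy": {"alice", "bob"}},
    "D": {"Hierarchy": "/A/C/", "DeletableBy": {"alice"}},
}
deleted_instances = {}   # the parallel "Deleted" store

def delete_instance(user, root_guid):
    root = instances[root_guid]
    prefix = root["Hierarchy"] + root_guid + "/"
    to_delete = [root_guid] + [
        g for g, r in instances.items() if r["Hierarchy"].startswith(prefix)
    ]
    # Delete Request Validation: every record must be deletable by this user.
    if any(user not in instances[g]["DeletableBy"] for g in to_delete):
        return False                    # block the whole deletion
    for g in to_delete:                 # "delete" = move to the parallel store
        deleted_instances[g] = instances.pop(g)
    return True
```

Restoring a record is the same move in reverse.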
We can also add a policy that permanently archives or removes these logically deleted records, based on the needs of the business.
Complex User Defined Fields
Since we are storing data in XML, we are not limited to simple data fields. We can also introduce complex user-defined field types for compound fields like addresses. For example, our Policy instance may need to record the Home Address of the Policy Holder.
Traditional table approach using columns of simple types:

    Home_Street1 = "177 Test Street"
    Home_Street2 = '' or NULL
    Home_City = "Salford"
    Home_State = '' or NULL
    Home_Country = '' or NULL
    Home_PostalCode = '' or NULL

Data Agnostic Services (XML fragment):

    <HomeAddress>
      <Street>177 Test Street</Street>
      <City>Salford</City>
    </HomeAddress>
Although both representations encapsulate the same information, the XML fragment can lead to simpler code in the Data Presentation, Data in Motion and Data in Action services where it is produced and consumed. Use of standardised XML fragments can also promote the creation of reusable code for all services which need to deal with complex data types. This is especially useful in the development of services using the Domain Driven Design approach, where a service can extract the elements it needs to process from the XML and leave the rest of the fragment unchanged.
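For example, a service that only owns the City element can update it with a standard XML library and re-serialise the fragment with everything else untouched (element names here are illustrative):

```python
# Sketch of the "take what you need, leave the rest" idea using the
# standard library's ElementTree.
import xml.etree.ElementTree as ET

fragment = ("<HomeAddress>"
            "<Street>177 Test Street</Street>"
            "<City>Salford</City>"
            "</HomeAddress>")
root = ET.fromstring(fragment)

root.find("City").text = "Manchester"   # the only element this service owns
updated = ET.tostring(root, encoding="unicode")
```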
Security Model for Data at Rest Repository
We need a simple security model to protect Data at Rest Repositories. Within a single Data at Rest Repository there are two root level Instance concept types:
- User profile – which represents an individual person
- Workspace – which represents a security boundary
A single Data at Rest Repository can have multiple User Profiles (one for each person with access to the solution) and multiple Workspaces (to protect actual Instance data).
Each Data at Rest Repository will contain a “Server Owner” Workspace which contains all the configuration data required by the solution. The Data at Rest Repository can also contain one or more “Customer” Workspaces to host the data that will be made available by the solution. The ability to host multiple “Customer” Workspaces was added to allow for multi-tenant deployments within a single Data at Rest Repository.
Every Instance of data created within a Workspace is contained within a compartmentalised security zone defined by the Workspace. This means that an Instance record created in Workspace A is never visible to Workspace B.
If necessary, Instance data from Workspace A can be shared with Workspace B using the Data in Motion components – this will preserve the integrity of the security model of the Workspaces. An interesting use-case for this scenario is creating Workspaces for each “Branch” of a business and then using Data in Motion to aggregate summary data into an “Executive” Workspace for reporting and analysis.
Access to Workspaces is granted to specific people by means of a User Profile. A User Profile uniquely identifies a person (or external system). The mechanism for generating a User Profile depends upon the security model chosen.
Having a User Profile GUID doesn’t directly grant access to Workspace Instance records. A person first must be invited to Join a Workspace. When the person has completed the join process, a User record is created for the person which references their User Profile GUID. The User record allows the person to access an authorised set of Instance records and functionality.
A person may join multiple Workspaces so that they have single sign on within the Data at Rest Repository. They may also join the same Workspace multiple times to access different sets of Instance records with different functionality. For example, the same person may typically access a Workspace using a “normal” user role, but on occasion may need to escalate to an administrative role. People may also be assigned different User privileges to access different operational roles, for example Customer Service and Human Resources.
The User security model is relatively simple and divides the authorisation into two distinct areas:
- What the user can see – assigning a User to a Group enables the User to access any records that are visible to the Group. A Team is always linked to a specific Group; assigning a User to a Team also assigns the User to the Group. Teams were introduced to allow companies to create a separate authorisation layer to deal with multi-tenant deployments within a Workspace.
- What the user can do – assigning a User to an Application lets the User access the functionality provided by that Application. Within an Application, a User’s functionality may be further restricted by the Roles that the User has been assigned.
The Instance records (User, Group, Team, Application and Role) are maintained as managed XML data. The mapping tables (User Groups, User Teams, User Apps, User Roles) are implemented as regular database tables for performance reasons.
To simplify the onboarding process for new Users, a Workspace User Administrator can define multiple “Invite User” templates that provide a default set of Groups, Teams, Applications and Roles for each User. When a User receives an invitation to join a workspace, they are provided with a GUID to the “Invite User” template. As they complete the join process, their new User record is provisioned according to the template. The Workspace User Administrator can also include a manual authentication step to check that only specific people are using the invitation to gain access to the Workspace.
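A hypothetical sketch of template-driven provisioning (all field and template names are assumptions):

```python
# Sketch: a new User record is stamped out from an "Invite User" template.
invite_templates = {
    "tmpl-claims-handler": {
        "Groups": ["Claims"],
        "Teams": ["Claims North"],
        "Applications": ["Claims Desktop"],
        "Roles": ["Handler"],
    },
}

def join_workspace(user_profile_guid, template_guid):
    """Create a User record provisioned according to the invite template."""
    template = invite_templates[template_guid]
    return {
        "UserProfileGUID": user_profile_guid,   # links back to the person
        # Copy the template's defaults into the new User record.
        **{key: list(values) for key, values in template.items()},
    }

new_user = join_workspace("profile-123", "tmpl-claims-handler")
```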
Command Query Responsibility Segregation (CQRS) Model
If we were just navigating up and down a parent-child hierarchy of records, we would be pretty much done with our database design. However, we sometimes need to search through the records to find relevant results.
For example, let’s consider the simple request for all Home Insurance Policies in London that begin in June. The fact that we want Policies limits the search to just Policy Instances, but if we had to shred the XML to figure out which ones were for London homes that begin in June, we would be waiting a long time for the result – it is doable but VERY slow!
How would we solve this problem in a traditional approach? We would probably add a secondary index to the Policies table to optimize query performance for PolicyType, AddressCity and EffectiveStartDate. Obviously, this is not easy to do if we just have an Instances table with an XML field!
We can solve the problem by taking a related CQRS model approach of using a read-only lookup table. From experience on many projects with traditional database design approaches, we have found that adding optimized read-only look up tables is an excellent method for improving performance. We can do the same here by simply adding a database trigger to our Instances table and updating a set of read-only lookup tables to mirror the data we need to search and filter by.
Using the database trigger approach is going to add a minimal amount of time to the initial data insert/update, but having an optimized lookup table has a big impact on the performance of the searches we need. The lookup tables only need the set of fields that are going to be searched and not all fields from the record, they can also be indexed in multiple ways to serve many different search requests. Since these lookup tables and the stored procedures required to maintain them are static, they are excellent candidates for code generation techniques.
This can be best achieved by using the Data Presentation Model for each concept to determine whether a lookup table is required, and which fields will be used for searching. As the Data Presentation Model is updated, the code generator will create a lookup table and a stored procedure that will shred the XML. A trigger on the Instances table then checks the ConceptGUID of any modified record and decides which lookup table needs to be updated.
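Putting the pieces together, here is a hedged sketch of the read-side lookup pattern, with a Python function standing in for the generated trigger and stored procedure (schema, field names and shredding rules are all assumptions):

```python
# Sketch of the CQRS read-side lookup table: whenever a Policy instance is
# written, a trigger-like step shreds the searchable fields out of the XML
# into an indexed lookup table.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Instances (InstanceGUID TEXT PRIMARY KEY, XmlData TEXT)")
conn.execute("""
    CREATE TABLE PolicyLookup (
        InstanceGUID TEXT PRIMARY KEY,
        PolicyType   TEXT,
        AddressCity  TEXT,
        StartMonth   INTEGER
    )
""")
conn.execute("CREATE INDEX IX_Policy_Search "
             "ON PolicyLookup (PolicyType, AddressCity, StartMonth)")

def save_policy(guid, xml_data):
    conn.execute("INSERT INTO Instances VALUES (?, ?)", (guid, xml_data))
    # Trigger stand-in: shred only the searchable fields into the lookup table.
    root = ET.fromstring(xml_data)
    conn.execute(
        "INSERT INTO PolicyLookup VALUES (?, ?, ?, ?)",
        (guid, root.findtext("Type"), root.findtext("City"),
         int(root.findtext("StartMonth"))),
    )

save_policy("P1", "<Policy><Type>Home</Type><City>London</City>"
                  "<StartMonth>6</StartMonth></Policy>")
save_policy("P2", "<Policy><Type>Motor</Type><City>London</City>"
                  "<StartMonth>6</StartMonth></Policy>")

# "All Home policies in London starting in June" without touching the XML:
hits = [r[0] for r in conn.execute(
    "SELECT InstanceGUID FROM PolicyLookup "
    "WHERE PolicyType = 'Home' AND AddressCity = 'London' AND StartMonth = 6"
)]
```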
Doesn’t putting all our data into a single repository create a single point of failure for our solution? If we were putting a database on a physical disk in a physical server in an office location, the answer is yes, it does. But if we were doing that, we would likely have a lot of points of failure!
With modern database software, we have many ways of creating highly available database services. For example, if we are using Microsoft SQL Server as a database server we could use:
- Clustering of database servers
- Always On Availability Groups
- Azure SQL as a PaaS
There are many choices for building robust solutions with excellent availability characteristics.
A key advantage to using a DAS is that you only have a single Data at Rest repository to protect. You don’t end up with your data spread over multiple applications and micro-services, each of which has its own data repository. You can back up a single repository and know that your data is consistent across the solution, instead of having to back up multiple databases across multiple services without knowing whether the solution data would be consistent if you had to execute a restore after a failure.
Is there a limit to the amount of data you can put in a single Data at Rest repository?
Not really – the Data at Rest repository is a logical construct. The actual implementation of the underlying database storage can be anything.
In the simplest case our Data at Rest service could wrap a database which stores data and log files on separate disks.
As load on the Data at Rest service increases, we could look at vertical scaling of the database server by adding extra resources to increase CPU and RAM. For larger datasets, the key limits we might hit are Input/Output Operations Per Second (IOPS) and disk space.
A simple solution to disk issues is to implement Filegroups. This lets us partition a table across multiple data files. The data files can be placed on separate disk infrastructure, enabling us to scale both total disk space available and IOPS.
Using Filegroups enables multiple scaling strategies to distribute data across multiple disks. For example, we could partition the Instances table by Concept GUID and create separate Instance tables for different sets of Concepts.
Filegroups also allow us to use a database-level table and index partitioning system. The partitions can be defined by the database administrator using a partition function. Filegroups and partitioning strategies are implementation-specific to the technology platform and so will not be covered in detail here.
At some point, you might hit a vertical scaling limit with a single database server. We can add a second database server and use routing at the External Interface to implement a service-level partition strategy.
Routing at the Data at Rest service interface can even extend to routing data to separate data centers. Routing could at that point be driven by geography or different security requirements.
The Data at Rest service approach gives us a truly horizontal scaling strategy that can be simply extended as needed by the business without making extensive changes to data models and applications.
A critical success factor for modern business architecture is sustainability – we need to be able to maintain existing functionality and rapidly adapt to changing business needs, without incurring significant costs or risks.
Using a data agnostic approach for our Data at Rest repository means that we can easily make changes to our underlying data models. For example, we can easily:
- Add or remove fields – it’s just XML!
- Add or remove whole Concepts
- Restructure how Concepts inter-relate
- Reuse existing Concepts in multiple places
- Add new functionality to extend our CRUD+
- Change our underlying database storage technology if more effective and efficient solutions become available
Maintaining a strict separation of concerns means that we can make changes to how our data is stored without impacting how it is shared (in motion), used (in action) or presented.
Onboarding new resources is also easy – we don’t need to spend weeks with a new hire trying to understand the data model and relationships. We only have a few simple tables and stored procedures to understand, with an architecture that can be fully articulated on a few pages and presented in a single workshop.
Data Models for Free
Because we are only operating with a single Instance table, our functionality is immediately available to all types of data within the solution.
Using a data agnostic storage approach means that we can extend our data models indefinitely to provide more functionality to the enterprise, without adding more software. Essentially, we are getting more of our asset (functionality) without incurring additional liability (code).