
Data lifecycle management concept

Some notes and sloppy definitions

Goals

We would like to achieve the following things. They are listed in descending order of priority. Also note that we are open to ending up with different solutions for different goals:

Different approaches on the database level

On the database level, we discussed different basic approaches:

We will now go deeper into each approach.

Multiple databases

The only advantage of multiple databases that I see is that it’s easier to make a dump of only the current production data. Apart from that, the biggest disadvantages I see are

Single database solutions

These approaches have the advantages of:

The DLM_ArchiveLevel column to flag records

How do we logically mark records as archived/background, so that the system knows to exclude them from operational selects?

Note that we will go with the column name DLM_ArchiveLevel in the rest of this text. Of course that name might still change.

The bigger the number, the “further” the respective record is away from “production”. Or, depending on the solution we will end up with, different numbers might denote different archive-blocks, like archived data from different years.

Yet to be decided: shall DLM_ArchiveLevel be a regular AD_Column which is then added to every AD_Table in question? As of now, I don’t think so. Nevertheless, the information might be well-placed in POInfo

Single table

The idea here is to make sure that in the normal operation, the where-clause of every SQL statement will be augmented with something like ..AND COALESCE(DLM_ArchiveLevel,0)=0.
That way the system normally never “sees” any background data above the actual database level.

Note: the COALESCE relieves us from the need to initially update every existing record’s column value to 0. It’s much faster to just add a column (without a default value) than it is to add it and also set it. But apart from that, the COALESCE is just a technical detail and we will omit it further on. Instead we’ll just write DLM_ArchiveLevel=0.

Archived transaction data like C_Order records could still reference unarchived master data like AD_Client with FK constraints in place and without the need to also archive (and thus duplicate) the master data records.

We can implement the DLM_ArchiveLevel on a “deep” level by renaming tables such as C_Invoice to something like C_Invoice_Tbl and then creating a view such as

CREATE VIEW C_Invoice AS SELECT * FROM C_Invoice_Tbl WHERE DLM_ArchiveLevel = 0;

That way we would automatically have that where-clause in every SQL of our reports, functions, DAO-services etc.
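Taken together, the rename and the view creation from above might be sketched as follows (untested; including the COALESCE discussed earlier):

```sql
BEGIN;

-- rename the physical table...
ALTER TABLE C_Invoice RENAME TO C_Invoice_Tbl;

-- ...and put a trivial view with the original name on top of it
CREATE VIEW C_Invoice AS
  SELECT * FROM C_Invoice_Tbl
  WHERE COALESCE(DLM_ArchiveLevel, 0) = 0;

COMMIT;
```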

Once this is in place, we can hope to achieve performance gains by adding a (partial) index on the DLM_ArchiveLevel column.

We could also prevent all other indices from growing further by adding a ...WHERE DLM_ArchiveLevel=0 to them.
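As a sketch (the index name is made up), recreating an existing index with such a predicate could look like this:

```sql
-- covers only "production" rows, so the index stays small and
-- stops growing as records are migrated to higher archive levels
CREATE INDEX c_invoice_c_order_id ON C_Invoice_Tbl (C_Order_ID)
  WHERE COALESCE(DLM_ArchiveLevel, 0) = 0;
```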

Note: we might decide to create the new views in a dedicated schema dlm, because we already have almost 200 views (which contain actual business logic!) in the public schema, and adding those new trivial views might introduce too much noise there.
In that case we should add “dlm” to the PostgreSQL search_path.
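Assuming the database is called metasfresh (a made-up name for this sketch), that could be done once per database:

```sql
CREATE SCHEMA IF NOT EXISTS dlm;

-- make the dlm views resolvable without schema-qualifying them
ALTER DATABASE metasfresh SET search_path TO dlm, public;
```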

Using config settings to change the visible “scope”

The idea of “hardcoding” a WHERE DLM_ArchiveLevel=0 into every select has the drawback that now it is harder for the application to access background data in case the user actually wants to access it.

One solution might be to directly access the C_Invoice_Tbl table instead of the C_Invoice view, but it might turn out to be complicated to implement this at all code locations which generate SQL.

The solution which I think we should follow instead is to use postgres settings.

We would modify the above view to be:

CREATE VIEW C_Invoice AS SELECT * FROM C_Invoice_Tbl WHERE DLM_ArchiveLevel <= current_setting('metasfresh.DLM_ArchiveLevel')::integer;

In the server’s postgresql.conf we could add something like:

metasfresh.DLM_ArchiveLevel = '0'

(Note: the custom_variable_classes = 'metasfresh' declaration was only needed up to PostgreSQL 9.1; it was removed in 9.2, and from then on a custom dotted setting like this can be added directly.)

And then, in metasfresh, we could invoke

set_config('metasfresh.DLM_ArchiveLevel', '1', false)

This way we can implement metasfresh to run “globally” in a mode where also archive data is visible.

If we choose the boolean parameter (which is named is_local) to be true, then the setting will only be in effect for the current transaction. This way we can implement metasfresh to only see archived records in the context of one particular search.
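A transaction-local use of that could be sketched as:

```sql
BEGIN;

-- is_local = true: the setting only lives until this transaction ends
SELECT set_config('metasfresh.DLM_ArchiveLevel', '1', true);

-- this select now also "sees" records with DLM_ArchiveLevel = 1
SELECT * FROM C_Invoice;

COMMIT; -- back to the default from postgresql.conf
```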

Credits/further reading:

Update: this doesn’t work

We tried it, and the problem here is that postgres won’t use the partial index if we have current_setting('metasfresh.DLM_ArchiveLevel') in the view’s where clause.

I assume that the reason is related to current_setting() not being IMMUTABLE. That way, the planner doesn’t know whether DLM_ArchiveLevel <= current_setting('metasfresh.DLM_ArchiveLevel')::integer is equivalent to 0 <= 0.

It might also be the case that the function’s volatility is not the actual problem, but rather that the planner cannot match the view’s

WHERE COALESCE(DLM_ArchiveLevel,0) <= current_setting('metasfresh.DLM_ArchiveLevel')::integer

with the partial index’s

WHERE COALESCE(DLM_ArchiveLevel,0) <= 0

See this stackoverflow answer for more details.

Either way, we can’t have that function in the where clause and at the same time have a partial index.

So, in order to access archived data from metasfresh, we need to either query the underlying `_Tbl` table directly or first migrate the data in question back to “production”.

Update 2: no point in having a partial index

Inserts and updates

When we append _Tbl to the actual table name and add a view with the original name, then the question about write-access to the renamed table arises.

We can append _Tbl to the table name in each DML and DDL statement that we create in metasfresh.
Or we can use postgres to “translate” an insert or update that aims at a view into the respective operation on the underlying table.

As of now, we can assume that we will do DLM only on postgres-9.5, so we can make use of postgresql’s ability to forward an insert or update on our views directly to the underlying table. See e.g. here and here for more details.
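For example, a simple single-table view such as the C_Invoice view above is automatically updatable in PostgreSQL 9.3 and later, so ordinary DML just works against it (sketch; the column value and ID are made up):

```sql
-- this update against the view is forwarded to C_Invoice_Tbl by postgres
UPDATE C_Invoice
   SET DocStatus = 'CO'
 WHERE C_Invoice_ID = 1000001;
```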

Table partitioning

This approach builds on top of the “single” table approach.

In addition to what we might gain with the “Single table” approach, table partitioning offers these advantages:

Further reading to understand the underpinnings of multiple table spaces:

PostgreSQL-Version

We currently have PostgreSQL versions 9.1, 9.3 and 9.5 out there.

When rolling out DLM, we can assume that the DBMS is postgresql-9.5 or later.

Rough architectural outline

General implementation note:

I think there are four main parts to look at

Partitioner

By partitioner I mean a component that runs in background and creates partitions.

Partitioner and single table approach

It’s important to note that we need the partitioner, even if we “just” go with the single table approach.
This is because to the application, background data won’t be “there”, even with the single table approach (remember the where-clause-view).
Therefore, if we set DLM_ArchiveLevel=1 for a given C_Order record then we need to do the same for the C_Invoice records which reference that order via C_Invoice.C_Order_ID. If we don’t, we end up with a “dangling” reference and there will be errors in the application. Unlike in the “table partitioning” approach, we don’t have FK constraints to prevent such a situation.
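In other words, the partitioner has to migrate referencing records in the same step, e.g. like this (sketch with a made-up C_Order_ID):

```sql
BEGIN;

-- first migrate the invoices that reference the order...
UPDATE C_Invoice_Tbl SET DLM_ArchiveLevel = 1 WHERE C_Order_ID = 1000001;

-- ...then the order itself, so no "dangling" reference remains in production
UPDATE C_Order_Tbl SET DLM_ArchiveLevel = 1 WHERE C_Order_ID = 1000001;

COMMIT;
```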

Optimistic partitioner

However, if we learn that the single table approach in general can give us a performance boost already, we might be able to get away with a less rigorous or, say, “optimistic” partitioner. This partitioner would not add to a partition everything that is required to retain referential integrity, but just what we think usually goes together application-wise.

In that scenario, if an invoice is still within production, but its order can’t be loaded because it is already in background, and if the application needs to load that order, then we do as follows:

We could then provide ourselves with statistics about which records had to be handled this way, how often it occurred and how much time it cost. We can use these statistics to tune that partitioner.

On the other hand, we can’t handle reports in the same way, because they are “just” SQL, so we can’t trigger this sort of exception handling, and we might not even become aware of those problems.
The next section is about how we can solve that.

Also see the section “Idea: a “self-learning” combination of migrator and optimistic partitioner” further below.

Reports and single table approach

As written above, with the single table approach, no FK constraints prevent us from causing “dangling” references within our production data. As long as our reports are essentially SQL-based, we need to conclude that without some sort of automatic validation we can’t guarantee on a constant basis that

We can’t guarantee it, so it might sometimes work out and sometimes not. I see these options:

Integrity and single table approach

We should be able to add triggers which enforce FK-like integrity among records within the single table approach. Those triggers would only fire during migration, so there would probably be no performance degradation otherwise.

Example: assume that there is an FK constraint for C_Invoice.C_Order_ID referencing C_Order.C_Order_ID. Now we could add a trigger.

This would be the trigger. The name follows our convention for FK constraints, with a prepended “dlm_”:

CREATE CONSTRAINT TRIGGER dlm_corder_cinvoice
AFTER UPDATE OF DLM_ArchiveLevel ON C_Order
DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW
WHEN (NEW.DLM_ArchiveLevel > 0) /* only fire when we migrate out of production */
EXECUTE PROCEDURE dlm_corder_cinvoice_tgfn();

The trigger function could look like this:

CREATE OR REPLACE FUNCTION dlm_corder_cinvoice_tgfn()
  RETURNS trigger AS
$BODY$
DECLARE
	C_Invoice_C_Invoice_ID numeric;
	C_Invoice_DLM_ArchiveLevel integer;
BEGIN
	SELECT
		i.C_Invoice_ID, i.DLM_ArchiveLevel
	INTO
		C_Invoice_C_Invoice_ID, C_Invoice_DLM_ArchiveLevel
	FROM C_Invoice i
	WHERE
		i.C_Order_ID = NEW.C_Order_ID
		AND COALESCE(i.DLM_ArchiveLevel,0)=0
	LIMIT 1;

	IF C_Invoice_C_Invoice_ID IS NOT NULL
	THEN
		RAISE EXCEPTION
			'Migrating the C_Order record with C_Order_ID=% to DLM_ArchiveLevel=% violates the constraint dlm_corder_cinvoice',
			NEW.C_Order_ID, NEW.DLM_ArchiveLevel
		USING ERRCODE = '235D3', /* '23503' is defined as foreign_key_violation; we use 235D3 with D for DLM */
			HINT = 'The C_Invoice with C_Invoice_ID=' || C_Invoice_C_Invoice_ID || ' and DLM_ArchiveLevel=' || COALESCE(C_Invoice_DLM_ArchiveLevel, 0) || ' still references that order via its C_Order_ID column',
			DETAIL = 'DLM_Referencing_Table_Name=C_Invoice; DLM_Referencing_Column_Name=C_Order_ID;'; /* shall be parsable by metasfresh */
	END IF;
	RETURN NULL;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE;

The above is not yet tested and probably has syntax errors!

I believe we can provide ourselves with means to automatically generate these triggers and functions for existing FK constraints.
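The existing FK constraints can be enumerated from the catalog, which could serve as input for such a generator (a sketch):

```sql
-- list all foreign key constraints with their tables
SELECT c.conname,
       c.conrelid::regclass  AS referencing_table,
       c.confrelid::regclass AS referenced_table
FROM pg_constraint c
WHERE c.contype = 'f'
ORDER BY referenced_table;
```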

Further reading:

The partitioner’s config

Further notes:

How to store partitions

About using document-refid:

So, we store partitions by

Notes:

Coordinator

By coordinator I mean a component that checks out existing partitions and decides what to do with them. Not every partition shall “blindly” be migrated.

DLM_Coordinator

Migrator

By migrator I mean a component that can receive partitions and is responsible for migrating them from one “level” to the other.

Filter

An extension in metasfresh that by default always adds an additional condition to the where clause. The “normal” metasfresh code shall not have to care. In normal operation, the background data simply seems not to exist in the database.

We can achieve this using the “view” that was introduced in the “Single table” section.

Idea: a “self-learning” combination of migrator and optimistic partitioner

Outline:

This way, a user can configure what he/she knows or wants to add. The rest will be added automatically and might then be further customized by the user.

Increments and their estimations

DB level

Partitioner (TODO - 32h)

Coordinator (TODO - 16h)

Migrator (TODO - 32h)

Filter (8h)

Q&A

How time consuming is it to add another column to a large table?

alter table M_HU_Attribute add column DLM_ArchiveLevel integer is fast, because it just adds the column to the table metadata and does not assign a value to any row.
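For comparison, the variant with a default value is the slow one on PostgreSQL 9.5 (note that since PostgreSQL 11 a constant default no longer forces a table rewrite):

```sql
-- fast: metadata-only change; existing rows read the column as NULL
ALTER TABLE M_HU_Attribute ADD COLUMN DLM_ArchiveLevel integer;

-- slow on a big table (in 9.5): every existing row gets rewritten
-- ALTER TABLE M_HU_Attribute ADD COLUMN DLM_ArchiveLevel integer DEFAULT 0;
```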

