We would like to achieve the following things. They are prioritised in descending order. Also, note that we are open to ending up with different solutions for different goals:
On the database level, we discussed different basic approaches:
We will now go deeper into each approach.
The only advantage of multiple databases which I see is that it’s easier to make a dump of only the current production data. Apart from that, the biggest disadvantages I see are
These approaches have the advantages of:
DLM_ArchiveLevel column to flag records

How do we logically mark records as archived/background, so that the system knows to exclude them from operational selects?
A DLM_ArchiveLevel column would be nice. Then we could even have multiple levels. NULL or 0 would mean “not archived”.
Note that we will go with the column name DLM_ArchiveLevel in the rest of this text. Of course that name might still change.
The bigger the number, the “further” the respective record is away from “production”. Or, depending on the solution we will end up with, different numbers might denote different archive-blocks, like archived data from different years.
Yet to be decided: shall DLM_ArchiveLevel be a regular AD_Column which is then added to every AD_Table in question? As of now, I don’t think so. Nevertheless, the information might be well-placed in POInfo.
The idea here is to make sure that in normal operation, the where-clause of every SQL statement will be augmented with something like ...AND COALESCE(DLM_ArchiveLevel,0)=0.
That way the system normally never “sees” any background data above the actual database level.
Note: the COALESCE relieves us from the need to initially update every existing record’s column value to 0.
It’s much faster to just add a column (without a default value) than it is to add and also set it.
But apart from that, the COALESCE is just a technical detail and we will omit it further on. Instead we’ll just write DLM_ArchiveLevel=0.
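To illustrate with a made-up query (the table and columns are chosen only as an example):

SELECT C_Invoice_ID, GrandTotal FROM C_Invoice WHERE C_BPartner_ID = 1000001;

would, in normal operation, effectively be executed as

SELECT C_Invoice_ID, GrandTotal FROM C_Invoice WHERE C_BPartner_ID = 1000001 AND COALESCE(DLM_ArchiveLevel,0) = 0;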
Archived transaction data like C_Order
records could still reference unarchived master data like AD_Client
with FK constraints in place and without the need to also archive (and thus duplicate) the master data records.
We can implement the DLM_ArchiveLevel
on a “deep” level by renaming tables such as C_Invoice
to something like C_Invoice_Tbl
and then creating a view such as
CREATE VIEW C_Invoice AS SELECT * FROM C_Invoice_Tbl WHERE DLM_ArchiveLevel = 0;
That way we would automatically have that where-clause in every SQL of our reports, functions, DAO-services etc.
Once this is in place, we can hope to achieve performance gains by adding a (partial) index on the DLM_ArchiveLevel column.
We could also prevent all other indices from growing further by adding a ...WHERE DLM_ArchiveLevel=0 to them.
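A sketch of what those indices could look like; the index names and the C_BPartner_ID example are ours and not decided anywhere:

CREATE INDEX c_invoice_dlm_archivelevel ON C_Invoice_Tbl (DLM_ArchiveLevel); /* plain index on the new column */
CREATE INDEX c_invoice_c_bpartner_id ON C_Invoice_Tbl (C_BPartner_ID) WHERE COALESCE(DLM_ArchiveLevel,0) = 0; /* partial index: stops growing once rows are archived */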
Note: we might decide to create the new views in a dedicated schema dlm
, because we already have almost 200 views (which contain actual business logic!) in the public schema, and adding those new trivial views might introduce too much noise there.
We should in this case add “dlm” to the postgres search path.
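A sketch of that setup (the database name and the search_path order are assumptions on our side):

CREATE SCHEMA IF NOT EXISTS dlm;
ALTER DATABASE metasfresh SET search_path = public, dlm; /* with C_Invoice renamed to C_Invoice_Tbl in public, an unqualified C_Invoice then resolves to dlm.C_Invoice */

Note that ALTER DATABASE ... SET only takes effect for new sessions.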
The idea of “hardcoding” a WHERE DLM_ArchiveLevel=0
into every select has the drawback that now it is harder for the application to access background data in case the user actually wants to access it.
One solution might be to directly access the C_Invoice_Tbl
table instead of the C_Invoice
view, but it might turn out to be complicated to implement this at all code locations which generate SQL.
The solution which I think we should follow instead is to use postgres settings.
We would modify the above view to be:
CREATE VIEW C_Invoice AS SELECT * FROM C_Invoice_Tbl WHERE DLM_ArchiveLevel <= current_setting('metasfresh.DLM_ArchiveLevel')::integer;
In the server’s postgresql.conf we could add something like:
custom_variable_classes = 'metasfresh'
metasfresh.DLM_ArchiveLevel = '0'
(Note that custom_variable_classes only exists up to postgresql-9.1; from 9.2 on, custom settings such as metasfresh.DLM_ArchiveLevel can be added without declaring a class.)
And then, in metasfresh, we could invoke
set_config('metasfresh.DLM_ArchiveLevel', '1', false)
This way we can implement metasfresh to run “globally” in a mode where also archive data is visible.
If we choose the boolean parameter (which is named is_local
) to be true
, then the setting will only be in effect for the current transaction.
This way we can implement metasfresh to only see archived records in the context of one particular search.
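For example, to include archived records for one particular search only, a sketch (the concrete select is just for illustration):

BEGIN;
SELECT set_config('metasfresh.DLM_ArchiveLevel', '1', true); /* is_local = true: reverted at COMMIT/ROLLBACK */
SELECT COUNT(1) FROM C_Invoice; /* now also sees records with DLM_ArchiveLevel = 1 */
COMMIT;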
Credits/further reading:
We tried it and the problem here is that postgres won’t use the partial index if we have current_setting('metasfresh.DLM_ArchiveLevel')
in the view’s where clause.
I assume that the reason is related to current_setting() being VOLATILE. That way, the planner doesn’t know when DLM_Level <= current_setting('metasfresh.DLM_ArchiveLevel') is equivalent to DLM_Level <= 0.
It might also be the case that the function’s volatility is not the actual problem, but instead that the planner can’t match the view’s
WHERE COALESCE(DLM_Level,0) <= current_setting('metasfresh.DLM_ArchiveLevel')::integer
with the partial indices’
WHERE COALESCE(DLM_Level,0) <= 0
See this stackoverflow answer for more details.
Either way, we can’t have that function in the where clause and at the same time have a partial index.
So, in order to access archived data from metasfresh, we need to either query the underlying _Tbl table directly or first migrate the data in question back to “production”.
Should we at some point not need the partial index anymore, we could still add current_setting('metasfresh.DLM_ArchiveLevel') to the view’s where-clause and gain more flexibility.
When we append _Tbl to the actual table name and add a view with the original name, then the question about write-access to the renamed table arises.
We can append _Tbl
to the table name in each DML and DDL statement that we create in metasfresh.
Or we can use postgres to “translate” an insert or update that aims at a view into the respective operation on the underlying table.
As of now, we can assume that we will do DLM only on postgres-9.5, so we can make use of postgresql’s ability to forward an insert or update on our views directly to the underlying table. See e.g. here and here for more details.
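A minimal sketch of what that means (the DocStatus update is just an illustrative example): since the C_Invoice view from above selects from a single table without aggregates etc., postgres (since 9.3) treats it as automatically updatable, so

UPDATE C_Invoice SET DocStatus = 'CO' WHERE C_Invoice_ID = 1000001;

is applied to the underlying C_Invoice_Tbl without us having to add any trigger or rule.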
This approach builds on top of the “single” table approach.
In addition to what we might gain with the “Single table” approach, table partitioning offers those advantages:
- If a select without the DLM_ArchiveLevel=0 restriction is made, then all the data would be retrieved.
- We could use the pg_dump tool to exclude the background data tables to achieve a smaller dump which contains just the production data.

Further reading to understand the underpinnings of multiple table spaces:

pg_dump offers --exclude-table-data, which takes a table pattern and can be given multiple times; we could have a tool (a select on pg_catalog, see the sketch below) to create the list of tables to exclude.

We currently have PostgreSQL versions 9.1, 9.3 and 9.5 out there.
When rolling out DLM, we can assume that the DBMS is postgresql-9.5 or later.
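Regarding the pg_dump exclusion mentioned above, a sketch of such a select on pg_catalog; the naming pattern “_dlm<number>” for background tables is purely our assumption here:

SELECT n.nspname || '.' || c.relname
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND n.nspname = 'public'
AND c.relname ~* '_dlm[0-9]+$';

Each resulting name could then be handed to pg_dump via a --exclude-table-data option.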
General implementation note:
I think there are four main parts to look at
By partitioner I mean a component that runs in the background and creates partitions.
- AD_Client is referenced by almost every other record. Also the table is small, and it is “masterdata” that can change over time. Since we realize DLM via a single-DB approach and the archived records can reference AD_Client, there is no point to have AD_Client records as part of the partition.
- C_AllocationHdr is not referencing anything (besides AD_Client etc.), but is directly and indirectly referenced. Then, for the sake of preserving references, we would not have to migrate it into the archive.
- C_BPartner might or might not be a case for DLM, depending on whether there are many partners and on how many of them become “ex-partners” over time.
- References are followed from the referencing column to the referenced table (e.g. from C_OrderLine.C_Order_ID to C_Order), because walking “backward” is too big of a performance penalty.
- AD_Client is not to be migrated. If we do table partitioning, then the FK constraints need to be managed accordingly in the underlying database (i.e. FK from the respective archive tables to the AD_Client table).

It’s important to note that we need the partitioner, even if we “just” go with the single table approach.
This is because to the application, background data won’t be “there”, even with the single table approach (remember the where-clause-view).
Therefore, if we set DLM_ArchiveLevel=1 for a given C_Order record, then we need to do the same for the C_Invoice records which reference that order via C_Invoice.C_Order_ID.
If we don’t, we end up with a “dangling” reference and there will be errors in the application. Unlike in the “table partitioning” approach, we don’t have FK constraints to prevent such a situation.
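A minimal sketch of what the partitioner has to do in this example (the IDs are made up, and a real partitioner would of course compute the complete set of referencing records):

UPDATE C_Order SET DLM_ArchiveLevel = 1 WHERE C_Order_ID = 1000001;
UPDATE C_Invoice SET DLM_ArchiveLevel = 1 WHERE C_Order_ID = 1000001;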
However, if we learn that the single table approach in general can give us a performance boost already, we might be able to get away with a less rigorous or say “optimistic” partitioner. This partitioner would not add to a partition everything that is required to retain referential integrity, but just what we think usually goes together application-wise.
In that scenario, if an invoice is still within production, but its order can’t be loaded because it is already in background, and if the application needs to load that order, then we do as follows:
We could then provide ourselves with statistics about which records had to be handled this way, how often it occurred and how much time it cost. We can use these statistics to tune that partitioner.
On the other hand, we can’t handle reports in the same way, because they are “just” SQL, so we can’t trigger this sort of exception handling, and we might not even become aware of those problems.
The next section is about how we can solve that.
Also see the section “Idea: a “self-learning” combination of migrator and optimistic partitioner” further below.
As written above, with the single table approach, no FK constraints prevent us from causing “dangling” references within our production data. As long as our reports are essentially SQL-based, we have to conclude that without some sort of automatic validation we can’t guarantee on a constant basis that a DLM_ArchiveLevel=0 record references only DLM_ArchiveLevel=0 records.
We can’t guarantee it, so it might sometimes work out and sometimes not. I see these options:
- … the DLM_ArchiveLevel=0 where-clause).
- We should be able to add triggers which enforce FK-like integrity among records, within the single table approach. Those triggers would only fire during migration, so there would probably be no performance degradation otherwise.
Example: assuming that there is an FK constraint for C_Invoice.C_Order_ID referencing C_Order.C_Order_ID, we could add a trigger.
This would be the trigger. The name follows our convention for FK constraints, with a prepended “dlm_”:
CREATE CONSTRAINT TRIGGER dlm_corder_cinvoice
AFTER UPDATE OF DLM_ArchiveLevel ON C_Order
DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW
WHEN (NEW.DLM_ArchiveLevel > 0) /* only fire when we migrate out of production */
EXECUTE PROCEDURE dlm_corder_cinvoice_tgfn();
The trigger function could look like this:
CREATE OR REPLACE FUNCTION dlm_corder_cinvoice_tgfn()
RETURNS trigger AS
$BODY$
DECLARE
  C_Invoice_C_Invoice_ID numeric(10,0);
  C_Invoice_DLM_ArchiveLevel integer;
BEGIN
  /* look for an invoice that still references the migrated order and is itself still in "production" */
  SELECT INTO
    C_Invoice_C_Invoice_ID, C_Invoice_DLM_ArchiveLevel
    i.C_Invoice_ID, i.DLM_ArchiveLevel
  FROM C_Invoice i
  WHERE
    i.C_Order_ID = NEW.C_Order_ID
    AND COALESCE(i.DLM_ArchiveLevel,0) = 0
  LIMIT 1;

  IF C_Invoice_C_Invoice_ID IS NOT NULL
  THEN
    RAISE EXCEPTION
      'Migrating the C_Order record with C_Order_ID=% to DLM_ArchiveLevel=% violates the constraint dlm_corder_cinvoice',
      NEW.C_Order_ID, NEW.DLM_ArchiveLevel
      USING ERRCODE = '235D3', /* '23503' is defined as foreign_key_violation; we use 235D3 with D for DLM */
            HINT = 'The C_Invoice with C_Invoice_ID=' || C_Invoice_C_Invoice_ID || ' and DLM_ArchiveLevel=' || COALESCE(C_Invoice_DLM_ArchiveLevel, 0) || ' still references that order via its C_Order_ID column',
            DETAIL = 'DLM_Referencing_Table_Name=C_Invoice; DLM_Referencing_Column_Name=C_Order_ID;'; /* shall be parsable by metasfresh */
  END IF;
  RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;
The above is not yet tested and probably has syntax errors!
I believe we can provide ourselves with means to automatically generate these triggers and functions for existing FK constraints.
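As a starting point for such a generator, the existing FK constraints can be read from the catalog, for example (the output columns are our choice):

SELECT con.conname,
       con.conrelid::regclass  AS referencing_table,
       con.confrelid::regclass AS referenced_table
FROM pg_catalog.pg_constraint con
WHERE con.contype = 'f'
ORDER BY referencing_table;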
Further reading:
DLM_Partion_Config

DLM_Partion_Config_Version
- DLM_Partion_Config_ID: link to parent table
- Processed (see below)

DLM_PartionLine_Config
- DLM_Partion_Config_Version_ID: link to parent table
- AD_Table_ID: a table that shall be subject to DLM, e.g. C_Invoice
- Description etc.

DLM_PartionReference_Config
- DLM_PartionLine_Config_ID: link to parent table
- DLM_Referencing_Column_ID: a column of the table DLM_PartionLine_Config.AD_Table_ID that references another record, e.g. C_Invoice.C_Order_ID
- DLM_Referenced_Table_ID: the table that is referenced by DLM_Referencing_Column_ID, e.g. C_Order
- DLM_Referencing_PartionLine_Config_ID: FK-reference to DLM_PartionLine_Config_ID; if the referenced table is also part of DLM according to this config-version, then this column references the respective line.
Further notes:
- We navigate from the referencing C_Invoice via C_Invoice.C_Order_ID to the referenced C_Order, and not from an order to the referencing invoices.

About using document-refid:
- It would mean to create a C_ReferenceNo_Doc record for “every” DLM’ed database record. This is a no-go.
- We might still use document-refid for special cases, e.g. if we want to group a few records and export them for a test case, but not for the actual DLM-ish partitioning. We can later add such a feature on-top, if we want to.

So, we store partitions by adding another column DLM_Partion_ID to every DLM’ed table. With this column, each record can belong to one partition, which is enough for DLM.
Important: DLM_Partion_ID shall have no FK constraint. See the notes below.
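A minimal sketch of that column (the table name is just an example); note that we deliberately add no FK constraint:

alter table C_Invoice_Tbl add column DLM_Partion_ID integer;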
DLM_Partion
- DLM_Partion_Config_Version_ID
- DLM_ArchiveLevel: level of this record
- DLM_Partion_ArchiveLevel: level of the records that belong to the partition
- DLM_Partion_Target_ArchiveLevel: set by the “Coordinator” component (see below) in case the partition can be archived
- DLM_Revalidate_After: timestamp set by the “Coordinator” component (see below), in case the partition can not yet be archived and shall be revalidated in the future.

DLM_Partion_Index
- DLM_Partion_Config_Version_ID
- DLM_Partion_ArchiveLevel: level of the records that belong to the indexed partitions
- DLM_Partion_From_ID
- DLM_Partion_To_ID

Notes:
- DLM_Partion records can be temporary or can also themselves be DLM’ed. The important non-redundant information they hold is DLM_Partion_Target_ArchiveLevel and DLM_Revalidate_After. These infos matter after a partition was created but before it is archived. In addition, we can use the records to store detailed information (audit, debug etc.), but we can also drop them.
- We might not even need DLM_Partion records in the first place, if the records are directly moved to their designated DLM_Partion_Target_ArchiveLevel in one go.
- DLM_Partion_Index can’t be DLM’ed or archived. But I think we can do with a relatively small number of DLM_Partion_Index records that cover large ranges of consecutive DLM_Partion_ID values which all belong to the same DLM_ArchiveLevel and were all partitioned using the same DLM_Partion_Config_Version_ID.

By coordinator I mean a component that checks out existing partitions and decides what to do with them. Not every partition shall “blindly” be migrated.
- For example, we might require not only Processed='Y', but also Paid='Y' to be true.
- In case a partition can not yet be archived, the coordinator schedules it for revalidation (DLM_Partion.DLM_Revalidate_After).
- In case a partition can be archived, the coordinator sets the target level (DLM_Partion.DLM_Partion_Target_ArchiveLevel).

DLM_Partion_Log

DLM_Coordinator
- Description
- AD_JavaClass_ID which points to the implementation code

By migrator I mean a component that can receive partitions and is responsible to migrate them from one “level” to the other.
- The migrator updates the respective records’ DLM_ArchiveLevel columns.
- … DLM_ArchiveLevel.

An extension in metasfresh that by default always adds an additional condition to the where clause. The “normal” metasfresh code shall not have to care. In normal operation, the background data just seems not to exist in the database.
We can achieve this using the “view” that was introduced in the “Single table” section.
Outline:
- To start DLM’ing e.g. C_Invoice, we just add the table C_Invoice to the config.
- The partitions created from this minimal config will contain one C_Invoice record each.
- If we then migrate a C_Invoice record that is referenced by a C_Payment which is not also migrated, then our trigger will raise an error with error-code 235D3.

This way, a user can configure what he/she knows or wants to add. The rest will be added automatically and might then be further customized by the user.
Update to postgresql-9.5 (TODO)
- Adapt postgresql.conf (don’t forget comments/docs).
- Add a flag to AD_Table. It indicates if e.g. C_BPartner is just a view and the actual table is named C_BPartner_Tbl.
- Use set_config() to make sure the rest of metasfresh always gets the data it needs in the given context.
- Add a DLM_ArchiveLevel field to Ini.java and the preferences dialog so that a user can include archived records in the standard client, as long as we didn’t implement any deep archiving which would require the data to be refetched from some background storage. In other words, as long as we go with a single-database solution, all data shall be retrievable with a sufficiently high DLM_ArchiveLevel.

alter table M_HU_Attribute add column DLM_ArchiveLevel integer is fast, because it just adds the column to the table metadata, but does not assign a value to any row.