A question keeps raising its head in different guises: “I've been told I'm on the hook for lineage, what should we do?” Strictly speaking, the question has normally found its way into Technology’s hands by the time anyone has an opinion on which tool would solve it, aka the “How”. Unfortunately the “What” isn't really known and, worse yet, the “Why” often isn't clearly articulated beyond “because the regulator needs it”.
In this article I want to offer a perspective on how all of these parts tie together. By the end we should have some working definitions that provide a clear language for data movement concepts and help answer the “Why”.
Data Requirement Scenarios
Before going to the “How”, let's look at several scenarios for data requirements. In the interest of transparency, most of my experience has been in the finance sector, so the scenarios carry a natural bias. If readers comment with further scenarios, from Finance or other industries, I will update the article to include them.
Requirements | Description |
---|---|
Regulatory Lineage | In the financial crisis that started in 2007, the financial regulators found that although Financial Institutions (FIs) could process their data, their control and understanding of that data was lacking. “Data Lineage” was therefore requested of FIs to demonstrate that they knew “What” data was important to them and “Where” it was in their data ecosystem. Example requirement: BCBS 239 / RDARR |
Data Protection (privacy) | As the implications of uncontrolled sharing of personal data become understood, legislation is being created to protect individuals’ privacy rights within specific jurisdictions. Seven principles are commonly recommended for the protection of personal data. Example requirement: GDPR, EU Data Protection |
Data Protection (breach) | Legislation requiring private or governmental entities to notify individuals of security breaches involving personally identifiable information. Example requirement: National Conference of State Legislatures |
Data Landscape Simplification | Digital data is easily copied, but hard to control. Organisations are often under pressure to add data capacity, but there is little recognition of the cost and control consequences of this data explosion. Increasingly, technology is tasked with reducing costs and increasing responsiveness via transformational simplification programmes. |
Aggregation of Similar Elements & Alignment of Meaning | With the focus on data processing, the actual meaning of the content is overlooked. This often means data was created to meet a specific need, but incorrectly reused later for a different one. The most common examples occur downstream in aggregation platforms (e.g. data warehouses, data marts, etc.). For example, within the same organisation different lines of business (LOB) all report “Net Operating Income (NOI)”, but the actual calculation was performed differently at each LOB. While each LOB knew what its figure meant for its specific use, the figures couldn't be trivially aggregated to give an enterprise view of NOI. Understanding what an element means is therefore a key requirement to ensure aggregation produces the expected results. Example requirement: BCBS 239 / RDARR |
Chain of Custody | Chain of custody is normally restricted to computer forensic seizure. It covers all information on an electronic file’s travels from its original creation version to its final production version: a detailed account of the location of each document/file from the beginning of a project until the end. A sound chain of custody verifies that you have not altered information either in the copying process or during analysis. If you cannot show the chain of custody, you may have a difficult time disproving that outside influences tampered with the data. |
Suspicious or illegal activity detection | Laws and regulations apply to organizations to ensure they are not used as a vehicle for illegal or restricted activities. Common examples for financial organizations include requirements covering anti-money laundering, know your customer (KYC), fraud, and sanctions. Example requirement: Department of Financial Services: Part 504 (DFS 504), FINRA Anti-Money Laundering, UK Money Laundering Regulations 2007 |
Data Trust | When business decisions are based on data, the trustworthiness of the input data needs to be understood in order to make a qualified decision. Since the trustworthiness of data determines how much weight it carries in a final decision, being able to quantify how much trust can be put into a data set is fundamental to using it appropriately. |
Data Management Tools
Given these requirements, what data management tools do we have to help meet them? The table below is a set of definitions to help articulate some of these tools. I have also included some broader concepts to highlight and differentiate data management concepts for more generalized use.
Term | Meaning |
---|---|
Data Flow | is the flow of data from one point to another, without involving any intermediaries at a specific level of granularity, to transport data. A data flow may be identified at different levels of granularity (e.g. Organization, Application, Transport layer). |
Process Flow | is the transition from one business process to another, and may encapsulate a Data Flow as part of the transition. |
Lineage (general term) | 1. descent in a line from a common progenitor 2. a group of individuals tracing descent from a common ancestor; especially: such a group of persons whose common ancestor is regarded as its founder |
Data Lineage | is a record of the sequence of Data Flows involving a data element, from the reference point of interest back to a point that meets the agreed requirements or risk appetite. The information captured about the movement may additionally include data elements that were used to create the requested data element. |
Process Lineage | is a record of Process Flows between business processes to perform a specific business function. |
Provenance (general term) | is a record of ownership of a work of art or an antique, used as a guide to authenticity or quality. https://en.oxforddictionaries.com/definition/provenance |
Data Provenance | is the documentation of data in sufficient detail to allow reproducibility of a specific dataset. [Data] Provenance can be used for many purposes, such as understanding how data was collected so it can be meaningfully used, determining ownership and rights over an object, making judgements about information to determine whether to trust it, verifying that the process and steps used to obtain a result complies with given requirements, and reproducing how something was generated. https://www.w3.org/TR/prov-primer/ |
Data Classification | is the process of sorting and categorizing data into various types, forms or any other distinct class. Data classification enables the separation and classification of data according to data set requirements for various business or personal objectives. This can be of particular importance for risk management, legal discovery, and compliance. https://www.techopedia.com/definition/13779/data-classification |
Traceability (general term) | is a general term indicating something can be linked to another artifact. |
Data Traceability | is the ability to track a data construct back to the construct it was derived from as a more concrete instantiation. For example, a physical column may trace back to a logical attribute, which in turn may trace to a business term, which traces back to a concept. Data Traceability is commonly confused with Data Lineage and Data Provenance. |
Link Analysis | is a data-analysis technique used to evaluate relationships (connections) between nodes within a given area and determine if the relationship is material within a given scenario. Commonly confused with Data Traceability. |
Data Quality | is the quantitative assessment of a dataset using a set of data quality rules that verify the dataset is fit for its intended purpose. |
Provisioning Point | is a mechanism to distribute data to help ensure the appropriate source of data is used throughout the organization. |
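To make the Data Flow and Data Lineage definitions concrete, here is a minimal sketch in Python. It assumes a Data Flow can be modelled as a single source-to-target hop, and that the "point that meets the agreed requirements or risk appetite" can be expressed as a set of stopping points. The `DataFlow` class, the `data_lineage` function and all system names are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    """One hop of data from a source to a target, with no intermediaries."""
    source: str
    target: str


def data_lineage(flows, element, stop_at):
    """Walk Data Flows backwards from `element` until a point in `stop_at`.

    `stop_at` stands in for the agreed requirements or risk appetite.
    Returns the flows traversed, nearest hop first.
    """
    # Index flows by target so each step can look up what feeds a point.
    upstream = {}
    for flow in flows:
        upstream.setdefault(flow.target, []).append(flow)

    lineage, seen, frontier = [], {element}, [element]
    while frontier:
        current = frontier.pop(0)
        if current in stop_at:
            continue  # agreed boundary reached; do not trace further back
        for flow in upstream.get(current, []):
            lineage.append(flow)
            if flow.source not in seen:
                seen.add(flow.source)
                frontier.append(flow.source)
    return lineage


# Hypothetical application-level flows, for illustration only.
flows = [
    DataFlow("trade_capture", "ledger"),
    DataFlow("reference_data", "ledger"),
    DataFlow("ledger", "warehouse"),
    DataFlow("warehouse", "risk_report"),
    DataFlow("vendor_feed", "trade_capture"),
]

result = data_lineage(
    flows, "risk_report", stop_at={"trade_capture", "reference_data"}
)
for flow in result:
    print(f"{flow.source} -> {flow.target}")
```

In this sketch the flow from `vendor_feed` is never reported, because tracing stops at `trade_capture`. That mirrors the definition: lineage runs back only as far as the agreed boundary, not necessarily to the ultimate origin of the data.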
Requirement and Data Management Tool Alignment
Given these definitions, how do we use these tools to help meet the requirements? Solving the holistic problem often requires additional tools outside the scope of the data management tool kit, so the tools identified below are only those within the data management scope.
Data Management Requirements | Data Management Tools |
---|---|
Regulatory Lineage | Data Lineage |
Data Protection (privacy) | Data Classification, Data Flow |
Data Protection (breach) | Data Classification, Data Flow |
Data landscape simplification | Data Classification, Data Flow |
Aggregation of similar elements & Alignment of meaning | Data Traceability |
Chain of custody | Data Provenance |
Data Trust | Data Quality, Data Provenance, Provisioning Point |
Suspicious or illegal activity detection | Link Analysis, Data Lineage |
Interestingly Data Provenance isn’t used outside of Chain of Custody and Data Trust, and that may surprise some. It seems there is often a blurring of lines when dealing with Data Provenance and Data Lineage. The general rule I use to clarify which one is appropriate to use is:
Q: Do you need to record a specific date and time that the data was processed/handled in an activity?
If the answer is yes, you need “Data Provenance”. If the timestamp and location are not required for a specific instance of data, you are most likely looking at “Data Lineage”.
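One way to picture this rule of thumb is through the shape of the record each tool would keep. The sketch below is an assumption about how such records might look, not a standard schema: `LineageRecord`, `ProvenanceRecord`, `tool_for` and every field value are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageRecord:
    """Describes the path data takes, independent of any particular run."""
    source: str
    target: str


@dataclass(frozen=True)
class ProvenanceRecord:
    """Describes one specific handling of one specific instance of data."""
    source: str
    target: str
    dataset_instance: str   # hypothetical identifier for the batch or file
    processed_at: datetime  # when this instance was handled
    location: str           # where it was handled


def tool_for(needs_timestamp_and_location: bool) -> str:
    """The rule of thumb above, expressed as a function."""
    if needs_timestamp_and_location:
        return "Data Provenance"
    return "Data Lineage"


# Illustrative values only.
path = LineageRecord("ledger", "warehouse")
handling = ProvenanceRecord(
    source="ledger",
    target="warehouse",
    dataset_instance="batch-001",
    processed_at=datetime(2024, 1, 31, 2, 0, tzinfo=timezone.utc),
    location="primary-data-centre",
)

print(tool_for(needs_timestamp_and_location=True))
print(tool_for(needs_timestamp_and_location=False))
```

The two records share the same source and target; the provenance record differs only in the instance-level detail it adds, which is the distinction the question is testing for.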
Conclusion
In summary, this article covers various data management scenarios that often come up, and provides language and definitions for the data management tools that help meet those requirements. I hope we can agree on the differences between Data Flows, Data Lineage, Data Provenance and Data Traceability.
Please post any comments on this LinkedIn page.
Author: Gareth Isaac - gareth.isaac@ortecha.com