So a question keeps rearing its head in different guises: “I've been told I'm on the hook for lineage, what should we do?”. Strictly speaking, the question has normally found its way into Technology’s hands before anyone has an opinion on which tool would solve it - aka the “How”. Unfortunately the “What” isn't really known, and worse yet the “Why” often isn't clearly articulated beyond “because the regulator needs it”.

So in this article I want to offer a perspective on how all of these parts tie together. By the end we should have some working definitions that provide a clear, shared language for data movement concepts and help answer the “Why”.

Data Requirement Scenarios

Before getting to the “How”, let's look at several scenarios for data requirements. For full transparency and completeness: most of my experience has been in the finance sector, so there will be a natural bias. If comments surface additional scenarios in finance or other industries, I will update the article to include them.


Regulatory Lineage

In the wake of the financial crisis that started in 2007, financial regulators found that although Financial Institutions (FIs) could process their data, their control and understanding of that data was lacking. “Data Lineage” was therefore requested of FIs to demonstrate that they knew “What” data was important to them and “Where” it was in their data ecosystem.
Example requirement: BCBS 239 / RDARR

Data Protection (privacy)

As the implications of uncontrolled sharing of personal data become better understood, legislation is being created to protect individuals’ privacy rights within specific jurisdictions.

Seven recommended principles for protection of personal data are:

  1. Notice—data subjects should be given notice when their data is being collected;
  2. Purpose—data should only be used for the purpose stated and not for any other purposes;
  3. Consent—data should not be disclosed without the data subject’s consent;
  4. Security—collected data should be kept secure from any potential abuses;
  5. Disclosure—data subjects should be informed as to who is collecting their data;
  6. Access—data subjects should be allowed to access their data and make corrections to any inaccurate data;
  7. Accountability—data subjects should have a method available to them to hold data collectors accountable for not following the above principles.

Example requirement: GDPR, EU Data Protection

Data Protection (breach)

Legislation requiring private or governmental entities to notify individuals of security breaches involving personally identifiable information.

Example requirement: National Conference of State Legislatures

Data Landscape Simplification

Digital data is easy to copy but hard to control. Organisations are often under pressure to add data capacity, with little recognition of the cost and control consequences of this data explosion. Increasingly, technology is tasked with reducing costs and increasing responsiveness via transformational simplification programmes.

Aggregation of Similar Elements & Alignment of Meaning

With the focus on data processing, the actual meaning of the content is often overlooked. This often meant that data created to meet one specific need was later incorrectly reused for a different purpose. The most common examples occur downstream in aggregation platforms (e.g. data warehouses, data marts, etc.).

For example, different lines of business (LOBs) within the same organisation all report “Net Operating Income (NOI)” - but each LOB performed the calculation differently. While each LOB knew what its figure meant for its specific use, the figures could not be trivially aggregated into an enterprise view of NOI. Understanding what an element means is therefore a key requirement for ensuring aggregation produces the expected results.

Example requirement: BCBS 239 / RDARR
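
To make the NOI example concrete, here is a hypothetical sketch (the formulas and figures are purely illustrative, not actual LOB accounting): two LOBs compute a figure labelled “NOI” differently, so the naive sum is not a well-defined enterprise NOI.

```python
# Hypothetical illustration: two lines of business each report "NOI",
# but compute it differently, so a naive enterprise roll-up is misleading.

def noi_retail(revenue, operating_expenses):
    # Retail LOB: NOI = revenue minus operating expenses
    return revenue - operating_expenses

def noi_commercial(revenue, operating_expenses, depreciation):
    # Commercial LOB: also deducts depreciation before reporting "NOI"
    return revenue - operating_expenses - depreciation

retail = noi_retail(1000, 400)               # 600
commercial = noi_commercial(1000, 400, 150)  # 450

# The two figures share a label but not a definition; summing them
# does not give a well-defined enterprise NOI.
naive_enterprise_noi = retail + commercial   # 1050, but 1050 of *what*?
```

Traceability back to the underlying definition of each element is what exposes this mismatch before the numbers are rolled up.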

Chain of Custody

Chain of custody is normally associated with computer forensic seizure: a detailed account of an electronic file’s travels, from its original creation version to its final production version, and of the location of each document/file from the beginning of a project until the end. A sound chain of custody verifies that information has not been altered, either in the copying process or during analysis. If you cannot show the chain of custody, you may have a difficult time proving that outside influences did not tamper with the data.

Suspicious or illegal activity detection

Laws and regulations require organizations to ensure they are not used as a vehicle for illegal or restricted activities.

Common examples in financial organizations include the requirements to investigate anti-money laundering (AML), know your customer (KYC), fraud, and sanctions.

Example requirement: Department of Financial Services: Part 504 (DFS 504), FINRA Anti-Money Laundering, UK Money Laundering Regulations 2007

Data Trust

When making business decisions based on data, the trustworthiness of the input data needs to be understood in order to make a qualified decision. Since trustworthiness modifies the weight data carries in a final decision, being able to quantify how much trust can be placed in any dataset is fundamental to using it appropriately to make an informed decision.
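
As a minimal sketch of that idea (the source names and trust scores below are illustrative assumptions, not a prescribed method), a trust score per source can be used to weight values when combining them into a single decision input:

```python
# Hypothetical sketch: weighting data sources by a trust score when
# combining them into a single decision input.

sources = [
    # (source name, reported value, trust score in [0, 1],
    #  where the score might be derived from data quality checks
    #  and provenance information)
    ("internal_warehouse", 102.0, 0.9),
    ("vendor_feed",         98.0, 0.6),
    ("manual_upload",      110.0, 0.3),
]

# Trust-weighted average: more trusted sources pull the result harder.
total_weight = sum(trust for _, _, trust in sources)
weighted_value = sum(value * trust for _, value, trust in sources) / total_weight
```

The exact weighting scheme matters less than the principle: trust is quantified, not assumed, and it visibly changes the outcome.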

Data Management Tools

Given these requirements, what data management tools do we have to help meet them? The definitions below help articulate some of these tools; I have also included some broader concepts to help highlight and differentiate data management concepts for more generalised use.

Data Flow is the flow of data from one point to another, without involving any intermediaries, at a specific level of granularity. A data flow may be identified at different levels of granularity (e.g. Organization, Application, Transport layer).

Process Flow is the transition from one business process to another; it may encapsulate a Data Flow as part of the transition.

Lineage (general term):
  1. descent in a line from a common progenitor;
  2. a group of individuals tracing descent from a common ancestor, especially such a group whose common ancestor is regarded as its founder.

Data Lineage is a record of the sequence of Data Flows involving a data element, from the reference point of interest back to a point that meets the agreed-to requirements or risk appetite. The captured record of the movement may additionally include the data elements that were used to create the requested data element.

Process Lineage is a record of Process Flows between business processes to perform a specific business function.

Provenance (general term) is a record of ownership of a work of art or an antique, used as a guide to authenticity or quality.

Data Provenance is the documentation of data in sufficient detail to allow reproducibility of a specific dataset. Provenance can be used for many purposes, such as understanding how data was collected so it can be meaningfully used, determining ownership and rights over an object, judging whether information can be trusted, verifying that the process and steps used to obtain a result comply with given requirements, and reproducing how something was generated.

Data Classification is the process of sorting and categorizing data into various types, forms or other distinct classes, according to dataset requirements for various business or personal objectives. This can be of particular importance for risk management, legal discovery, and compliance.

Traceability (general term) indicates that something can be linked to another artifact.

Data Traceability is the ability to track a data construct back to the construct it was derived from as a more concrete instantiation. For example, a physical column may trace back to a logical attribute, which in turn may trace to a business term, which traces back to a concept. Data Traceability is commonly confused with Data Lineage and Data Provenance.

Link Analysis is a data-analysis technique used to evaluate relationships (connections) between nodes within a given area and determine whether a relationship is material within a given scenario. It is commonly confused with Data Traceability.

Data Quality is the quantitative assessment of a dataset using a set of data quality rules that verify the dataset is fit for its intended purpose.

Provisioning Point is a mechanism for distributing data that helps ensure the appropriate source of data is used throughout the organization.
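
As a minimal sketch of how some of these definitions relate (all system names here are illustrative assumptions), a Data Flow can be modelled as an edge between two points, and Data Lineage as the record of flows walked back from a reference point towards the agreed stopping points:

```python
# Sketch: Data Flows as edges, Data Lineage as the record of flows
# walked back from a reference point. System names are hypothetical.
from collections import namedtuple

DataFlow = namedtuple("DataFlow", ["source", "target"])

flows = [
    DataFlow("trading_system", "risk_warehouse"),
    DataFlow("risk_warehouse", "regulatory_report"),
    DataFlow("reference_data", "risk_warehouse"),
]

def lineage(element, flows, stop_points=frozenset()):
    """Walk Data Flows backwards from `element` until a stop point
    (the agreed-to requirement / risk-appetite boundary) is reached."""
    record = []
    for f in flows:
        if f.target == element:
            record.append(f)
            if f.source not in stop_points:
                record.extend(lineage(f.source, flows, stop_points))
    return record

# Lineage of the regulatory report back to its originating systems:
for f in lineage("regulatory_report", flows):
    print(f"{f.source} -> {f.target}")
```

The `stop_points` parameter captures the “point that meets the agreed-to requirements or risk appetite” in the definition above: lineage rarely needs to go all the way back to every origin, only as far as the requirement demands.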

Requirement and Data Management Tool Alignment

Given these definitions, how do we use these tools to meet the requirements? Solving the holistic problem often requires additional tools outside the scope of the data management toolkit; the tools identified below are only those within the data management scope.

Data Management Requirement: Data Management Tools

Regulatory Lineage: Data Lineage
Data Protection (privacy): Data Classification, Data Flow
Data Protection (breach): Data Classification, Data Flow
Data Landscape Simplification: Data Classification, Data Flow
Aggregation of Similar Elements & Alignment of Meaning: Data Traceability
Chain of Custody: Data Provenance
Data Trust: Data Quality, Data Provenance, Provisioning Point
Suspicious or Illegal Activity Detection: Link Analysis, Data Lineage


Interestingly, Data Provenance isn’t used outside of Chain of Custody and Data Trust, which may surprise some. There is often a blurring of lines between Data Provenance and Data Lineage. The general rule I use to decide which one is appropriate is:

Q: Do you need to record a specific date and time that the data was processed/handled in an activity?

If the answer is yes, that indicates you need “Data Provenance”; if the timestamp and location of a specific instance of data are not required, you are most likely looking at “Data Lineage”.
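
A minimal sketch of that distinction (the field names below are my assumptions, not a standard schema): a lineage record captures the design-time movement between points, while a provenance record captures a specific instance, including when and by whom it was handled:

```python
# Sketch: Lineage describes the movement in general; Provenance pins
# down a specific instance in time. Field names are hypothetical.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LineageRecord:
    source: str          # e.g. "trading_system"
    target: str          # e.g. "risk_warehouse"
    transformation: str  # how the element is derived along the way

@dataclass
class ProvenanceRecord:
    dataset: str
    activity: str
    performed_at: datetime  # the specific date/time the data was handled
    performed_by: str       # actor/system - needed for chain of custody

lin = LineageRecord("trading_system", "risk_warehouse", "daily position load")
prov = ProvenanceRecord("positions_2024_01_31", "daily position load",
                        datetime(2024, 1, 31, 18, 5), "etl_batch_07")
```

Note that the lineage record has no timestamp at all: it is true of every run of the load, whereas the provenance record is only true of one.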


In summary, this article covers various data management scenarios that often come up and provides language and definitions for the data management tools that help solve those requirements. I hope we can now agree on the differences between Data Flows, Data Lineage, Data Provenance and Data Traceability.

Please post any comments on this LinkedIn page.

Author: Gareth Isaac - See my LinkedIn profile