Sunday, May 21, 2023

 Data Lineage

  • https://a-teaminsight.com/wp-content/uploads/2019/03/A-Team-Group_Data-Lineage-Handbook-2019-1.pdf
  • https://neo4j.com/blog/internal-risk-models-frtb-data-lineage/ 
  • https://www.marklogic.com
  • http://www.datamanagementinsight.com
  • https://www.mckinsey.com/~/media/mckinsey/business%20functions/risk/our%20insights/frtb%20reloaded%20the%20need%20for%20a%20fundamental%20revamp%20of%20trading%20risk%20infrastructure/frtb-the-need-for-a-fundamental-revamp-of-trading-risk-infrastructure-web-final.ashx
  • https://getmanta.com
  • https://www.ticksmith.com/use-case-ticksmith-data-pooling-platform-empowers-banks-to-pool-data-for-frtb
  • https://datacrossroads.nl/2019/03/10/data-lineage-101/
  • https://datacrossroads.nl/2019/03/13/data-lineage-102/
  • https://datacrossroads.nl/2019/03/17/data-lineage-103/
  • https://datacrossroads.nl/2019/03/20/data-lineage-104/

  • https://productresources.collibra.com/docs/collibra/latest/Content/CollibraDataLineage/TechnicalLineage/ref_technical-lineage-viewer.htm


Data lineage traces data from source to destination, noting every move the data makes and taking into account any changes to the data during its journey. 

Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption. This includes all transformations the data underwent along the way—how the data was transformed, what changed, and why.

Initially implemented without specific regulatory requirements to track data across individual data management projects, data lineage rose to prominence following the implementation of BCBS 239 in January 2016, a Basel Committee on Banking Supervision (BCBS) rule designed to improve data aggregation and reporting across financial markets, as well as accountability for data. This required improvements in data governance and data lineage that have since been reinforced by other regulations and financial institutions’ recognition of the importance of accurate, complete and sustainable data lineage.


What is data lineage?

Data lineage covers the lifecycle of data, from its origins, through what happens to the data when it is processed by different systems, and where it moves from and to over time. It can be applied to most types of data and systems, and is particularly valuable in complex, high volume data environments. It is also a key element of data governance, providing an understanding of where data comes from, how systems process the data, how it is used and by whom. It also plays well into improving data quality.

Data lineage is sometimes described either as technical lineage, which represents the flow of physical data through underlying applications, services and data stores, or as business lineage, which relies on the same underlying technicalities but is perceived as a driver of business intelligence and better business decisions.

By building a picture of how data flows through an organization and is transformed from source to destination, it is possible to create complete audit trails of data points, an aspect of lineage that has become increasingly necessary to meeting regulatory requirements and ensuring data integrity for the business.
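
As a rough illustration of such an audit trail, lineage can be held as a list of hops and walked backwards from a report to its sources. This is only a sketch: the system names, the `EDGES` list and the `audit_trail` function are hypothetical and not taken from any particular lineage product.

```python
from collections import defaultdict

# Hypothetical lineage hops: (source, destination, transformation applied).
EDGES = [
    ("trading_db.trades", "staging.trades", "nightly extract"),
    ("staging.trades", "risk_mart.exposures", "aggregate by counterparty"),
    ("ref_data.counterparties", "risk_mart.exposures", "join on counterparty id"),
    ("risk_mart.exposures", "reports.risk_summary", "format for reporting"),
]


def audit_trail(edges, target):
    """Return every hop that feeds `target`, walking upstream to the sources."""
    incoming = defaultdict(list)
    for source, dest, transformation in edges:
        incoming[dest].append((source, transformation))

    trail, seen, stack = [], set(), [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guard against revisiting nodes (and against cycles)
        seen.add(node)
        for source, transformation in incoming[node]:
            trail.append((source, node, transformation))
            stack.append(source)
    return trail


for source, dest, transformation in audit_trail(EDGES, "reports.risk_summary"):
    print(f"{source} -> {dest} ({transformation})")
```

Each printed line is one step of the trail, so a figure in the report can be followed back through every transformation to the systems it originated in.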

While data lineage helps to track data and identify different processes involved in the data flow and their dependencies, metadata management – the management of data that describes data – is key to capturing enterprise data flow and presenting data lineage. Data lineage solutions based on metadata collect and integrate consistent end-to-end metadata throughout an organization, and create a metadata repository that is accessible and makes complete data lineage information available to different user groups.

Data lineage is usually represented visually to show the movement of data from source to destination, changes to the data and how it is transformed by processes or users as it moves from one system to another across an enterprise, and how it splits or converges after each move. Visualization can demonstrate data lineage at different levels of granularity, perhaps at a high level providing data lineage that shows what systems data interacts with before it reaches its destination. As the granularity increases, it is possible to provide detail around the particular data such as its attributes and the quality of the data at specific points in the data lineage.
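
One way to picture these levels of granularity is to record lineage at attribute level and collapse it on demand. The sketch below assumes a hypothetical "system.table.column" naming scheme; `ATTRIBUTE_EDGES` and `roll_up` are illustrative names only.

```python
# Hypothetical attribute-level lineage: each node is "system.table.column".
ATTRIBUTE_EDGES = [
    ("crm.clients.client_id", "warehouse.dim_client.client_key"),
    ("crm.clients.country", "warehouse.dim_client.country_code"),
    ("warehouse.dim_client.client_key", "reporting.exposure.client_key"),
    ("warehouse.dim_client.country_code", "reporting.exposure.country_code"),
]


def roll_up(edges, depth):
    """Collapse lineage edges to the first `depth` parts of each dotted name.

    depth=1 gives system-level lineage, depth=2 table-level lineage, and
    depth=3 keeps the full attribute-level detail.
    """
    def truncate(name):
        return ".".join(name.split(".")[:depth])

    rolled = {(truncate(src), truncate(dst)) for src, dst in edges}
    # Drop self-loops created when both ends collapse to the same node.
    return sorted(edge for edge in rolled if edge[0] != edge[1])


print(roll_up(ATTRIBUTE_EDGES, 1))  # system-level view
print(roll_up(ATTRIBUTE_EDGES, 2))  # table-level view
```

Storing the finest level and deriving the coarser views means the high-level picture and the detailed picture cannot drift apart.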

The scope of lineage implementation is often determined by regulatory requirements, enterprise data management strategy, data impact and critical data elements of an organization. It is not necessary to boil the ocean, but instead identify regulatory requirements for data lineage and business areas to which its application is beneficial.

In many financial firms, users of data lineage include business managers and analysts, compliance professionals, strategy developers, data governance teams, data modelers, and IT management, development and support.

Importance of data lineage

Data lineage is critical to both regulatory compliance and business opportunity. From a regulatory perspective, compliance has been tightened up considerably since the 2008 financial crisis, with subsequent regulations designed to avoid a repeat of similar circumstances. Rather than merely producing reports for compliance, these regulations – including BCBS 239, Markets in Financial Instruments Directive II (MiFID II), General Data Protection Regulation (GDPR), Fundamental Review of the Trading Book (FRTB) and the Comprehensive Capital Analysis and Review (CCAR) – now require firms to implement data lineage to demonstrate exactly how they came to the results published in reports. Using data lineage, firms can not only prove the accuracy of results, but also take a proactive approach to identifying and fixing any gaps in required data.

Complete data lineage can also reduce the burden of regulation by providing operational transparency and reducing risk and costs. Its metadata can help firms consolidate regulatory reporting by identifying data that is used across numerous regulations and move towards processing the data once for multiple purposes. Similarly, metadata for data lineage can ease the burden and cost of implementing new regulations.
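
A minimal sketch of how lineage metadata might reveal data that is used across numerous regulations follows. The data element names and the `REPORT_INPUTS` mapping are invented for illustration and do not reflect any regulation's actual field list.

```python
from collections import defaultdict

# Hypothetical mapping of regulations to the data elements their reports consume.
REPORT_INPUTS = {
    "BCBS 239": {"counterparty_id", "exposure_amount", "trade_id"},
    "MiFID II": {"trade_id", "instrument_id", "execution_time"},
    "FRTB": {"trade_id", "instrument_id", "price_observation"},
}


def shared_elements(report_inputs):
    """Map each data element used by two or more regulations to those regulations."""
    usage = defaultdict(set)
    for regulation, elements in report_inputs.items():
        for element in elements:
            usage[element].add(regulation)
    return {
        element: sorted(regulations)
        for element, regulations in usage.items()
        if len(regulations) > 1
    }


for element, regulations in sorted(shared_elements(REPORT_INPUTS).items()):
    print(f"{element}: {', '.join(regulations)}")
```

Elements that appear in the result are candidates for being sourced and processed once and then reused for each reporting obligation.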

From a business perspective, and at a base level, data lineage helps firms stay on the right side of regulators and avoid the penalties of non-compliance. Equally important, it helps firms gain an understanding of their data and the impact on data of any changes to strategy, systems and processes. With an understanding of data, firms can gain the benefits of data lineage beyond compliance, including the ability to spot new business opportunities, make better decisions, increase efficiency and reduce costs.

Regulatory drivers

Regulations driving financial institutions to implement data lineage include those noted above and detailed here. The use of data lineage in each case is slightly different depending on the specific regulatory data requirement, but the overall theme is the same: to be able to demonstrate where data originated, trace its journey through an organization, and prove how it has been changed along the way.

BCBS 239

Basel Committee on Banking Supervision Rule 239 (BCBS 239) came into force on January 1, 2016 and is designed to improve risk data aggregation and reporting. It is based on 14 principles that underpin accurate risk aggregation and reporting in normal times and times of crisis. To achieve compliance, banks must capture risk data across the organization, establish consistent data taxonomies, and store data in a way that makes it easily accessible and straightforward to understand.

Data lineage requirement: Data lineage must be implemented to support risk aggregation, data accuracy and reporting. Conversely, it must also ensure that risk data can be traced back to its origin and that risk reports can be defended.

MiFID II

Markets in Financial Instruments Directive II (MiFID II) is a principles-based directive issued by the EU. It took effect on January 3, 2018, and aims to increase transparency across Europe’s financial markets and ensure investor protection. The demand for reference and market data for both pre- and post-trade transparency, including trade reporting and transaction reporting, is unprecedented, leading to data management challenges including sourcing required data, reporting in near real-time, and uploading reference and market data to MiFID II mechanisms including Approved Publication Arrangements (APAs) and Approved Reporting Mechanisms (ARMs).

Data lineage requirement: MiFID II operations can benefit from data lineage in a number of ways. Lineage can be used to identify any gaps in trade reporting data, and any similarities across numerous regulatory reporting obligations. It can also be used to map MiFID II reporting data from source systems to APAs and ARMs and vice versa.
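
To make the gap-finding idea concrete, here is a small sketch that compares the fields a report needs with the fields lineage has traced to a source. The field names, source attributes and both functions are hypothetical and are not the official MiFID II field list.

```python
# Hypothetical fields required by a transaction report (not the official list).
REQUIRED_FIELDS = ["trade_id", "instrument_id", "price", "quantity", "execution_time"]

# Hypothetical lineage mapping: report field -> source system attribute.
FIELD_SOURCES = {
    "trade_id": "oms.orders.order_ref",
    "instrument_id": "ref_data.instruments.isin",
    "price": "oms.fills.fill_price",
    "execution_time": None,  # catalogued, but no source identified yet
}


def reporting_gaps(required, sources):
    """Return required fields that have no traced source."""
    return [field for field in required if not sources.get(field)]


def fields_by_source(sources):
    """Reverse the mapping: source attribute -> report fields it feeds."""
    reverse = {}
    for field, source in sources.items():
        if source:
            reverse.setdefault(source, []).append(field)
    return reverse


print(reporting_gaps(REQUIRED_FIELDS, FIELD_SOURCES))
print(fields_by_source(FIELD_SOURCES))
```

The first function covers the source-to-report direction, and the reversed mapping covers the "vice versa" case of starting from a source system and asking which reported fields depend on it.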


GDPR

General Data Protection Regulation (GDPR) is an EU data privacy regulation that came into force on May 25, 2018. It is designed to harmonize data privacy laws across Europe and protect EU citizens’ data privacy. The requirements of GDPR include gaining explicit consent to process personal data, giving data subjects access to their personal data, ensuring data portability, notifying authorities and individuals of data breaches, and giving individuals the right to be forgotten.


Data lineage requirement: Firms subject to GDPR are dependent on data lineage to track data and provide transparency about where it is and how it is used. Data lineage provides the ability to demonstrate compliance with the regulation and, from a data subject’s perspective, supports access to personal data and the execution of other rights such as the right to be forgotten.
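
As an illustration of how lineage can support a right-to-be-forgotten request, the sketch below walks downstream from one personal data attribute to find every place a copy may live. The attribute names, `EDGES` and `downstream_copies` are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical attribute-level lineage edges: (source, destination).
EDGES = [
    ("crm.customers.email", "marketing.contacts.email"),
    ("crm.customers.email", "warehouse.dim_customer.email"),
    ("warehouse.dim_customer.email", "analytics.churn_features.email_hash"),
    ("billing.invoices.amount", "warehouse.fact_revenue.amount"),
]


def downstream_copies(edges, origin):
    """Return every attribute reachable from `origin`, in breadth-first order."""
    children = defaultdict(list)
    for source, dest in edges:
        children[source].append(dest)

    found, seen, queue = [], {origin}, deque([origin])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


for location in downstream_copies(EDGES, "crm.customers.email"):
    print(location)
```

The result is the list of downstream locations that would need to be reviewed, alongside the origin itself, when a data subject asks for access or erasure.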


FRTB

The Fundamental Review of the Trading Book (FRTB) regulation was originally scheduled to take effect in 2022; the Basel Committee subsequently deferred the implementation date to January 2023, and adoption timelines vary by jurisdiction. It is a response to the 2008 financial crisis, which exposed fundamental weaknesses in the design of the trading book regime, and focuses on a revised internal model approach to market risk and capital requirements, a revised standardized approach, a shift from value at risk to an expected shortfall measure of risk, incorporation of the risk of market illiquidity, and reduced scope for arbitrage between banking and trading books. Its data management challenges include data sourcing, facilitating capital calculations, and gathering historical data as well as real price observations for executed trades, or committed quotes, to meet requirements around non-modellable risk factors (NMRFs) and the linked risk factor eligibility test.

Data lineage requirement: To satisfy the demands of FRTB, data lineage may be needed to track historical data and trade data aggregation required for the risk factor eligibility test of NMRFs, essentially the provision of at least 24 real price observations of the value of the risk factor over the previous 12 months.
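
The observation count described above can be sketched as a simple check. This covers only the "at least 24 observations in the previous 12 months" criterion as stated here, not the full regulatory test. Treating 12 months as 365 days and counting at most one observation per day are simplifying assumptions, and the function name is hypothetical.

```python
from datetime import date, timedelta

MIN_OBSERVATIONS = 24          # threshold quoted in the text above
WINDOW = timedelta(days=365)   # assumption: "previous 12 months" taken as 365 days


def passes_observation_count(observation_dates, as_of):
    """Check whether enough real price observations fall inside the window."""
    window_start = as_of - WINDOW
    # Assumption: count distinct dates, so repeated records for the same day
    # are not double counted.
    in_window = {d for d in observation_dates if window_start < d <= as_of}
    return len(in_window) >= MIN_OBSERVATIONS


as_of = date(2023, 5, 21)
fortnightly = [as_of - timedelta(days=14 * i) for i in range(30)]
monthly = [as_of - timedelta(days=30 * i) for i in range(13)]

print(passes_observation_count(fortnightly, as_of))
print(passes_observation_count(monthly, as_of))
```

Lineage matters here because each date fed into the check has to be traceable to an executed trade or committed quote in a source system.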

CCAR

The Comprehensive Capital Analysis and Review (CCAR) is an annual exercise carried out by the Federal Reserve to assess whether the largest bank holding companies (BHCs) operating in the US have sufficient capital to continue operations throughout times of economic and financial stress, and have robust, forward-looking capital planning processes that account for their unique risks. From a data management perspective, CCAR requires data sourcing, analytics and risk data aggregation for stress tests designed to assess the capital adequacy of BHCs and for regulatory reporting purposes.

Data lineage requirement: CCAR requires attribute level data lineage to track data from source to destination and ensure the validity and veracity of capital plans. Data lineage can also be used to identify any data gaps in reporting and highlight any data quality issues.

Supply and demand

Over the past few years, a number of established data management vendors have brought data lineage solutions to market, as have start-ups and young companies dedicated to lineage. Some take a technical approach, others a business approach, but their common challenge is to meet growing market demand for automated data lineage that can cross complex data environments, ensure regulatory compliance and deliver business benefit. On the demand side, recognition and adoption of data lineage has tracked increasing regulation since the financial crisis. Few firms can claim complete and entirely successful systems, but most have developed a regulatory response that is beginning to yield operational and business benefits.

Unsurprisingly, most progress has been made at Tier 1 banks and other large organizations subject to extensive regulation and with the resources to implement data lineage, although all firms that want to stay in the game are likely to need data lineage across some aspects of their business going forward.

Challenges and opportunities

Overview

Like most data management programs, data lineage includes inherent challenges and potential opportunities. The challenges range from winning management buy-in for initial projects to understanding and tracking huge volumes of data with complex links across a big data environment. The opportunities range from improved data quality to better decision making and identifying business opportunities.

Challenges

The challenges of data lineage tend to fall into three buckets – operations, technology and data management – and while many are ongoing pain points for data managers across all sorts of programs, some are specific to data lineage.

Operational challenges

The operational challenges of data lineage start with winning management buy-in and funding for a solution that can be expensive, requires significant human input, and offers only a modicum of advantage in early implementation. Poor understanding of data lineage and its potential benefits by senior executives can stymie approval, while the prospect of lengthy and complex projects could be enough to bring the shutters down.

Questions to consider at the outset of a data lineage project include:

  • Where are we now, why do we need data lineage?
  • What extent of lineage would be optimal?
  • How can we win management buy-in?
  • Do we need a champion for data lineage?
  • How much will it cost now and going forward?
  • How much can we do with allocated budget?
  • Do we have required skills internally?
  • What are the internal cultural issues of data lineage?

You can begin to answer these questions by ensuring senior management understands the importance of data and benefits of data lineage, and starting small. Decide whether a pilot project is going to provide insight into business processes or achieve an element of regulatory compliance, prioritize the most important and relevant data, scope the project carefully, and identify stakeholders that should be involved.

In the first instance, it may be useful to assess where required data comes from manually and create baseline data lineage before considering automation. It is also important to make sure a pilot project is scalable and could include additional data or other areas of the organization before making a business case.

Proving the concept of data lineage and demonstrating quick wins to the business should, at least in some cases, be enough to start the journey towards a larger data lineage program spanning part or all of the organization.

While a good start to any data management project means it should gain momentum, the success of data lineage is particularly dependent on people and their approaches. It takes a range of data and metadata management skills to develop and maintain data lineage, but if data producers and consumers don’t see its value, they are unlikely to fall in with the cause and follow carefully created data lineage processes. These producers and consumers need to look beyond their own environment and understand how the organization can benefit from data lineage.

That is not to say that any data lineage will do. As data lineage can be expensive to build and manage, it is important to understand what level of data lineage users require. Depending on resources, it may or may not be possible to match extensive requirements, so the initial aim must be to build a data lineage solution that delivers value and is right-sized for consumers, with later iterations providing more detail around data and data flows.

Data ownership and accountability is an ongoing challenge that many organizations with huge amounts of data, myriad systems and applications, and little appetite among employees to take responsibility for data have failed to resolve. Data lineage isn’t a silver bullet, but by tracking data and showing how it is used and by whom, it does add some clarity to data and allows responsibility for specific areas of data to be allocated to their rightful owners.


Technology challenges

The technology challenges of data lineage reflect growing numbers of regulations with overlapping lineage requirements and smarter auditors and regulators asking for responses to questions on demand. Advances in technology add to the challenge, with cloud-based applications and services, and big data systems – not to mention emerging machine learning, artificial intelligence and natural language processing technologies – creating a complex data infrastructure. Data can be managed in new and interesting ways, but keeping track of it and ensuring it can be trusted is increasingly difficult.

At the heart of addressing these challenges, and a challenge in itself, is the selection of a solution, or solutions, to support an organization's data lineage. Early implementations of data lineage were often built in-house as few vendor solutions were available. More recently, many firms have moved to hybrid in-house and vendor solutions, or migrated entirely to vendor solutions as data lineage has advanced towards becoming a commodity.

Whether you plan to build or buy, these questions are worth considering before final decisions are made:

  • How much lineage is already in place?
  • To what extent will manual lineage continue to be necessary?
  • How will lineage be documented?
  • How will it need to be scaled? 
  • How will impact assessment be managed?
  • What is the long-term aim for automation?
  • Which areas of the organization will be covered and at what level in terms of technical and business lineage?
  • How will data lineage be sustained?
  • What skills will be required?
  • How much will it cost?


There are no catch-all answers to these questions and few organizations that will find answers to all the questions in one solution, leading many to implement a combination of in-house developed and vendor deployed solutions.

Whatever the selected solution, however, it will not provide value in isolation. It is important to consider how data lineage and its metadata will integrate with the rest of an organization's business metadata as this will provide rich data and the ability to slice and dice the data. Lineage also needs to run alongside an organization's systems development lifecycle plan to ensure it is maintained as technologies are changed. And, of course, scalable and flexible technology is essential, not only to master growing volumes of existing data types, but also to embrace additional datasets, alternative data, data resulting from mergers and acquisitions, and data that we have yet to discover.

Data management challenges

Implementing data lineage is a complex data management task that could include huge volumes of data, the creation of metadata, multiple legacy systems, mountains of spreadsheets, disparate systems, siloed data, uncharted data flows and mixed data formats.

The potential impact of regulatory change must also be assessed, data quality considered, and manual processes brought into the lineage framework.

Big data, data lakes and repositories raise issues around how data is stored, tagged and linked to other data and systems, while outsourced data and automated data feeds need to be mined and brought into the data lineage scheme.

Data management questions that need to be considered before data lineage is implemented include:

  • Is all the data valuable?
  • Is the data duplicated?
  • Is some of the data redundant?
  • Is the data internal?
  • Is the data external and correctly licensed?
  • What tools are required to find answers to these questions?


Reflecting these questions, an early inventory of an organization's data can start the process of identifying which data is important to the business and should be part of a data lineage program, which data can be left as is, and which data can be scrapped. Data in legacy systems and black boxes will be difficult, if not impossible, to capture, as will data that changes continually but not consistently.
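
A first-pass inventory of this kind could be sketched as below, flagging exact duplicates by content hash and datasets with no consumers. The dataset names, the sample contents, the consumer counts and the classification labels are all invented for illustration, and a real inventory would rely on richer metadata than this.

```python
import hashlib

# Hypothetical inventory: dataset name -> (content sample, downstream consumers).
INVENTORY = {
    "finance.ledger_2023": (b"id,amount\n1,100\n2,250\n", 4),
    "finance.ledger_copy": (b"id,amount\n1,100\n2,250\n", 1),
    "archive.old_campaigns": (b"id,name\n7,spring\n", 0),
}


def classify(inventory):
    """Label each dataset as in scope, an exact duplicate, or unused."""
    first_seen = {}  # content digest -> first dataset seen with that content
    labels = {}
    for name, (content, consumers) in inventory.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest in first_seen:
            labels[name] = f"duplicate of {first_seen[digest]}"
        elif consumers == 0:
            labels[name] = "unused: candidate to scrap"
        else:
            labels[name] = "in scope for lineage"
        first_seen.setdefault(digest, name)
    return labels


for name, label in classify(INVENTORY).items():
    print(f"{name}: {label}")
```

Hashing only catches byte-for-byte copies, so near-duplicates and redundant derived datasets would still need manual review.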

Considering the scope and scale of these data management challenges, particularly in large organizations, data lineage utopia is not in sight, but there are tools and solutions that can break the back of implementation and provide a sturdy platform on which to build and maintain data lineage that can provide useful and timely information to the business.

Data Lineage related regulations:

  • BCBS 239
  • Fundamental Review of the Trading Book (FRTB)
  • Markets in Financial Instruments Directive II (MiFID II)
  • General Data Protection Regulation (GDPR)
  • Comprehensive Capital Analysis and Review (CCAR)

