Data Analysis and Data Integrity – part #1

A common refrain that I hear is, “We can’t rely on the data because it does not have integrity.” This raises a couple of questions in my mind and should in yours as well.  First, what is management using to produce its reports and make decisions?  Second, how accurate does your data have to be to allow you to perform analytics and arrive at valid recommendations/conclusions.  The obvious answer to the first question is, “the same data you do not feel has sufficient integrity to be used for analytics.”  The answer to the second question is the ubiquitous, “Depends.”

A main role for internal audit is the assessment of the integrity of information for management decision-making.  One of the requirements for effective risk mitigation  is  complete, accurate and timely access to information that can be used to drive management decision making.  Management is producing reports with the data, and you can identify integrity issues, then you have a valuable finding.  If management was previously unaware that there is incomplete, in-accurate, obsolete, or conflicting data, they will now question the confidence of their data-based decision making.  Even if management already knows that there are integrity issues and are using the data anyway, you can analyze the integrity issues to identify the root cause, control weaknesses, and to provide specific recommendations to improve the integrity. The improvement to the integrity will lead to more reliable and useful analytics – a valuable finding contribution to management decision making. 

Not being able to reliable on the data because of integrity issues is a finding in and of itself.  Identifying the cause of the integrity issues and making relevant recommendations in the cherry on top.  Therefore, you should never have concerns about whether you can use the data or not.  Employ analytics that will assess the integrity first and address these issues if they exist; then you are good to go.

As an aside, if you are using analytics to test IT controls, the existence of integrity issues is a likely sign that input, processing, or output controls are not working as intended.  I always check the reasonableness of key data fields such as invoice, payment, and entry dates.  Finding dates ten years in the future or past is typical and points to a lack of input date verification controls.

Before I discuss the “depends” issue, it is appropriate to outline the attributes of data integrity. While data quality refers to whether data is reliable and accurate, data integrity goes beyond data quality. Data integrity requires that data be not only accurate, but also complete, consistent, and in context. Data integrity is what makes the data useful to its owner.

Completeness – A data record, such as HR employee data, must be complete to satisfy the needs of all its consumers. For example, HR cares about whether the employee is active or not, as well as their location, while Payroll wants to pay the correct amount in a timely manner. Gaps in the employee data can hinder an organization’s ability to treat employees fairly.

Uniqueness – no duplicates.  Assets should not be recorded twice; employees should only be in the master data table once.  Duplicate records impact analysis and management decisions.

Timeliness – entry should be made as soon as possible after the event being recorded happens.  I have seen cases where vendor invoices sit on the certifying manager’s desk for months before being approved and sent to A/P for entry; changes to employee status that affect their pay but are not entered into the payroll system for weeks and weeks (resulting in under- or over-payments).

Accuracy – wrong or misleading data helps no one. The cause of inaccuracy has many factors including manual input errors, interface or mishandled conflicting data between sources, processing errors, as well as from ineffective analytics that miss or double count records.   (Aside: based on 35 years of performing analytics in a variety of organizations, I can confidently state that 20-30 percent of standard reports either do not report on what they say they do, or have inaccuracies, unstated filters, etc.  The result is inaccurate reporting.)

Validity  – refers to the accuracy of a measure (whether the results represent what they are supposed to measure). If you only have a single source of data, then it is likely to be consistent, but could be consistently wrong. To truly verify the data, it must be checked against its source, such as the vendor invoice.

Availability/Accessibility – accurate data must be provided to the proper functions at the right time when it is required. If management needs data today to make a decision, providing the data next week is not useful.  A lack of proper and easy access and retrieval is detrimental to the business.  It can result in delayed or inaccurate decisions.  Further, the data must be accessible in a format that allows it to be used for the ultimate purpose.  Proving a printed report with thousands of pages, does not support additional drilldown and analysis.

Traceability – data must be properly maintained throughout its life cycle – from data capture to processing, storage, retrieval, and safeguarding.  Master tables should be supported by a change transaction file detailing creation, update, and deletion activities.

Consistency/Reliability – refers to whether the results can be reproduced under the same conditions.  If you apply the same filter or perform the same analysis, you should get the same results. However, consistent data also needs to be valid to be of use.  Consistently retrieving 6.500 is useful, but if the actual data is 65.00 then we have a validity issue.

Context – data may be accurate, reliable, and accessible, but without proper context it lacks meaning.  Does “555-2381” represent a phone number, an address (suite and street number), or something else?  Without content data is just bits and bytes, not useful information.

The running of analytics not only requires the existence of the previously mentioned attributes, but also can be used to test each of the attributes. 

  • mandatory fields can be tested for completeness – valid, not blank data
  • uniqueness is often a simple duplicate test (with the appropriate criteria)
  • timeliness can be assessed by calculating the time between actions such as invoice date to invoice entry date, invoice entry date to payment date, employee hire date to paid date, etc.
  • accuracy and validity can be tested by comparing values to source data, or at various points such as input to output, or reperforming the processing of the data e.g., recalculate customer balances.
  • availability can be tested by simply accessing the data.  Can it be retrieved in a format that supports your requirements and in a timely manner.
  • traceability can be assessed by comparing input and outputs values.
  • consistency can be tested by performing continuous monitoring and looking at trends
  • context can be verified by comparing to source to see if the IT defined context agrees with the source data.

Typically, when I am accessing an IT system forth e first time, I spend mote effort on assessing the data integrity, particularly the accessibility, completeness, accuracy, and context, including my understanding and interpretation.  The more experience I have with using the data, the more reliability I can place on the analytics I perform.

Part #2 of this article will discuss the use of analytics to identify and improve data integrity.

Dave Coderre, CAATS

Leave a Reply

Your email address will not be published. Required fields are marked *