Data Analysis and Data Integrity – part #2

Part #1 looked at why assessing data integrity is an important, value-added activity that supports management decision making.  Part #2 examines things you can do to improve the integrity of the data and your analytic results. 

While I will be looking at data integrity primarily from a data analytics perspective, it is important to recognize that many people play a role in the integrity of data.  Ensuring data integrity is a complex activity that requires coordination from different aspects of the organization.  From the collection, input, processing, maintenance, security, and use it touches upon many people, processes, and technologies.

  • People – proper training regarding collection, input, handling, understanding , interpretation, and reporting of results
  • Processes – ensuring input, processing, and output maintain data’s reliability, consistency, availability, and accuracy.
  • Technology – authorization, input and change controls, maintenance and back-up, processing, and security.  Includes things like, unique keys for data base records, relationships between data sets, controls over who can create, modify, and delete data. 

To preserve data integrity and minimize risk of invalid or inappropriate analysis reults, at a minimum, the following should be in place:

  • Validate Input: When obtaining or extracting your data set from a known or unknown source you should require input validation. That data should be verified and validated to ensure that the input is accurate.
  • Validate Data: certify that your data processes haven’t been corrupted. Identify control totals key attributes that are important to your organization before you validate the data.
  • Remove Duplicate Data: valid and invalid duplicate records can be in your data. For example, data base often contain records that are marked for deletion.  If the data base hasn’t been compressed the extract routine may include these ‘deleted’ records.  The extracted data may include sub-total and totals that can affect your analytics.  All data should be carefully examined, and unwanted duplicates removed.
  • Back up Data: In addition to removing duplicates to ensure data security, data backups are a critical part of the process. Not only is backup necessary to prevent data loss, but it can also ensure that analytics that are being run again are using the same source data. If your analytics software is access the system data rather than an extract, the data will be updated on a regular basis and this will affect the consistency of your results.
  • Access Controls: your analytic data and the analytics code should be protected from unauthorized and unwanted users.  Many organizations share their analytics to improve efficiency and effectiveness, but access and modifications to the analytics code should be controlled.  Any changes should be properly tested and validated before the analytics are updated.
  • Always Keep an Audit Trail: when analytics form an integral part of the identification and assessment of critical findings, the source data, analytics, and results should be carefully maintained.  Not only will this allow for verification of the findings, but also the re-running of the analytics to assess the adequacy of management action to provided recommendations.  The analytics can also form the basis of a continuous auditing or monitoring program.

Common sources of errors that will impact your analytics, and methods to test for them, include:

  • Entry error – examples of entry errors I have seen include dates that are far in the past or future, unit prices where the stock number was entered instead of the price (Unit price > $1M when it should have been $1.31),  and invoice amounts where the invoice number was entered as the amount (vendor with a usual invoice range of $500 – 2,000 has an invoice for 2,386,451.00).  Analytics include checking that dates are valid dates (e.g., not 31 Feb 2022) and are reasonable (e.g., birth date not 21 Jan 2522); verify that mandatory fields are not blank or null;  and look for anomalies.
    • Since date fields are often critical, you should validate that the data is a valid date and check for reasonableness. 
    • Many date fields should reflect actions that have taken place so dates should not be in the future.  Financial transactions are often associated with a fiscal year, so transaction dates should fall within the fiscal year. 
    • Large differences between dates (e.g., invoice date and invoice entry date) may be an indication of incorrect entry.
    • Run statistics on amount fields to see smallest, largest, and average values
    • Summarize on key fields to identify all possible values and verify records where the value is unusual.
  • Unintended transfer errors during upload/download – examples I have seen include truncated downloads (often because of Excel record maximums), missing records, long text fields that contain numbers being treated as numeric and leading zeros lost, incompatible format (e.g., EBCDIC data not converted prior to download), incorrect file selected (often wrong date).  Analytics include verify record count and totals of key number fields between source and downloaded data. I would also do a manual comparison of a print of the first 100 records values for all fields (source to download)
  • Incorrect formatting of data – examples I have seen include different format for date field (e.g., source is DD/MM/YYYY and extract is read as MM/DD/YYYY), incorrect number of decimals (source has 3 decimal places and extract is read with two decimals), missing or incorrect order of fields, and text fields are read as numeric.  Analytics include verifying control totals for numeric fields, and validity and reasonableness of dates.  I would also do a manual comparison of the source data format and the extracted data format.
  • Processing error – processing errors can occur in the application or within the analytics being run by audit.  Examples I have seen include test data included in standard reports, failure of reports to apply appropriate filters, failure of reports to include all periods for a fiscal year, including reversed or cancelled transactions (e.g., reversed invoices), double counting (e.g., weight of vehicle was included in weight of household effects when determining move costs), and incorrect use of AND and OR logic.  The primary analytics are parallel simulations and reperforming of system processing.  I also encourage a review of all significant analytics by a colleague.
  • Incorrect Interpretation of results – identifying different pay transactions (e.g., regular pay and overtime) as duplicate pay transactions, . A careful review of the results – do they state what you were trying to identify) and verification with source and by client prior to resulting results.
  • Analysis error – examples I have seen include a failure to include overtime records when validating a contractor’s payroll costs (falsely identifying a $2.3M overcharge for salaries), and failing to properly identify reversing entries, and identifying partial payments as duplicates. As above, a fulsome review of the analytics and verification of the results is required.

When analytics form a critical part of the identification and assessment of audit objectives and lead to significant, reportable finds, the source data must support the audit objective and have sufficient integrity.  In addition, the analytics must be properly designed, implemented, and validated.

The next article will describe steps you should take when obtaining data extracts to ensure that you have the required data in a format that will allow for subsequent analysis.

Leave a Reply

Your email address will not be published.