CLASSIFY with Statistics Option

I have been using ACL software for more than 25 years – back before the Windows version even existed.  It wasn’t the first audit software that I had used, and I was pretty sure it wouldn’t be the last audit software I would use.  Previously I had used AutoAudit, Audit Analyzer, Decision Analyzer, IDEA, ReportWriter, Easytrieve and various spreadsheet packages (Mulitplan, Visicalc, and Lotus1-2-3).  I told myself that I would give ACL a try until it couldn’t do what I needed and I found another package did.  As I mentioned that was more than 25 years ago – and I am still putting ACL through its paces and looking at other software.

Sticking with a particular software product for this length of time has its advantages.  You build up a certain level of expertise (mainly by trying, failing and trying again).   I must have about 10,000 scripts to do various analyses that I have performed in the past.  However, it also has a distinct disadvantage: you have to keep up with the new functionalities being released.  This has become more of an issue of late as major releases seem to be occurring every six months.  I still have not spent the time and effort to understand the REGEX() and other REG-type functions.  The new interfaces and ODBC connectors, Python and R, even EXECUTE (A return of the RUN functionality) are challenging me to keep abreast with the additional power and flexibility of the AN product.

In addition, if you throw in AX and GRC – with working papers, exception reporting and monitoring, storyboarding, etc. – I am hopelessly behind in my use of ACL.  Fortunately, I have retired – well at least semi-retired – and can rest on my laurels.  I only take jobs that require knowledge of AN and in areas like REGEX() I can post to the Peer Community and get instantaneous help from around the world.

In 2001, I published a book called the “Fraud Toolkit” which I updated in 2009 under the title “Fraud Analysis Techniques Using ACL”.  The books included more than 50 scripts to search for fraud and a self-study course on scripting with data and exercises.  At the time, certain functionalities did not exist with the ACL command set, so I built scripts to perform the desired analysis.  My scripts included: Crosstab, Bendford Analysis, Frequently Occurring Numeric Values, and Min/Max Ratio analysis.  Since then ACL has added these functionalities.  Crosstab – added 2010; Benford analysis – added 2010; Frequency of numeric value (Classify on a Number) – added 2015; and Min/Max ratio – added 2015. (Note: some of the dates may be incorrect – hard to remember exact versions when ACL commands were added, but it was years after the toolkit was developed).  Many of you are aware of Crosstabs and Bendford commands in ACL, but you may not have recognized the addition of the Min/Max ratio.

The Min/Max ratio was accomplished by a script that determined the minimum, maximum and average value for any unique value of a character field.  For example, the minimum, maximum and average value of credit card transactions for each credit card number.  The script was not particularly complicated, but it did involve the use of GROUP-END logic.  Recently (2015) ACL added the ability to find the Minimum, Maximum and average values when you perform Classify, Summarize and Stratify.

This is the functionality I would like discuss, with examples, today.  More than 15 years ago I thought it was a useful analysis and devoted hours to developing my Min/Max script (I also had a Max/Max2 script).  Now it is available to all users, but like me, you might not be aware of the functionality of all enhancements.

In my first example, I will look at invoices paid to vendors.  In my case, the file I was working with was fairly small – less than 18,000 invoices paid to 6,000 different vendors.  As in the past, I performed various analyses such as STATISTICS to find the min, max and average invoice amount; STRATIFY to look atthe ranges of invoice amounts; SUMMARIZE to identify the vendors with the largest number and total dollars of invoices; and of course DUPLICATES.  All of these commands have provided enormous value and help me to understand our business and the A/P process.  For example, STRATIFY showed me that 63% of the invoices were for amounts less than $500.00 and totaled less 0.5% of the total amount.

Stratify on Amount

This caused me to question the efficiency of our p-card program for small dollar purchases (e.g. under $500).  It also show me that a small number (59) of invoices were for large amount (>$1M) and accounted for 83.47% of the total dollars – guess which invoices we were going to review?

But the addition of the Statistic option in the CLASSIFY command provided insight into the 6,000 vendors.  This was something that was not available before, unless you had my Min/Max ratio analysis script.

Classify on Vendor with Statistics Option

There were 13 vendors where the total number of invoices was greater than 10 and the amount of the maximum invoice was more than 75% of the total amount of invoices.  In other words, one invoice was significantly larger than the other invoices for the same vendor.

Looking at the second vendor (#241380), the largest invoice amount is $76,550.00; the average amount is $4,759.63: and the maximum invoice is 89% of the total for that vendor and more than 16 times the average amount.  Upon review we determined that the actual invoice amount was $765.50 and the amount had been entered incorrectly.  The same was true for vendor #301549 with a maximum amount of $16,975.00 (78% of the total) which should have been 1,697.50.

Vendor #333270 was a different problem – the $94,333.09 was actually owed to vendor #333207.  When this vendor complained that they had not been paid, we paid them with a late payment interest penalty.  The system did not recognize it as a duplicate because the vendor numbers were different.

The three vendor results below represent another type of problem.  Either we paid the wrong vendor or paid the wrong amount – but the vendor returned the check.


In each of the above cases, you can see that the minimum ABS(minimum amount) = maximum amount e.g. Vendor #276053 had a minimum of ($103,000.00) and a maximum of $103,000.00.  The vendor was not owed $103,000.00 and returned the check.

I hope that you can see the value of the Man/Max functionality.  But just to make sure, let me give you some additional examples of where I have used it:

  • Payroll – Classify on Job Category Subtotal Amount – identified people in job categories that were being paid incorrect (inflated) amounts (payroll fraud)
  • Medical Claims – Classify on Procedure Code Subtotal Amount – identified doctors who were charging 2-3 times the normal amount for medical procedures (medical claim fraud)
  • P-Card – Classify on Card Number Subtotal Amount – identified cardholders with a single large transaction (cards had been compromised)
  • P-Card – Classify on cardholder position Subtotal Amount – identified unusual transactions for admin clerks (a clerk was using the p-card for personal purchases)
  • Payroll Overtime – Classify on Job Category Subtotal Amount – identified categories where employee were not entitle to, but were receiving, overtime and individuals who were making false overtime claims
  • Contracting – Classify on Contract Type Subtotal Amount – identified sole source contract that should have been competitive
  • Contracting – Classify on Supplier Subtotal Amount – identified cases were certain suppliers received large contracts at year-end – bypassing procurement controls
  • Journal Vouchers – Classify on Period Subtotal JV Amount – identified abnormal JV’s just prior to production of quarterly financial statements.
  • Vacation – Classify on Employee Type Subtotal Vacation Days – identified an employee who coded 40 hours of vacation as 5 (i.e. days) and the system only subtract 5 hours from his vacation balance. Also identified people who were claiming large amounts of vacation.  System controls were not working effectively and employee could take more vacation days than they had in their balance (i.e. ended with a negative balance).

So, even though I had not kept up worth the enhancements offered by the more recent versions of ACL, I encourage you to do so; and I hope you can see the benefit of the Statistics option on Classify and/or Summarize.  I also hope that you post examples of where you have used this command parameter.

Extract Records Matching Criteria in another File

A common problem that presents itself when doing analysis is the need to extract detail records that match a list of values (criteria) in another file. This can range from simple to more complicated, primarily depending on the nature of the match.  The following describes three approaches, based on the objective of the analysis.

Simple question:

A simple question of this type is where we have a file with detailed transactions that we want to compare to a file that has a list of values (e.g. Payroll).  If we have a file of all pay transactions (details) and a list of employees (Employee file) who should be paid (criteria) and we want to know if every pay transaction has a matching employee record, then you can do a JOIN or RELATE with the detailed payroll transactions as your Primary (or Parent) file.  ACL allows you to do a many-to-one JOIN or RELATE.  This will give you all of the detailed payroll transactions where there is a matching employee (Matched Primary); or you can get all of the payroll transactions where there is no matching employee (Unmatched Primary).  Finally, you can also obtain both matched and unmatched (All Primary) – which is what the RELATE will give you as well.

If your results contains both match and unmatched pay record, to determine whether you have a matched record or an unmatched record in your results, simply apply a filter.

First to identify the matched records: this is the case where the employee number from payroll details (EmpNo) has a match with an employee number (EmpNum) from the employee file.

  • In JOIN it would be the EmpNo is the same as the EmpNum – filter: Empno = EmpNum
  • In a RELATE it would be similar, but since the Parent and Child files are physically separate, we need to include the table name in the filter for child fields – filter: EmpNo = Employee.EmpNum

Next to identify the unmatched records: this is the case where the employee number from payroll details (EmpNo) does not have a match with an employee number (EmpNum) from the employee file.

  • In JOIN it would be the EmpNo does not have a match with an Empnum – filter: Empno <> EmpNum
  • In a RELATE it would be similar, but again we need to include the table name in the filter for child fields – filter: EmpNo <> Employee.EmpNum

More Complicated Question:

The previous example is a many-to-one match.  What if the Secondary or Child file has duplicates records (records with the same key value).   For example, the Employee file had not only the employee number of everyone who should be paid, but also the rates of pay for different periods of time (From-To dates).  So an employee could have multiple records if their rate of pay changed.

The employee file might look like this:
Empnum        Pay rate           From Date          To Date
12345             1,200.00              Jan 1, 2016           Jul  31, 2016
12345             1,225.00              Aug 1, 2016          Dec 31, 2016
12345             1,250.00              Jan 1, 2017           Dec 31, 2017

The pay file might look like this:
   EmpNo                 Pay Amount               Pay Date
12345                    $1,200.00                      Jul 15, 2016
12345                    $1,225.00                      Aug 15, 2016
12345                    $1,225.00                      Sep 15, 2016
12345                    $1,250.00                      Oct 15, 2016
12345                    $1,250.00                      Nov 15, 2016
12345                    $1,250.00                      Dec 15, 2016

The objective might be to find not only people who were paid and should not have been (as in the previous example), but also people who were not paid the correct amount.  This means we have to match the pay details not only with the correct employee record (EmpNo = EmpNum) but also the pay period with the correct From-To dates.  That means that were have a many-to-many match.  We cannot use RELATE since it can only perform a many-to-one match.  JOIN, however, does have a many-to-many option.

So we would perform a JOIN where the pay transaction detail file is our Primary file and the Employee file (with the pay periods and rates) is our SECONDARY file.  A possible issue with a many-to-many JOIN is that the resulting file can be quite large.  For example, If I had 10,000 employees each with 26 pay records (260,000 detailed records); and an employee file where each employee had at least 10 sets of pay rates (10,000  * 10 = 100,000 employee records) and performed a JOIN many-to-many, the resulting file would have 2,600,000 records (260,000  pay records * 10 pay rate records per employee).

You have two options: perform the JOIN and then filter the results to find pay records where the Pay Date falls within the From-To Dates; or apply the filter to the JOIN command.  If the resulting file is not going to be overly large, I suggest the first option: perform the JOIN to a temporary file; apply the filter; and EXTRACT to your final results file.  However, you can also apply the filter directly to the JOIN command.  In either case, the filter will isolate the detailed pay transactions for each employee that has a matching record in the Employee file AND the Pay Date falls within the From-To Dates.

Temporary file: Filter IF BETWEEN(Pay_Date, From_Date, To_Date); and then EXTRACT to the final results.

Filter applied to JOIN command: JOIN MANY PKey EmpNo SKey EmpNum IF BETWEEN(Pay_Date, Employee.From_Date, Employee.To_Date) – note: you must include the table name for the secondary fields.

With either approach, the results will contain pay records where the matching pay rate is included.  You can now filter Pay_Amount <> Pay_Rate to identify people who were paid an incorrect amount based on the pay day and pay rate for that period.

Even More Complicated Question:

In essence, the “more complicated question” was still fairly simple because we were performing the JOIN on field that equalled the value in the Secondary file (i.e. where the key fields were equal: EmpNo = EmpNum).  It becomes “even more complicated” when the two files do not have key fields that match.  For example, I want to identify all detailed transactions where the description field contains one or more of list of keywords in my exceptions file (e.g. all P-Card transactions where the Merchant name contains a keyword such a bar, hotel, spa, casino, etc.).  I can’t perform a JOIN where Merchant Name equals the keyword value because they are not identical (“Dave’s Bar and Funspot” won’t match with the keyword “Bar”).

Instead I need to EXTRACT the records where the merchant name contains a keyword.  To do this manually, I would:

  1. Open the Keyword file
  2. Find the first keyword value (“Bar”)
  3. Open the P-Card transaction file
  4. Apply the filter FIND(“Bar”, Merchant_Name)
  5. EXTRACT all the records that meet the filter to a Results file
  6. Open the keyword file
  7. Get the next keyword value (“Hotel”)
  8. Open the P-Card transaction file
  9. Apply the filter FIND(“Hotel”, Merchant_Name)
  10. EXTRACT all the records that meet the filter to a Results file
  11. Continue until all keyword values have been selected and the filter and extract performed

Simple if you have a short list of keywords, but what if you have 1,000 keywords?  Why not have ACL do the keyword value selection, apply the filter and perform the extract – over and over again until all keywords have been applied?

The most common approach is to use a main script that calls a subscript which applies a filter and performs an extract of the records that meet the filter.  So if I have four keywords (bar, hotel, spa, and casino), the main script will call the subscript four times.  The first time the subscript will apply the filter FIND(“Bar”, Merchant_Name) and EXTRACT all the records that meet the filter to a Results file.  It then gets the next keyword value (“Hotel”) and calls the subscript again, applies filter FIND(“Hotel”, Merchant_Name) and EXTRACTs all the records that meet the filter to a Results file.  It continues to get keyword values until all the keyword values have been selected and the EXTRACTs performed.

Question #1: How many times does it call the subscript? Answer: as many times as there are keyword values – which can be determined by performing a COUNT on the keyword file.

Question #2: How does it get the 2nd, 3rd, and 4th keyword? Answer: by using a counter (v_Rec_ctr) and locating the record that equals the counter (LOCATE RECORD v_Rec_ctr).  The Counter is incremented each time the subscript is run (v_Rec_ctr = v_Rec_ctr + 1).  Assuming the P-Card file is called PCard_Trans and the Keyword file is called Key_Words which contains a field called Keyword_value (i.e. has “Bar”, Hotel”, etc.), the script looks like this:

***** Main script Extract_PCard Recs1 *****

COMMENT **** close all tables and delete previous results *************

OPEN Key_Words

COUNT – counts the number of records in the Key_Words file (i.e. number of keyword values)
v_Num_Keywords = COUNT1   – the number of times to run the subscript
v_Rec_ctr = 1   – this tells me which record in the Key_words file to read
v_keyword = BLANKS(30) – initializes variable to store the keyword value from the Key_words file – make sure you have enough blanks to hold the longest keyword value

DO Extract_PCard_Recs2 WHILE v_Rec_ctr <= v_Num_Keywords – executes the subscript until all keywords have been read.

OPEN PCard_Results – open final results file after subscript has finished

****** Script Extract_PCard Recs2 ********

OPEN Key_Words

LOCATE RECORD v_Rec_Ctr – positions ACL at the next Key_Words record (rec 1 the first time it is run)

v_Keyword = Keyword_Value – sets v_keyword to the value in Keyword_value (e.g. v_keyword = “Bar”)

OPEN Pcard_Trans

SET FILTER FIND(ALLTRIM(“%v_keyword%”), Merchant_name) –  (e.g. FIND(“Bar”, Merchant_name)

EXTRACT FIELDS ALL v_keyword as “Keyword” to PCard_Results APPEND – extract records meeting filter, plus a filed which contain the keyword value that caused the record to be extracted

V_Rec_ctr = v_Rec_Ctr + 1 – increments the record counter to get the next record in the Key_Words file.

Another option would be to combine your Keyword value values into a string (field) and use this field in your transaction file.  However, it depends on the length of each Keyword value – will all the keyword values fit into 10,000 characters when you include a “;” separator?

OPEN Key_words
V_Max_Ctr = COUNT1
v_Keyword_list = Blanks(10000)
v_found = F
v_ctr = 1

v_keyword_list = ALLTRIM(v_keyword_list) + ALLTRIM(Keyword_Value) + “;”

OPEN PCard_Trans

LOOP WHILE v_ctr <= v_Max_Ctr AND v_found = F
v_found = T IF FIND(SPLIT(v_Keyword_list, “;”, v_ctr), Merchant)
EXTRACT FIELDS ALL SPLIT(v_Keyword_list, “;”, v_ctr) AS “Match_Code” IF v_Found = T to PCard_RESULTS
v_ctr = v_ctr + 1
v_ctr = 1
v_found = F

OPEN PCard_Results



Other common examples of this type of analysis are:

  • Amount charged compared to master file with products and different prices at different times
  • Amount paid compared to Contract file – with products and different prices

Basically anything where you want to compare detailed transactions with a historical record of that includes changes (e.g. different prices for specific time periods).

There are others ways to accomplish the analyses discussed above, but in every case it must start with a good understanding of what you want to accomplish.  Such as:

  • Do I want to see all the records or just those that don’t match?
  • Do I have duplicates in the Secondary file or not?

I hope this has helped you to understand three ways to extract records meeting criteria that are contained in a second file.