Not being an expert in pretty much anything has never stopped me from taking ideas from others and adapting them to my needs. Typically, this has meant transferring a theoretical analytic to a specific – practical – use. In this case, I wanted to look at the use of sentiment analysis for fraud risk assessment.
As many of you may already know, I use a software package called ACL. I have been using it for almost 30 years – so it is my “go-to” tool. I wasn’t really sure about my ability to apply ACL to sentiment analysis, mainly because I would be analyzing free-form text files. Usually, I deal with more structured data; however, ACL was easily able to read and analyze the text files saved in .RTF format.
My other concern centered on the analysis itself. I had read several articles describing the analytics required to perform sentiment analysis. Many included algorithms that not only search for “negative” sentiment words, but ensure that these are not “negated”, for example “The practice is corrupt” vs “The practice is not corrupt”. I wondered about the value of the extra effort to determine if sentiments were stated in a more “positive” manner through negation. I also recognized that the English language is very flexible – even when used in a manner that is grammatically correct (which my emails are often not). Take for example the following statements and the sentiment:
- The practice is corrupt – Negative
- I have no doubt the practice is corrupt – Negative
- The practice is not corrupt – Positive
- The practice, corrupt when seen from the outside, is in fact not – Positive
- The practice is not really corrupt – Questionable
- I believe the practice is not corrupt – Questionable
As you can see, a “negative” can come before or after the sentiment word, and it may or may not reverse the sentiment. Dealing with negation would require many more months of study and a much more complicated algorithm than I was prepared to develop. But was it absolutely necessary?
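To make the difficulty concrete, here is a minimal Python sketch (an illustration only, not the ACL analysis described in this post). It counts the word “corrupt” in the six statements above and applies a deliberately naive negation rule: a negator within three words before the sentiment word. The `NEGATORS` list, the `find_hits` name and the three-word window are all assumptions made for the example.

```python
import re

# Assumed, deliberately tiny list of negating words.
NEGATORS = {"not", "no", "never"}

def find_hits(sentence, target="corrupt", window=3):
    """Return (occurrences of target, occurrences flagged as negated)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    hits = negated = 0
    for i, word in enumerate(words):
        if word == target:
            hits += 1
            # Naive rule: a negator within `window` words BEFORE the target.
            if NEGATORS & set(words[max(0, i - window):i]):
                negated += 1
    return hits, negated

statements = [
    "The practice is corrupt",
    "I have no doubt the practice is corrupt",
    "The practice is not corrupt",
    "The practice, corrupt when seen from the outside, is in fact not",
    "The practice is not really corrupt",
    "I believe the practice is not corrupt",
]

for s in statements:
    hits, negated = find_hits(s)
    print(f"{hits} hit(s), {negated} flagged as negated: {s}")
```

Every statement produces exactly one hit, so a plain word count treats them all alike. The look-back rule misses the fourth statement, where the “not” comes after the sentiment word, and it cannot tell the “Positive” third statement from the two “Questionable” ones.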
I pondered awhile and came to a significant realization based on the fact that I would not be analyzing individual texts at a point in time, but multiple texts over months. In my mind, this meant that I could look at trends. I also thought (without any research to determine if this were true or not) that the mere existence of negative sentiment words was worth measuring, even if the sentiment was negated. I believe that if the instances of “corrupt” in the text were increasing, this was an important measure even if the sentiment was “not corrupt”.
I wanted an analytic that would allow me to address questions such as: Am I seeing more instances of negative sentiments now than last month or last year? Are there specific texts that have a high occurrence of negative sentiment words? Is a particular employee using negative sentiment words more than others? What is the sentiment score (number of negative sentiment words / total number of words) in total by week/month; and by employee? While this would not be what many would consider to be a robust sentiment analysis, I believe it would be useful – and better yet, easily doable with ACL.
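The sentiment score described here is simple arithmetic, and a short Python sketch may help show how it rolls up by period and by employee. This is an analogue, not ACL code; the `documents` records, the names in them and the `sentiment_scores` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical per-document summaries:
# (creator, month, negative_word_count, total_word_count)
documents = [
    ("alice", "2024-01", 3, 200),
    ("alice", "2024-02", 9, 300),
    ("bob",   "2024-01", 1, 100),
    ("bob",   "2024-02", 2, 400),
]

def sentiment_scores(docs, key):
    """Score = negative sentiment words / total words, grouped by key(doc)."""
    negative = defaultdict(int)
    total = defaultdict(int)
    for doc in docs:
        group = key(doc)
        negative[group] += doc[2]
        total[group] += doc[3]
    # Skip empty groups to avoid dividing by zero.
    return {g: negative[g] / total[g] for g in total if total[g] > 0}

by_month = sentiment_scores(documents, key=lambda d: d[1])
by_creator = sentiment_scores(documents, key=lambda d: d[0])

print(by_month)
print(by_creator)
```

Summing the counts before dividing (rather than averaging per-document scores) keeps a very short text from carrying the same weight as a long one; whether that is the right choice depends on the question being asked.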
The analysis required the following steps:
- Obtain the data in either .txt or .rtf format (not Word or PDF documents)
- Import the text files into ACL
- Isolate the individual words
- Summarize the text to get a count for each word (retaining the document title and the creator)
- Obtain a list of Negative and Positive sentiment words
- Join my word count summary to the Positive/negative sentiment word list
- Produce a positive/negative score for each text, for each creator and in total for the week/month/year.
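The summarize and join steps in the list above can be sketched in a few lines of Python. This is a rough analogue under stated assumptions, not the ACL script used for this post: the four-word `SENTIMENT` lexicon stands in for a downloaded word list, and `score_document` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tiny stand-in lexicon (assumption); a real run would load a downloaded list.
SENTIMENT = {
    "corrupt": "negative",
    "fraud": "negative",
    "honest": "positive",
    "fair": "positive",
}

def score_document(text):
    """Count words, then join the counts to the sentiment lexicon."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)            # summarize: one count per distinct word
    totals = Counter()
    for word, n in counts.items():
        label = SENTIMENT.get(word)    # join: only lexicon words are scored
        if label:
            totals[label] += n
    return {
        "words": len(words),
        "negative": totals["negative"],
        "positive": totals["positive"],
    }

sample = "The deal looked fair, but the vendor was corrupt and the invoice was fraud."
print(score_document(sample))
```

In practice each result would be stored alongside the document title and creator so that the scores can later be totaled by text, by creator and by period.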
I was surprised to realize that none of the steps were particularly difficult. I used a table layout which treated the text file as a fixed length file with a record length of 1 byte. Then I concatenated the single bytes until I hit a non-letter value (anything other than A-Z) and extracted the result as a “word”; summarized on the “word” value in each document, retaining the document name and the creator; summarized on the “word” value across all documents; downloaded a negative/positive sentiment list from the internet; and ran my analysis.
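For readers without ACL, the one-byte-at-a-time idea can be imitated in Python. The sketch below is an approximation of the approach just described, not the actual table layout or script: it accumulates bytes while they are ASCII letters and emits a “word” whenever a non-letter is reached. The function name `extract_words` and the choice to upper-case the result are assumptions.

```python
from collections import Counter

def extract_words(data: bytes):
    """Yield upper-cased words by accumulating bytes until a non-letter appears."""
    current = bytearray()
    for byte in data:                      # iterating bytes yields one int per byte
        if 65 <= byte <= 90 or 97 <= byte <= 122:   # ASCII A-Z or a-z
            current.append(byte)
        elif current:
            yield current.decode("ascii").upper()
            current.clear()
    if current:                            # flush a word that ends the file
        yield current.decode("ascii").upper()

text = b"The practice is not corrupt. Not corrupt!"
word_counts = Counter(extract_words(text))
print(word_counts.most_common(3))
```

One side effect of splitting on every non-letter is that contractions break apart (for example, “wasn't” becomes WASN and T), which is harmless for matching against a sentiment word list but does inflate the total word count slightly.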
Sample output for a single text file: negative word count (image not reproduced).
So now I can start looking at trends to obtain a leading indicator of changing sentiments – positive or negative – by week, by month. I could even look at changing trends by type of document (internal emails, external emails, memos, etc).
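As a simple illustration of the trend idea, the following Python sketch computes the change in score from one period to the next. The monthly figures are made up for the example, and `score_changes` is a hypothetical helper, not part of the ACL analysis.

```python
# Hypothetical monthly scores (negative words / total words), oldest first.
monthly_scores = [("2024-01", 0.010), ("2024-02", 0.012), ("2024-03", 0.020)]

def score_changes(series):
    """Return (period, score, change from the previous period) for each later period."""
    return [
        (period, score, score - prev_score)
        for (_, prev_score), (period, score) in zip(series, series[1:])
    ]

for period, score, change in score_changes(monthly_scores):
    direction = "up" if change > 0 else "down or flat"
    print(f"{period}: score {score:.3f} ({direction}, {change:+.3f})")
```

The same calculation could be run separately for each document type or each creator to see where a change is coming from.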
The next step I would like to perform is to see whether certain document types (e.g. Outlook emails) can be read directly, without having to convert them to text documents, using ACL’s data connectors.
Is this a full-blown sentiment analysis? No. Did it take me less than an hour to develop with existing tools? Yes. Will it provide useful results? I think so – but only time will tell.
This article has 1 Comment
I love the idea that “the practice is not corrupt” claimed enough times is a sign that there might be issues. And if it is trending up? Even worse.
Of course, the Bard said it first. “The lady doth protest too much, methinks.”
On the other hand, if you run an automatic analysis of emails, you may also be picking up the release of a new season of House of Cards, the show about everyone’s most famous corrupt family. I imagine some emails are exchanged on that.