GenAI in HR Analytics – Strengths, Limitations, Workarounds — Part I
The world is abuzz with terms like Generative AI (GenAI), Large Language Models (LLMs), ChatGPT, and the like. There’s a bit of a mysterious aura surrounding these areas.
On the one hand, there’s something magical about a machine responding to our questions in a near-human manner.
On the other hand, people who have actually tried to use these tools for business decision-making have often seen them fall short of expectations.
This is one of a series of blog posts that will attempt to clear the haze around these areas from a practical standpoint.
The overall presentation style will be simple and direct, using scenarios to drive home the points, augmented by real-world screenshots.
Business Context — Voice of Employees (VoE)
Let’s pick Employees/HR as the domain/function of interest, mainly because most companies believe that their success is founded on their employees. HR Heads and Executives want to listen to what their employees are saying, and they typically elicit feedback in various shapes and forms. For any decently large company, manually going through all of that feedback is an uphill task. Listening to this VoE is one of the areas where LLMs like GPT can excel. Having said that, there are some practical challenges that we need to watch out for. Let’s dive in…
Challenge 1: Hallucination
Let’s say you collect a few employee reviews from your organization and send them to an LLM. The goal is to extract the topics the employees care about and the sentiments tied to those topics.
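Purely as an illustration, here is a minimal sketch of what such a call might look like, assuming the OpenAI Python client, with made-up reviews and a placeholder model name. This is not the exact setup behind the screenshots that follow.

```python
# Minimal sketch of the extraction call, assuming the OpenAI Python client.
# The model name, prompt wording, and sample reviews are all illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

reviews = [
    "Great learning opportunities, but the tech stack feels dated.",
    "Open culture and supportive managers.",
]

prompt = (
    "For each employee review below, list the topics mentioned and the "
    "sentiment (positive/negative/neutral) tied to each topic.\n\n"
    + "\n".join(f"- {r}" for r in reviews)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```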
The results come back from the LLM, and they look impressive at first sight. But then you dig in and start reading through them in detail. You notice certain topics identified by the LLM that do not make any sense.
To make it real, here’s a screenshot of a sample dashboard built using the results from the above exercise.
Let’s focus our attention on the topics pulled out of the reviews by the LLM engine. You see a topic called ‘bug and leaf’. Now what on earth is that?
You decide to read through the reviews from which the LLM picked up the topic ‘bug and leaf’, to double-check whether there are any typos in the reviews. There are none. A screenshot of the reviews is below.
This is a clear example of what is called ‘hallucination’. LLMs have a tendency to cook up topics that make no sense in the given context.
There are other such topics in the above results, like ‘Cloud Open Culture’.
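One simple guardrail, shown here purely as a sketch and not necessarily what the engine described later does, is to check whether the words of each returned topic actually appear in the reviews the topic was attributed to, and flag the ones that do not. The sample reviews and topics below are made up for illustration.

```python
# Sketch of a grounding check: flag topics whose words never appear in the
# reviews they were attributed to (e.g. 'bug and leaf'). Illustrative only;
# the sample reviews and topic-to-review mapping are made up.
def ungrounded_topics(topic_to_reviews: dict[str, list[str]]) -> list[str]:
    flagged = []
    for topic, reviews in topic_to_reviews.items():
        text = " ".join(reviews).lower()
        # Ignore short filler words; require at least one substantive topic word in the text.
        words = [w for w in topic.lower().split() if len(w) > 3]
        if words and not any(w in text for w in words):
            flagged.append(topic)
    return flagged

results = {
    "bug and leaf": ["Plug and play onboarding, great culture."],
    "open culture": ["Open culture and supportive managers."],
}
print(ungrounded_topics(results))  # ['bug and leaf']
```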
Challenge 2: Duplication of Topics
Another issue seen in such LLM results is the duplication of topics. Let’s look at an example set of results, and then we’ll understand why this happens.
In the above screenshot, the LLM treats topics like Technologies, Tech stack, Tech Integration, Tech Discussions, and Tech Choice as different. But practically speaking, they are all pretty similar from the perspective of summarizing employee feedback. If you were doing this exercise manually, you would simply group them into one topic, to avoid diluting your list of topics.
While this is obviously a problem, it can manifest itself in various damaging ways. For example, on the left side of the above screen, you have the prioritized list of topics that employees are talking about, positively as well as negatively. Any company typically wants to take action to improve the negative areas. A critical topic may not get enough ‘votes’ simply because the LLM distributed the votes for that topic across several similar topics.
This can lead to a wrong understanding of the feedback, and thereby to wrong decisions.
And why does this happen? An online LLM like GPT sets a limit on the number of tokens (roughly, words) you can send per request. So you cannot bundle hundreds or thousands of employee reviews and send them to the LLM engine in one shot. And when you send the reviews across multiple LLM calls, the topic name the LLM comes up with in each call can vary, even when the reviews are talking about the same thing. That is what is seen in the above screenshot.
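To make the batching concrete, here is a rough sketch (not the actual pipeline) of how reviews typically get split across multiple calls because of the per-request limit. The word-count budget here is an assumed placeholder; since each chunk is processed independently, the same underlying theme can come back under different names.

```python
# Sketch: split reviews into chunks that fit a per-request word budget.
# Each chunk goes out in a separate LLM call, and because the calls are
# independent, the same theme may come back as "Tech stack" in one call
# and "Tech Integration" in another.
def chunk_reviews(reviews: list[str], max_words_per_call: int = 2000) -> list[list[str]]:
    chunks, current, count = [], [], 0
    for review in reviews:
        n = len(review.split())
        if current and count + n > max_words_per_call:
            chunks.append(current)
            current, count = [], 0
        current.append(review)
        count += n
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be passed to the extraction prompt shown earlier.
```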
Challenge 3: LLM ignoring key but less frequent topics
A third practical issue seen with LLMs is the problem of glossing over topics that may be less prominent in the Foundation Models underlying the LLMs, but are highly relevant for the employees of a specific organization. LLMs typically skip such topics in the results they send back.
For example, in many companies, employees would love to work at their client’s location, typically called ‘onsite’, for various reasons. The screenshot below shows what happens when the HR department searches for this topic in the results from the LLM.
The search does not yield any results, since this was not a key topic in the foundational LLM model and got left out of the output.
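One possible workaround, sketched here under the assumption that the organization maintains its own seed list of topics and keywords (the list below is hypothetical), is a simple keyword pass over the raw reviews so that anything the LLM skips can still surface.

```python
# Sketch: recover organization-specific topics the LLM skipped by scanning
# the raw reviews against a seed list of keywords. The seed list and
# keywords are hypothetical examples.
SEED_TOPICS = {
    "onsite": ["onsite", "client location", "deputation abroad"],
}

def recover_seed_topics(reviews: list[str]) -> dict[str, list[str]]:
    hits: dict[str, list[str]] = {topic: [] for topic in SEED_TOPICS}
    for review in reviews:
        text = review.lower()
        for topic, keywords in SEED_TOPICS.items():
            if any(k in text for k in keywords):
                hits[topic].append(review)
    # Keep only topics that actually occur; these can then be merged into the LLM's topic list.
    return {t: r for t, r in hits.items() if r}
```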
The Solution — Pre- and Post-processing of the LLM Results
It is clear that the results from the LLMs need to be cleaned up a bit. But it is obviously not feasible to do that manually on a regular basis. The solution is to have an automated pipeline/engine that takes care of the above problems, and others similar to these. That last phrase is critical; otherwise, the same problems will keep rearing their heads in different shapes and forms.
So the solution needs to be fairly generic in nature: not just for HR departments across companies, but also for other domains and contexts.
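As an illustration of the kind of post-processing such an engine might do, here is a sketch that merges near-duplicate topic names (like ‘Tech stack’ and ‘Tech Integration’) into one canonical topic using a crude shared-word-stem check. A production engine would more likely use embeddings or a curated taxonomy; this is an assumption-laden example, not the actual engine.

```python
# Sketch: merge near-duplicate topic names coming back from separate LLM calls
# into one canonical topic. Similarity here is a crude shared-word-stem check;
# a real engine might use embeddings or a curated taxonomy instead.
def stems(topic: str) -> set[str]:
    # Crude stemming: lowercase each word and keep its first 4 letters.
    return {w.lower()[:4] for w in topic.split() if len(w) > 3}

def merge_topics(topics: list[str]) -> dict[str, str]:
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for topic in topics:
        match = next((c for c in canonical if stems(c) & stems(topic)), None)
        if match is None:
            canonical.append(topic)
            match = topic
        mapping[topic] = match
    return mapping

print(merge_topics(["Technologies", "Tech stack", "Tech Integration", "Work culture"]))
# {'Technologies': 'Technologies', 'Tech stack': 'Technologies',
#  'Tech Integration': 'Technologies', 'Work culture': 'Work culture'}
```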
Just to wrap up the topic, the screenshots below showcase how a well-designed engine can solve the above problems. The first screenshot addresses Challenge #2, and the second addresses Challenge #3. The results are highlighted in red.