A Comprehensive Empirical Study on Fairness in GraphRAG

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by supplying external knowledge. GraphRAG adapts this approach by retrieving from structured knowledge graphs, yielding semantically richer and more interpretable responses. However, as these systems are deployed in high-stakes domains, their fairness becomes a critical concern: different components of GraphRAG can introduce, amplify, or mitigate societal biases, yet little research has evaluated these effects.
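To make the retrieval step concrete, the following minimal sketch illustrates how a GraphRAG-style retriever might gather triples within a fixed number of hops of entities mentioned in a query and assemble them into a prompt. The toy graph, substring-based entity matching, and prompt format are all illustrative assumptions (using networkx), not the actual pipeline evaluated in this thesis.

```python
import networkx as nx

# Toy knowledge graph; a real GraphRAG pipeline would build this from a corpus.
kg = nx.DiGraph()
kg.add_edge("Marie Curie", "Nobel Prize", relation="won")
kg.add_edge("Marie Curie", "Physics", relation="studied")
kg.add_edge("Nobel Prize", "Sweden", relation="awarded_in")

def retrieve_subgraph(query: str, graph: nx.DiGraph, depth: int = 1) -> list[str]:
    """Collect triples within `depth` hops of entities mentioned in the query."""
    seeds = [n for n in graph.nodes if n.lower() in query.lower()]
    undirected = graph.to_undirected()
    triples = set()
    for seed in seeds:
        # ego_graph gathers the depth-hop neighborhood around the seed entity
        neighborhood = nx.ego_graph(undirected, seed, radius=depth)
        for u, v in neighborhood.edges():
            data = graph.get_edge_data(u, v)
            if data is None:  # edge was stored in the other direction
                u, v = v, u
                data = graph.get_edge_data(u, v)
            triples.add(f"({u}, {data['relation']}, {v})")
    return sorted(triples)

facts = retrieve_subgraph("What did Marie Curie win?", kg, depth=1)
prompt = ("Answer using only these facts:\n" + "\n".join(facts)
          + "\nQ: What did Marie Curie win?")
print(prompt)  # this assembled prompt would then be sent to the LLM
```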

This thesis addresses this gap by presenting a comprehensive empirical study of the fairness and accuracy of GraphRAG systems. Using the BBQ and BiasKG benchmarks, it evaluates the impact of three key components: the LLM, the retriever, and the prompt. The experiments cover a range of open-source and commercial models, different retrieval strategies (including varying retrieval depth), and character-, word-, and sentence-level prompt perturbations. The findings reveal a significant trade-off between accuracy and fairness, with no LLM excelling at both, although gpt-4.1-nano and qwen2.5 come close. The results show that retrieval strategies have a nuanced impact on performance: increasing retrieval depth often reinforces stereotypes or causes confusion, while both reranking and pruning can improve fairness depending on the context. A hedged sketch of these two operations follows.
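The sketch below shows what reranking and pruning mean at this stage: a naive lexical-overlap reranker followed by a top-k pruner over retrieved triples. Real systems would typically score with embeddings or a cross-encoder; the triples, scoring function, and choice of k are illustrative assumptions only.

```python
triples = [
    "(Marie Curie, won, Nobel Prize)",
    "(Marie Curie, studied, Physics)",
    "(Nobel Prize, awarded_in, Sweden)",
]
query = "What did Marie Curie win?"

def rerank(triples: list[str], query: str) -> list[str]:
    """Rerank triples by naive lexical overlap with the query.
    Real rerankers typically use embedding or cross-encoder scores."""
    q_tokens = set(query.lower().replace("?", "").split())
    def score(triple: str) -> int:
        t_tokens = set(triple.lower().strip("()").replace(",", "").split())
        return len(q_tokens & t_tokens)
    return sorted(triples, key=score, reverse=True)

def prune(triples: list[str], k: int) -> list[str]:
    """Prune to the top-k triples, dropping weakly related context that
    could confuse the model or reinforce stereotypes."""
    return triples[:k]

context = prune(rerank(triples, query), k=2)
print(context)
# ['(Marie Curie, won, Nobel Prize)', '(Marie Curie, studied, Physics)']
```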

Prompt perturbations also have a significant impact on fairness and accuracy: changes in sentence structure and word order severely degrade accuracy, whereas rephrasing techniques such as back translation improve both fairness and accuracy. This research contributes a framework for evaluating GraphRAG systems and provides actionable insights and recommendations for academics and practitioners. It demonstrates that fairness is not a property of any single component, but emerges from the interactions between GraphRAG's components, the knowledge graph, and the input prompts.
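The three perturbation levels can be illustrated with a short sketch: a character-level adjacent swap, a word-level shuffle, and a stubbed sentence-level back translation (a real setup would round-trip the text through a machine-translation model). The example prompt and function definitions are hypothetical, not the exact perturbations used in the experiments.

```python
import random

random.seed(0)  # fix the seed so perturbations are reproducible

def char_swap(text: str, n: int = 1) -> str:
    """Character-level perturbation: swap n pairs of adjacent characters."""
    chars = list(text)
    for _ in range(n):
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_shuffle(text: str) -> str:
    """Word-level perturbation: randomly permute the word order."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

def back_translate(text: str) -> str:
    """Sentence-level perturbation: translate to a pivot language and back.
    Stubbed here; a real setup would call a machine-translation model."""
    raise NotImplementedError("plug in an MT model or API here")

prompt = "Who is more likely to be good at math?"
print(char_swap(prompt))     # adjacent letters transposed
print(word_shuffle(prompt))  # word order scrambled, often hurting accuracy
```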

Researchers can build on this thesis by designing and evaluating new benchmarks and multi-component evaluation frameworks to further strengthen fairness in AI. For industry practitioners, this work serves as a reminder that deploying state-of-the-art models is insufficient; domain-specific evaluation and improvement are necessary to ensure a fair system. Furthermore, these findings highlight the hidden societal risks of GraphRAG-like systems, underscoring the growing need for a critical and well-informed understanding of AI. Lastly, policymakers and governments can use the insights from this thesis to mandate transparency and robust testing of all interacting components in fair and responsible systems.