Insights from the XBundle Team

Conceptual Mapping in eDiscovery: Mastering workflows with Document Clustering

February 7, 2024


During any eDisclosure process, navigating through vast document datasets can be overwhelming and time consuming. Document Clustering emerges as a game-changer, visually mapping document similarities without the need for user input. Unlike traditional search tools, Document Clustering allows you to explore and understand your data effortlessly, making it an invaluable asset during the eDisclosure lifecycle.

Document Clustering aligns with PD57AD, which emphasises the “use of software, analytical tools, and coding strategies, including technology assisted review”[1] as well as “prioritisation” (or de-prioritisation) and “workflows”.[2]

Understanding Clustering:

At its core, Clustering employs an unsupervised machine learning algorithm that analyses both words and metadata across all documents. This algorithm, which uses a bag-of-words model weighted by Term Frequency – Inverse Document Frequency (TF-IDF), determines conceptual similarity through a density-based clustering approach.[3] The result is a visualisation where documents are represented as data points, each belonging to a colour-coded cluster with associated terms hinting at underlying concepts.
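To make the idea concrete, here is a minimal sketch of TF-IDF weighting followed by a density-based grouping. It is an illustration only, not the product's implementation: the function names, the smoothed IDF formula, the DBSCAN-style procedure, and the `eps` and `min_pts` values are all assumptions chosen for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # term frequency x smoothed inverse document frequency (assumed variant)
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in counts.items()
        })
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def density_cluster(vectors, eps=0.7, min_pts=2):
    """DBSCAN-style clustering; returns one label per document, -1 means noise."""
    labels = [None] * len(vectors)

    def neighbours(i):
        return [j for j in range(len(vectors))
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not dense enough: provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels

docs = [
    ["train", "station", "dlr", "transport"],
    ["station", "transport", "dlr", "platform"],
    ["investment", "payment", "account", "loan"],
    ["payment", "loan", "investment", "interest"],
    ["lunch", "menu", "friday"],
]
labels = density_cluster(tfidf_vectors(docs))
print(labels)  # [0, 0, 1, 1, -1]
```

The two transport documents and the two finance documents each form a cluster, while the unrelated lunch note is left as noise. In a visualiser, each label would map to one colour.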

Algorithmic Magic:

The magic begins with the algorithm breaking down words and metadata into numbers, eliminating stop words and punctuation. As the algorithm compares word frequency across documents, it starts unravelling important concepts. Imagine having two documents with the word “bank.” While a simple search might tell you both contain the word 100 times, Clustering goes further. It compares the relative frequency of “bank” to other words in the document and across the entire database, distinguishing between train station Bank and Barclays Bank. Using the same example, the algorithm may notice that the train station Bank has other weighted relevant terms like “transport” and “DLR” while Barclays Bank has relevant terms such as “investment” and “payment.”
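The “bank” example can be sketched in a few lines. This is a simplified illustration under stated assumptions: the stop-word list is a tiny hypothetical one, the two sample texts are invented, and the plain `log(n / df)` weighting is just one common TF-IDF variant, not necessarily the one the Clustering tool uses.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "the", "is", "to", "from", "other", "at", "of", "and", "for", "in", "on"}

def tokenise(text):
    """Lower-case the text, drop punctuation and digits, and remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def top_terms(texts, k=2):
    """Return the k highest-weighted TF-IDF terms for each text."""
    docs = [tokenise(t) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        # A term found in every document gets log(n / n) = 0, so it carries no weight.
        weights = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        result.append(ranked[:k])
    return result

texts = [
    "Take the DLR from Bank station. The DLR is the fastest transport link; "
    "other transport from Bank is slower.",
    "The bank confirmed the investment payment. A second payment to the bank "
    "funds a further investment.",
]
print(top_terms(texts))  # [['dlr', 'transport'], ['investment', 'payment']]
```

Because “bank” appears in both documents, it receives no weight here, and the surrounding terms are what separate the two senses of the word.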

Clustering Workflows:

Clustering serves as a pivotal tool across various stages of the eDisclosure lifecycle, encompassing Early Case Assessment (ECA), prioritising reviews, task assignments, and ensuring quality control on reviewed documents. Below are recommended workflows tailored to specific scenarios where Clustering proves invaluable:

  • Data Exploration in Early Case Assessment (ECA)

Clustering can be key in ECA. When faced with a substantial document set, it can be difficult to know where to start. Being able to visualise key topic groups that appear in your corpus can be helpful to spot common trends, unanticipated trends or unrelated trends.

Once you have a high-level overview of the corpus, you can identify groups of documents to prioritise, de-prioritise, or even eliminate altogether.

  • Open Clustering for Conceptual Overview:

Use basic navigation, such as panning and zooming, within Clustering to gain a visual understanding of the top concepts. Based on conceptual similarity, the algorithm logically groups the documents and places them into clusters. After generating these clusters, the algorithm assigns a label to each document within the hierarchy, reflecting the conceptual content of the clustered documents.

  • Skim Top Terms:

By identifying the most frequently occurring terms, users can gain a better understanding of the primary themes and topics covered by the corpus without immediately delving into individual documents.

Top terms usually reflect the most relevant and prevalent concepts within the corpus. By skimming these terms, users can quickly identify keywords and topics that are likely to be of interest or importance, guiding subsequent analysis, review, and decision-making processes.

It also facilitates efficient data exploration by allowing users to focus on terms that are most relevant to their analysis. Rather than sifting through all the documents or conducting broad searches, skimming top terms directs attention to concepts that are prevalent and potentially significant, saving time and effort in the exploration phase.[4]
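As a rough illustration of how top terms might be derived, the sketch below counts terms within each cluster and keeps the most frequent ones. The function name and the input shape (tokenised documents plus one cluster label per document) are assumptions for this example; a production tool would likely rank by weighted scores rather than raw counts.

```python
from collections import Counter, defaultdict

def top_terms_by_cluster(docs, labels, k=2):
    """Return the k most frequent terms for each cluster label."""
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    return {
        # Sort by descending count, then alphabetically so ties are stable.
        label: [t for t, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, c in counts.items()
    }

docs = [
    ["dlr", "transport", "station"],
    ["dlr", "station", "platform"],
    ["payment", "investment"],
    ["payment", "loan", "investment"],
]
labels = [0, 0, 1, 1]
print(top_terms_by_cluster(docs, labels))
# {0: ['dlr', 'station'], 1: ['investment', 'payment']}
```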

  • Utilise Data Visualiser Properties:

Employ various data visualiser properties to understand document distribution based on attributes like Custodian or Doc Type.[5]

This targeted approach streamlines the review and ensures that resources are allocated efficiently.

By visualising document properties, users can discern relationships and correlations between various attributes. For example, they can observe whether key custodians are associated with specific document types or whether there are temporal trends in the distribution of documents. These insights can give you a better understanding of your corpus and help you identify relevant clusters for further analysis.
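The kind of breakdown described above can be approximated with a simple cross-tabulation of cluster against a document attribute. The field names used here (`cluster`, `custodian`, `doc_type`) and the sample records are hypothetical; they stand in for whatever metadata your review platform exposes.

```python
from collections import Counter, defaultdict

def distribution_by_attribute(documents, attribute):
    """Count documents per cluster, broken down by one metadata attribute."""
    table = defaultdict(Counter)
    for doc in documents:
        # Documents missing the attribute are grouped under "Unknown".
        table[doc["cluster"]][doc.get(attribute, "Unknown")] += 1
    return {cluster: dict(counts) for cluster, counts in table.items()}

# Hypothetical records for illustration only.
documents = [
    {"cluster": 0, "custodian": "A. Smith", "doc_type": "Email"},
    {"cluster": 0, "custodian": "A. Smith", "doc_type": "Spreadsheet"},
    {"cluster": 0, "custodian": "B. Jones", "doc_type": "Email"},
    {"cluster": 1, "custodian": "B. Jones", "doc_type": "Email"},
    {"cluster": 1, "doc_type": "PDF"},
]

print(distribution_by_attribute(documents, "custodian"))
print(distribution_by_attribute(documents, "doc_type"))
```

A cluster dominated by a single custodian or file type is often a quick candidate for prioritisation or de-prioritisation.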

  • Strengthening Predictions with Clustering and Predictive Coding

PD57AD, paragraph 9.6(3)(a) emphasises that parties must consider the use of “software or analytical tools, including technology assisted review (TAR)” to reduce the burden and cost of the disclosure exercise. Additionally, the explanatory note in section 2 of the DRD explains that parties should use TAR to conduct a proportionate review of the data set. As a result, parties must consider how to streamline their review by combining powerful analytical tools such as Clustering and predictive coding.[6]

By drawing on the document clusters, the predictive coding algorithm can access a more diverse and comprehensive set of training data. This helps the predictive model capture a broader range of document variations and nuances, leading to more accurate predictions, which in turn increases efficiency and reduces the time and resources required for manual review.[7]
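One way to obtain that diversity is to draw training documents from every cluster rather than at random from the whole corpus. The sketch below shows this stratified sampling idea; it is an assumption about how the two tools could be combined, not a description of any particular product's behaviour, and `sample_across_clusters`, `per_cluster` and `seed` are illustrative names.

```python
import random
from collections import defaultdict

def sample_across_clusters(labels, per_cluster=2, seed=0):
    """Pick up to `per_cluster` document indices from every cluster."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_cluster = defaultdict(list)
    for doc_id, label in enumerate(labels):
        by_cluster[label].append(doc_id)
    sample = []
    for label in sorted(by_cluster):
        ids = by_cluster[label]
        # Small clusters contribute everything they have.
        sample.extend(rng.sample(ids, min(per_cluster, len(ids))))
    return sorted(sample)

# Six documents in cluster 0, three in cluster 1, one in cluster 2.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
training_ids = sample_across_clusters(labels)
print(training_ids)
```

Even though cluster 0 holds most of the documents, the smaller clusters are still represented in the seed set that reviewers code first.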

Additionally, by analysing the documents within each cluster of conceptually related documents, the predictive coding algorithm can achieve a better understanding of the themes, topics, and contexts present in the corpus. This improved understanding enhances its ability to accurately predict relevant documents for review.

Clustering enables iterative improvement of predictive coding models over time. As reviewers interact with the system and provide feedback on document relevance, Clustering algorithms can dynamically adjust and refine the predictive coding models. This iterative learning process enhances the accuracy and effectiveness of the predictions, leading to more reliable outcomes in subsequent review batches generated by the predictive coding model.


Document clustering stands as a transformative tool within the eDisclosure lifecycle, offering a solution to the overwhelming task of navigating extensive document datasets. By visually mapping document similarities without the need for extensive user input, document clustering streamlines data exploration and comprehension in a way that traditional search methods do not. Its alignment with PD57AD underscores its significance in facilitating prioritisation, workflows, and the use of technology-assisted review strategies.

As eDisclosure processes continue to evolve, leveraging Clustering proves instrumental in enhancing efficiency, accuracy, and overall outcomes throughout the eDisclosure lifecycle. Embracing Clustering empowers legal teams to navigate complex datasets with ease, ultimately driving more informed decision-making and achieving greater success in legal proceedings.


[1] PD57AD, paragraph 9.6(3) (a – b)

[2] PD57AD, paragraph 9.6(4)

[3] TF-IDF stands for Term Frequency – Inverse Document Frequency, a commonly utilised statistical method in natural language processing and information retrieval. It assesses the importance of a term in a document relative to its frequency in the entire corpus.

[4] PD57AD, 6.4(7) – “The need to ensure the case is dealt with expeditiously, fairly and at a proportionate cost”.

[5] PD57AD, 9.6(e) – “Documents responsive to specific keyword searches, or other automated searches (by reference, if appropriate, to individual custodians, creators, repositories, file types and/or date ranges, concepts)”.

[6] PD57AD, 9.6(3)(2) and the Explanatory note to Section 2 of the DRD.

[7] PD57AD, 9.6 – the parties should seek to “reduce the burden and cost of the disclosure exercise”.