DSARs: Data Deduplication and Near-Deduplication

Data Deduplication and Near-Deduplication automatically cull duplicate content and group near-duplicate content when processing DSARs.

Data Deduplication and Near-Deduplication can increase operational efficiency and transform messy data with unrivalled speed and accuracy. One of the biggest challenges facing the modern Data Subject Access Request (DSAR) process is the sheer number of documents you must look through. Despite the technological advancements of recent years, document review remains the slowest and costliest phase of responding to a DSAR. Even small DSAR cases can involve tens of thousands of documents, and reviewing them is a very time-consuming task.

At no point during a DSAR project do you want to find that you have reviewed the same document, or nearly the same document, repeatedly. To avoid this situation, CYFOR’s DSAR solution includes deduplication and near-deduplication, which automatically cull duplicate content and group near-duplicate content. On average, deduplication alone can reduce the overall review population by 30–40%. Ultimately, the fewer documents there are in the pile, the faster and cheaper the review can be completed.


How does Deduplication work?

Deduplication detects exact copies of a document and removes all but one of them. The deduplication tool reads the metadata accompanying each file and calculates a hash value from fields such as the document’s creation date, its sender or author, or the email header. The software then compares the hash values, identifies documents whose hashes match, and eliminates the unnecessary copies.
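
As an illustration only, the hash-and-compare step described above can be sketched in a few lines of Python. This is not CYFOR’s implementation: the field names (`created`, `author`, `subject`), the choice of SHA-256, and the helper names `metadata_hash` and `deduplicate` are all assumptions made for the example.

```python
import hashlib

# Hypothetical metadata fields; a real tool will choose its own.
HASH_FIELDS = ("created", "author", "subject")

def metadata_hash(doc: dict) -> str:
    """Return a SHA-256 hash built from normalised metadata fields."""
    # Join the fields with a separator that will not appear in the values.
    key = "\x1f".join(str(doc.get(f, "")).strip().lower() for f in HASH_FIELDS)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def deduplicate(docs: list) -> list:
    """Keep the first document seen for each hash and drop the rest."""
    seen = set()
    unique = []
    for doc in docs:
        h = metadata_hash(doc)
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = [
    {"id": 1, "created": "2023-01-05", "author": "a.smith@example.com", "subject": "Invoice"},
    {"id": 2, "created": "2023-01-05", "author": "A.Smith@example.com", "subject": "Invoice"},
    {"id": 3, "created": "2023-02-10", "author": "b.jones@example.com", "subject": "Contract"},
]
unique_docs = deduplicate(docs)
```

In this sketch, documents 1 and 2 produce the same hash once the author field is lower-cased, so only documents 1 and 3 remain in `unique_docs`.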


How does Near-Deduplication work?

Near-Deduplication (or near-duplicate detection) is a method of clustering similar documents together. Unlike deduplication, which analyses metadata, it compares the actual text within documents and creates bundles of documents with similar text. For example, successive versions of a contract might be very similar yet contain important variations. During this process, near-deduplication does not discard anything from the pile, but simply groups related documents together so that a reviewer can consider all of them at once. By grouping similar documents together, you can quickly assess an entire pile of related documents and determine whether they need further examination or whether they can simply be discarded.
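
To make the idea of grouping by text similarity concrete, here is a minimal Python sketch. It uses one common technique, word shingles compared with Jaccard similarity, which is an assumption for illustration and not necessarily the method any particular DSAR tool uses. The function names, the three-word shingle size, and the 0.5 threshold are likewise hypothetical.

```python
def shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping runs of k lower-cased words."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Share of shingles that the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_near_duplicates(texts: list, threshold: float = 0.5) -> list:
    """Greedily place each text in the first group whose first member is similar enough.

    Nothing is discarded: every index appears in exactly one group.
    """
    groups = []  # each entry: (shingles of the group's first member, member indices)
    for i, text in enumerate(texts):
        s = shingles(text)
        for rep, members in groups:
            if jaccard(s, rep) >= threshold:
                members.append(i)
                break
        else:
            groups.append((s, [i]))
    return [members for _, members in groups]

texts = [
    "the supplier shall deliver the goods within thirty days of the order",
    "the supplier shall deliver the goods within sixty days of the order",
    "minutes of the quarterly board meeting held in manchester",
]
groups = group_near_duplicates(texts)
```

Here the two contract clauses differ by a single word and share 7 of their 13 distinct shingles, so they fall into one group, while the meeting minutes form a group of their own. A reviewer would then see the two clauses side by side and could spot the "thirty" versus "sixty" variation.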



Not only do Deduplication and Near-Deduplication save time and money, they also improve the consistency of results. Rather than several reviewers each coming across identical or nearly identical documents, exact duplicates are removed before review and one reviewer can consider all the near-identical documents at once, ensuring that consistent and efficient decisions are made throughout the DSAR review.

If your existing DSAR solution does not identify both duplicates and near-duplicates, CYFOR can help. To find out more about our end-to-end DSAR service, book your FREE DSAR demonstration today.
