Find Duplicate Text

Duplicate records cause numerous problems for a business and, on top of that, waste a great deal of effort. For example, a client may want to find similar records based on a few columns so that duplicates can be eliminated, which reduces processing and speeds up decisions.


Generally, fixing duplicate records is a manual process that is both tedious and costly. Unless all the details are identical, it is hard to say whether records are duplicate or not. Typically, most potential duplicates are false positives.

Database queries for duplicates will not catch spelling mistakes, typos, a few changed values, or rephrasing.

This is where artificial intelligence (AI) steps in. We can create and train a machine learning algorithm that uses a matching score to find duplicate records. Once trained, the model predicts whether or not records are duplicates.
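
As a rough illustration of what a matching score looks like, the sketch below uses Python's standard difflib module to score every pair of records and flag those above a threshold. It is only a stand-in for the trained model described above, not that model itself; the sample records, the function names and the 0.8 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def match_score(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings."""
    # Normalize case and whitespace so trivial differences do not lower the score.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def find_duplicates(records, threshold=0.8):
    """Return (i, j, score) for every pair scoring at or above the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = match_score(records[i], records[j])
            if score >= threshold:
                pairs.append((i, j, round(score, 2)))
    return pairs


# Hypothetical sample data: the second record has a typo and an abbreviation.
records = [
    "Acme Corporation, 12 Main Street",
    "Acme Corporaton, 12 Main St.",
    "Globex Inc, 400 Oak Avenue",
]
print(find_duplicates(records))
```

A SQL LIKE query would miss the second record because of the typo, while the similarity score still pairs it with the first. Comparing every pair is quadratic in the number of records, so this approach only suits small data sets.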

An AI model can be built and trained to suit the customer's requirements. Here I will focus on Python, Amazon Elasticsearch and Azure Search, and we shall look at various options for:
  • Match query
  • Exact search
  • Percentage-based score

Before you make any decision, please answer the questions below:
  • Do you have people with machine learning (Python) skills?
  • Do you have the required infrastructure?
  • Do you have a post-production support team that can manage and fix the solution if required?
  • Do you have skilled testers who can validate machine learning test results?

If the answer to any of the above questions is No, don’t worry. There are powerful AWS and Azure services, such as Amazon Elasticsearch and Azure Search, that you can leverage to achieve similar functionality.

The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.
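
To make the idea of per-term weights concrete, here is a simplified TF-IDF sketch in plain Python. It is an illustration only: real search engines use more elaborate scoring formulas, and the sample documents, the whitespace tokenizer and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def idf(term, docs_tokens):
    """Rarer terms get a higher weight (smoothed inverse document frequency)."""
    n_docs = len(docs_tokens)
    doc_freq = sum(1 for tokens in docs_tokens if term in tokens)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def relevance(query, doc_tokens, docs_tokens):
    """Sum the weight of each query term that appears in the document."""
    counts = Counter(doc_tokens)
    score = 0.0
    for term in set(tokenize(query)):
        tf = counts[term] / len(doc_tokens) if doc_tokens else 0.0
        score += tf * idf(term, docs_tokens)
    return score


# Hypothetical documents for illustration.
docs = [
    "duplicate customer record found in billing",
    "customer invoice sent",
    "weather report for monday",
]
docs_tokens = [tokenize(d) for d in docs]
scores = [relevance("duplicate customer record", t, docs_tokens) for t in docs_tokens]
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The first document contains all three query terms and ranks highest, the second matches only the common term "customer", and the third matches nothing and scores zero.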

Here I am sharing the results of the tests I performed using Python, Azure Search and Elasticsearch; they should help you decide which one to choose.

To perform the tests, we rewrote the original text, added more complexity using online tools, and then searched for the rewritten text against the original.

                        Python   Azure Search   Elasticsearch
Total Tests Performed   45       45             45
Top Result              36       38             44
Performance (sec)       10       3-4            2-3

In the results, you can see that Elasticsearch is the most powerful: it found almost all of the texts (44 of 45), and its performance is also better than that of the other two.

Now let's focus on Elasticsearch and see how to perform different kinds of search.

1. This AWS Elasticsearch query helps find an exact match, similar to a SQL LIKE query.

SQL query:
SELECT ColumnName FROM Table WHERE Field LIKE '%Search Text%';

Below is the equivalent Elasticsearch query:
GET /_search
{
   "query": {
      "query_string": {
         "default_field": "column name",
         "query": "search query"
      }
   }
}

Executed using Browser:
http://localhost:9200/idea/_search?q="Text Query"
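
The same two requests can be prepared from Python. The sketch below uses only the standard library and is a minimal example rather than a recommended client: the host and the index name "idea" are taken from the URL above, while the field name "title" and the helper names are hypothetical. The search function is defined but not called, since it needs a running cluster.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def query_string_body(field, text):
    """Build the request body for a query_string search on one field."""
    return {"query": {"query_string": {"default_field": field, "query": text}}}


def uri_search_url(base_url, index, text):
    """Build the browser-style URL; the double quotes request a phrase match."""
    phrase = f'"{text}"'
    return f"{base_url}/{index}/_search?q={quote(phrase)}"


def search(base_url, index, body):
    """Send the body to the _search endpoint (requires a reachable cluster)."""
    request = Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.load(response)


# "title" is a hypothetical field name used only for illustration.
body = query_string_body("title", "search query")
print(json.dumps(body, indent=3))
print(uri_search_url("http://localhost:9200", "idea", "Text Query"))
```

Percent-encoding the quoted phrase keeps the URL valid when it is sent from code rather than typed into a browser.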

2. Find a percentage-based score rather than a relevance score: Elasticsearch can return results based on a defined threshold.
GET /_search
{
   "query": {
      "multi_match": {
         "query": "Search Query",
         "fields": [
            "ColumnName"
         ],
         "fuzziness": "AUTO",
         "minimum_should_match": "80%"
      }
   }
}
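
To see what the two settings in this query mean in practice, the sketch below mimics the documented rules in plain Python: a percentage in minimum_should_match is converted to a number of terms and rounded down, and "AUTO" fuzziness allows 0, 1 or 2 edits depending on term length. It approximates the behaviour under the assumption that the query is split on whitespace (the real analyzer may tokenize differently), and the function names are hypothetical.

```python
def required_terms(query, percent):
    """Number of query terms that must match for a given percentage.

    Assumes whitespace tokenization; the computed value is rounded down.
    """
    terms = query.split()
    return len(terms) * percent // 100


def auto_fuzziness(term):
    """Edits allowed per term under "AUTO" with the default length thresholds."""
    length = len(term)
    if length <= 2:
        return 0  # very short terms must match exactly
    if length <= 5:
        return 1  # one edit allowed
    return 2      # two edits allowed


query = "find duplicate customer records quickly"
print(required_terms(query, 80))
print({term: auto_fuzziness(term) for term in query.split()})
```

With an 80% threshold, a five-term query needs four matching terms and a four-term query needs three, so shorter queries are effectively held to a looser threshold.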

Comparison: Python Model Built Using the Gensim Library vs Elasticsearch
Test SR#   Python Custom Result   Elasticsearch Result
Test 1     80%                    Result with > 0.80% Match
Test 2     84%                    Result with > 0.70% Match
Test 3     88%                    Result with > 0.80% Match
Test 4     81%                    Result with > 0.80% Match
Test 5     0                      Not Found
Test 6     84%                    Result with > 0.80% Match
Test 7     0                      Not Found
Test 8     81%                    Result with > 0.75% Match
Test 9     0                      Not Found
Test 10    0                      Not Found
Test 11    0                      Not Found
Test 12    0                      Not Found
Test 13    81%                    Result with > 0.79% Match
Test 14    84%                    Result with > 0.80% Match
Test 15    88%                    Result with > 0.80% Match
Test 16    96%                    Result with > 0.80% Match
Test 17    96%                    Result with > 0.80% Match
Test 18    91%                    Result with > 0.76% Match
Test 19    92%                    Result with > 0.80% Match
Test 20    90%                    Result with > 0.80% Match
Test 21    89%                    Result with > 0.80% Match
Test 22    97%                    Result with > 0.80% Match
Test 23    0                      Not Found
Test 24    95%                    Result with > 0.80% Match
Test 25    83%                    Result with > 0.80% Match

Conclusion
It is very difficult to say which result is better when comparing Python with Elasticsearch. My recommendation is to use Elasticsearch, as it is a high-quality, widely recommended solution that has been proven in the market for many years.
