I'm currently migrating a large data set (> 100 MM records) from Oracle to Elasticsearch.
I'm using the bulk API, which works perfectly, but now that all the data has been migrated I want to clean up a little by removing duplicates (generated by problems during the migration process, which took about 2 days, and I don't want to start over).
I can see all my duplicates with a query like this (using Sense):
GET myindex/mytype/_search?search_type=count
{
  "aggregations": {
    "duplicates": {
      "terms": {
        "field": "message_id",
        "min_doc_count": 2,
        "size": 100
      }
    }
  }
}
But I'm having a lot of trouble finding a way to delete those with delete-by-query. I need to delete the duplicates while leaving one copy: if I have 2 records with the message_id XXXX, I need to delete just one of them so that 1 remains in ES.
Do you know a way to achieve this?
Any help is appreciated.
Find the ID of one document you want to keep; you can then use Delete by Query with a not filter.
For example, if you have 3 documents with doc IDs 1, 2 and 3 that all share the same messageId
of 13, and you want to keep document 1, you can run this query:
DELETE /yourIndex/yourType/_query
{
  "query": {
    "filtered": {
      "query": {
        "term": {
          "messageId": "13"
        }
      },
      "filter": {
        "not": {
          "term": {
            "_id": 1
          }
        }
      }
    }
  }
}
Doc 2 and doc 3 will be deleted and doc 1 will still be present in the index. Test this out locally first.
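To extend this to every duplicated message_id rather than one at a time, the "keep one, delete the rest" bookkeeping can be sketched in plain Python. This is a minimal sketch: the `ids_to_delete` helper is hypothetical, and in practice you would feed it the doc ID / message_id pairs returned by searches on the aggregation's duplicate buckets, then issue one delete (or a bulk delete) per returned ID.

```python
from collections import defaultdict

def ids_to_delete(hits):
    """Given (doc_id, message_id) pairs, return the doc IDs to delete
    so that exactly one document survives per message_id."""
    by_message = defaultdict(list)
    for doc_id, message_id in hits:
        by_message[message_id].append(doc_id)
    doomed = []
    for doc_ids in by_message.values():
        # Keep the first ID seen; schedule the others for deletion.
        doomed.extend(doc_ids[1:])
    return doomed

# Example: message_id 13 appears on docs 1, 2 and 3, so docs 2 and 3
# are marked for deletion; message_id 7 is unique, so doc 4 is kept.
print(ids_to_delete([(1, 13), (2, 13), (3, 13), (4, 7)]))  # → [2, 3]
```

Whether you delete via delete-by-query with a not filter as above, or by individual doc ID as in this sketch, run it against a test index first to confirm exactly one copy survives.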