ES deleting duplicates

William Añez

I'm currently migrating some data (> 100MM) from Oracle to Elasticsearch.

I'm using the bulk API, which is working perfectly, but now that I have migrated all the data I want to clean up a little by removing the duplicates (generated by problems in the migration process, which took about 2 days, and I don't want to start over).

I can see all my duplicates with a query like this (using Sense):

GET myindex/mytype/_search?search_type=count
{
  "aggregations": {
    "duplicates": {
      "terms": {
        "field": "message_id",
        "min_doc_count": 2,
        "size": 100
      }
    }
  }
}
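
For reference, a terms aggregation returns its results as a list of buckets, one per duplicated message_id. The sketch below shows one way to pull those keys out in Python. The function name `duplicated_keys` and the sample response are hypothetical, and the response is trimmed to the fields a terms aggregation normally returns.

```python
def duplicated_keys(response, agg_name="duplicates"):
    """Return (key, doc_count) pairs for every bucket holding 2 or more documents."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets if b["doc_count"] >= 2]


# Hypothetical response body, trimmed to the parts used here (not real data).
sample_response = {
    "aggregations": {
        "duplicates": {
            "buckets": [
                {"key": "XXXX", "doc_count": 2},
                {"key": "YYYY", "doc_count": 3},
            ]
        }
    }
}

for key, count in duplicated_keys(sample_response):
    # count - 1 copies would have to go in order to leave a single document.
    print(key, "has", count - 1, "extra copies")
```

Note that "size": 100 limits the aggregation to 100 buckets per request, so the query would need to be repeated as duplicates are cleaned up.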

But I'm having a lot of problems finding a way to delete them using delete by query. I need to delete the duplicates while leaving one copy: if I have 2 records with the message_id XXXX, I need to delete just one so that 1 stays in ES.

Do you know a way to achieve this?

Any help is much appreciated.

Dan Tuffery

Find the ID of one document you want to keep; you can then use a Delete by Query with a Not filter.

For example, if you have 3 documents with doc IDs 1, 2 and 3, all with the same messageId of 13, and you want to keep document 1, you can run this query:

DELETE /yourIndex/yourType/_query
{
    "query": {
        "filtered": {
            "query": {
                "term": {
                    "messageId": "13"
                }
            },
            "filter": {
                "not": {
                    "term": {
                        "_id": 1
                    }
                }
            }
        }
    }
}

Doc 2 and doc 3 will be deleted and doc 1 will still be present in the index. Test this out locally first.
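
With many duplicated values you would probably want to generate this request body rather than write it by hand. Here is a minimal Python sketch, assuming an Elasticsearch version that still accepts the `filtered` query and `not` filter used above. The helper names `build_delete_query` and `ids_to_delete` are hypothetical, and nothing here talks to a cluster: it only builds the body and shows which IDs the query is expected to match.

```python
import json


def build_delete_query(message_id, keep_id, field="messageId"):
    """Build the delete-by-query body: match the field value, exclude the kept _id."""
    return {
        "query": {
            "filtered": {
                "query": {"term": {field: message_id}},
                "filter": {"not": {"term": {"_id": keep_id}}},
            }
        }
    }


def ids_to_delete(doc_ids, keep_id):
    """Local check of which document IDs the query should remove."""
    return [doc_id for doc_id in doc_ids if doc_id != keep_id]


body = build_delete_query("13", 1)
print(json.dumps(body, indent=4))

# For the example above: documents 1, 2, 3 share messageId 13 and document 1 is kept.
print(ids_to_delete([1, 2, 3], keep_id=1))
```

The printed body can be sent as the payload of the DELETE request shown above. Pass field="message_id" if that is the field name in your mapping.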
