I'm currently migrating a large data set (> 100 MM records) from Oracle to Elasticsearch.
I'm using the bulk API, which works perfectly, but now that all the data has been migrated I want to clean up a little by removing duplicates (generated by problems during the migration process, which took about 2 days, and I don't want to start over).
I can see all my duplicates with a query like this (using Sense):
GET myindex/mytype/_search?search_type=count
{
  "aggregations": {
    "duplicates": {
      "terms": {
        "field": "message_id",
        "min_doc_count": 2,
        "size": 100
      }
    }
  }
}
But I'm having a lot of trouble finding a way to delete those with delete-by-query. I need to delete the duplicates while leaving one copy: if I have 2 records with the message_id XXXX, I need to delete just one of them so that 1 remains in ES.
Do you know a way to achieve this?
Any help is appreciated.
Find the ID of one document you want to keep; you can then use Delete by Query with a not filter.
For example, if you have 3 documents with doc IDs 1, 2 and 3 that all share the same messageId
of 13, and you want to keep document 1, you can run this query:
DELETE /yourIndex/yourType/_query
{
  "query": {
    "filtered": {
      "query": {
        "term": {
          "messageId": "13"
        }
      },
      "filter": {
        "not": {
          "term": {
            "_id": 1
          }
        }
      }
    }
  }
}
Doc 2 and doc 3 will be deleted and doc 1 will still be present in the index. Test this out locally first.
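To extend this to every duplicated message_id rather than one at a time, the "keep one, delete the rest" bookkeeping can be sketched in plain Python. This is a minimal sketch: the `ids_to_delete` helper is hypothetical, and in practice you would feed it the doc ID / message_id pairs returned by searches on the aggregation's duplicate buckets, then issue one delete (or a bulk delete) per returned ID.

```python
from collections import defaultdict

def ids_to_delete(hits):
    """Given (doc_id, message_id) pairs, return the doc IDs to delete
    so that exactly one document survives per message_id."""
    by_message = defaultdict(list)
    for doc_id, message_id in hits:
        by_message[message_id].append(doc_id)
    doomed = []
    for doc_ids in by_message.values():
        # Keep the first ID seen; schedule the others for deletion.
        doomed.extend(doc_ids[1:])
    return doomed

# Example: message_id 13 appears on docs 1, 2 and 3, so docs 2 and 3
# are marked for deletion; message_id 7 is unique, so doc 4 is kept.
print(ids_to_delete([(1, 13), (2, 13), (3, 13), (4, 7)]))  # → [2, 3]
```

Whether you delete via delete-by-query with a not filter as above, or by individual doc ID as in this sketch, run it against a test index first to confirm exactly one copy survives.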