Deleting large amounts of data from MongoDB

justynnuff

I have the following code that currently works. It goes through and finds every file that's newer than a specified date and whose name matches a regex, then deletes it along with the chunks that point to it.

conn = new Mongo("<url>");
db = conn.getDB("<project>");

res = db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/});
while (res.hasNext()) {
    var tmp = res.next();
    db.getCollection('fs.chunks').remove({"files_id" : tmp._id});
    db.fs.files.remove({ "_id" : tmp._id});
}

It's extremely slow, and most of the time, the client I'm running it from just times out.

Also, I know that I'm deleting files from the filesystem, and not from the normal collections. It's a long story, but the code above does exactly what I want it to do.

How can I get this to run faster? It was pointed out to me earlier that I'm running this code on a client; is it possible to run it on the server side instead? Before this I was trying to use the JavaScript driver, which is probably why that came up. I assume using the mongo shell executes everything on the server.

Any help would be appreciated. So close, yet so far...

Stennie

I know that I'm deleting files from the filesystem, and not from the normal collections

GridFS is a specification for storing binary data in MongoDB, so you are actually deleting documents from MongoDB collections rather than files from the filesystem.
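For example, you can inspect both GridFS collections directly from the mongo shell; the field layout shown in the comments below follows the GridFS specification:

    // fs.files holds one metadata document per stored file
    db.fs.files.findOne()
    // e.g. { _id: ObjectId(...), length: ..., chunkSize: 261120,
    //        uploadDate: ISODate(...), filename: "example.png", ... }

    // fs.chunks holds the binary content, split into chunk documents
    // that reference the parent file via files_id
    db.fs.chunks.findOne({}, { data: 0 })
    // e.g. { _id: ObjectId(...), files_id: ObjectId(...), n: 0 }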

It was pointed out to me earlier that I'm running this code on a client; is it possible to run it on the server side instead?

The majority of your code (queries and commands) is being executed by your MongoDB server. The client (mongo shell, in this case) isn't doing any significant processing.

It's extremely slow, and most of the time, the client I'm running it from just times out.

You need to investigate where the time is being spent.
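For example, you could start by checking whether the fs.files query can use an index at all. A quick sketch using the mongo shell's explain output (available in this form for MongoDB 3.0+):

    // Inspect the query plan for the fs.files query
    db.fs.files.find(
        { uploadDate: { $gte: new ISODate("2017-04-04") } }
    ).explain("executionStats")
    // A COLLSCAN stage in winningPlan means every document is scanned;
    // executionStats.executionTimeMillis shows how long the query takes on the server.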

If there is problematic network latency between your mongo shell and your deployment, you could consider running the query from a mongo shell session closer to the deployment (if possible) or use query criteria matching a smaller range of documents.

Another obvious candidate to look into would be server resources. For example, is deleting a large number of documents putting pressure on your I/O or RAM? Reducing the number of documents you delete in each script run may also help in this case.
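As a minimal sketch of that last point, you could cap how many files each run handles (the 1000 below is an arbitrary example value; the filename criteria are covered separately below) and simply rerun the script until nothing matches:

    // Only process up to 1000 matching fs.files documents per script run
    var res = db.fs.files.find(
        { uploadDate: { $gte: new ISODate("2017-04-04") } }
    ).limit(1000);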

db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/})

This query likely isn't doing what you intended: the filename regex is being passed as the second argument to find(), so it acts as a projection rather than as search criteria, and /.*(png)/ matches a filename containing "png" anywhere (for example: typng.doc). A corrected query is included in the suggestions below.

I assume using the mongo shell executes everything on the server.

That's an incorrect general assumption. The mongo shell can evaluate local functions, so depending on your code there may be aspects that are executed/evaluated in a client context rather than a server context. Your example code is running queries/commands which are processed on the server, but fs.files documents returned from your find() query are being accessed in the mongo shell in order to construct the query to remove related documents in fs.chunks.

How can I get this to run faster?

In addition to the comments noted above, there are a few code changes you can make to improve efficiency. In particular, you are currently removing chunk documents individually; the Bulk API available in MongoDB 2.6+ reduces the round trips required per batch of deletes.

Some additional suggestions to try to improve the speed (a combined sketch follows the list):

  • Add an index on {uploadDate:1, filename: 1} to support your find() query:

    db.fs.files.createIndex({uploadDate:1, filename: 1})
    
  • Use the Bulk API to remove matching chunk documents rather than individual removes:

    while (res.hasNext()) {
        var tmp = res.next();

        // Queue the removal of every chunk belonging to this file and
        // execute the queued removes in a single round trip
        var bulk = db.fs.chunks.initializeUnorderedBulkOp();
        bulk.find( {"files_id" : tmp._id} ).remove();
        bulk.execute();

        db.fs.files.remove({ "_id" : tmp._id});
    }
    
  • Add a projection to the fs.files query to only include the fields you need:

    var res = db.fs.files.find(
       // query criteria
       {
           uploadDate: { $gte: new ISODate("2017-04-04") },
    
           // Filenames that end in png
           filename: /\.png$/
       },
    
       // Only include the _id field
       { _id: 1 }
    )
    

    Note: unless you've added a lot of metadata to your GridFS files (or have a lot of files to remove) this may not have a significant impact. The default fs.files documents are ~130 bytes, but the only field you require is _id (a 12 byte ObjectId).
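Putting these suggestions together, here is a sketch of what the full script might look like. It is untested against your data, assumes MongoDB 2.6+ for the Bulk API and that the index above has been created, and additionally batches the fs.files removes into a single $in query rather than removing them one at a time:

    var conn = new Mongo("<url>");
    var db = conn.getDB("<project>");

    // Only fetch the _id of each matching file (filenames ending in .png,
    // uploaded on or after the cutoff date)
    var res = db.fs.files.find(
        {
            uploadDate: { $gte: new ISODate("2017-04-04") },
            filename: /\.png$/
        },
        { _id: 1 }
    );

    // Queue all chunk removals into one unordered bulk operation
    var bulk = db.fs.chunks.initializeUnorderedBulkOp();
    var fileIds = [];

    while (res.hasNext()) {
        var fileId = res.next()._id;
        fileIds.push(fileId);
        bulk.find({ files_id: fileId }).remove();
    }

    if (fileIds.length > 0) {
        // Remove the chunks, then the matching fs.files documents in one call
        bulk.execute();
        db.fs.files.remove({ _id: { $in: fileIds } });
    }

For very large result sets you could combine this with the per-run limit() mentioned earlier so that each execution stays bounded.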
