Deleting large amounts of data from MongoDB

justynnuff

I have the following code that currently works. It goes through and finds every file that's newer than a specified date and whose name matches a regex, then deletes it along with the chunks that point to it.

conn = new Mongo("<url>");
db = conn.getDB("<project>");

res = db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/});
while (res.hasNext()) {
    var tmp = res.next();
    db.getCollection('fs.chunks').remove({"files_id" : tmp._id});
    db.fs.files.remove({ "_id" : tmp._id});
}

It's extremely slow, and most of the time, the client I'm running it from just times out.

Also, I know that I'm deleting files from the filesystem, and not from the normal collections. It's a long story, but the code above does exactly what I want it to do.

How can I get this to run faster? It was pointed out to me earlier that I'm running this code on a client; is it possible to run it on the server side instead? Before this I was trying to use the JavaScript driver, which is probably why that came up. I assume using the mongo shell executes everything on the server.

Any help would be appreciated. So close, yet so far...

Stennie

I know that I'm deleting files from the filesystem, and not from the normal collections

GridFS is a specification for storing binary data in MongoDB, so you are actually deleting documents from MongoDB collections rather than files from the filesystem.
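For example, you can inspect both GridFS collections directly from the mongo shell; the field layout shown in the comments below follows the GridFS specification:

    // fs.files holds one metadata document per stored file
    db.fs.files.findOne()
    // e.g. { _id: ObjectId(...), length: ..., chunkSize: 261120,
    //        uploadDate: ISODate(...), filename: "example.png", ... }

    // fs.chunks holds the binary content, split into chunk documents
    // that reference the parent file via files_id
    db.fs.chunks.findOne({}, { data: 0 })
    // e.g. { _id: ObjectId(...), files_id: ObjectId(...), n: 0 }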

It was pointed out to me earlier that I'm running this code on a client; is it possible to run it on the server side instead?

The majority of your code (queries and commands) is being executed by your MongoDB server. The client (mongo shell, in this case) isn't doing any significant processing.

It's extremely slow, and most of the time, the client I'm running it from just times out.

You need to investigate where the time is being spent.
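For example, you could start by checking whether the fs.files query can use an index at all. A quick sketch using the mongo shell's explain output (available in this form for MongoDB 3.0+):

    // Inspect the query plan for the fs.files query
    db.fs.files.find(
        { uploadDate: { $gte: new ISODate("2017-04-04") } }
    ).explain("executionStats")
    // A COLLSCAN stage in winningPlan means every document is scanned;
    // executionStats.executionTimeMillis shows how long the query takes on the server.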

If there is problematic network latency between your mongo shell and your deployment, you could consider running the query from a mongo shell session closer to the deployment (if possible) or use query criteria matching a smaller range of documents.

Another obvious candidate to look into would be server resources. For example, is deleting a large number of documents putting pressure on your I/O or RAM? Reducing the number of documents you delete in each script run may also help in this case.
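As a minimal sketch of that last point, you could cap how many files each run handles (the 1000 below is an arbitrary example value; the filename criteria are covered separately below) and simply rerun the script until nothing matches:

    // Only process up to 1000 matching fs.files documents per script run
    var res = db.fs.files.find(
        { uploadDate: { $gte: new ISODate("2017-04-04") } }
    ).limit(1000);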

db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/})

This query likely isn't doing what you intended: the filename regex is being passed as the second argument to find(), so it acts as a projection rather than as search criteria, and /.*(png)/ matches a filename containing "png" anywhere (for example: typng.doc). A corrected query is included in the suggestions below.

I assume using the mongo shell executes everything on the server.

That's an incorrect general assumption. The mongo shell can evaluate local functions, so depending on your code there may be aspects that are executed/evaluated in a client context rather than a server context. Your example code is running queries/commands which are processed on the server, but fs.files documents returned from your find() query are being accessed in the mongo shell in order to construct the query to remove related documents in fs.chunks.

How can I get this to run faster?

In addition to the comments noted above, there are a few code changes you can make to improve efficiency. In particular, you are currently removing chunk documents individually; the Bulk API available in MongoDB 2.6+ reduces the round trips required per batch of deletes.

Some additional suggestions to try to improve the speed (a combined sketch follows the list):

  • Add an index on {uploadDate:1, filename: 1} to support your find() query:

    db.fs.files.createIndex({uploadDate:1, filename: 1})
    
  • Use the Bulk API to remove matching chunk documents rather than individual removes:

    while (res.hasNext()) {
        var tmp = res.next();

        // Queue the removal of every chunk belonging to this file and
        // execute the queued removes in a single round trip
        var bulk = db.fs.chunks.initializeUnorderedBulkOp();
        bulk.find( {"files_id" : tmp._id} ).remove();
        bulk.execute();

        db.fs.files.remove({ "_id" : tmp._id});
    }
    
  • Add a projection to the fs.files query to only include the fields you need:

    var res = db.fs.files.find(
       // query criteria
       {
           uploadDate: { $gte: new ISODate("2017-04-04") },
    
           // Filenames that end in png
           filename: /\.png$/
       },
    
       // Only include the _id field
       { _id: 1 }
    )
    

    Note: unless you've added a lot of metadata to your GridFS files (or have a lot of files to remove) this may not have a significant impact. The default fs.files documents are ~130 bytes, but the only field you require is _id (a 12 byte ObjectId).
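Putting these suggestions together, here is a sketch of what the full script might look like. It is untested against your data, assumes MongoDB 2.6+ for the Bulk API and that the index above has been created, and additionally batches the fs.files removes into a single $in query rather than removing them one at a time:

    var conn = new Mongo("<url>");
    var db = conn.getDB("<project>");

    // Only fetch the _id of each matching file (filenames ending in .png,
    // uploaded on or after the cutoff date)
    var res = db.fs.files.find(
        {
            uploadDate: { $gte: new ISODate("2017-04-04") },
            filename: /\.png$/
        },
        { _id: 1 }
    );

    // Queue all chunk removals into one unordered bulk operation
    var bulk = db.fs.chunks.initializeUnorderedBulkOp();
    var fileIds = [];

    while (res.hasNext()) {
        var fileId = res.next()._id;
        fileIds.push(fileId);
        bulk.find({ files_id: fileId }).remove();
    }

    if (fileIds.length > 0) {
        // Remove the chunks, then the matching fs.files documents in one call
        bulk.execute();
        db.fs.files.remove({ _id: { $in: fileIds } });
    }

For very large result sets you could combine this with the per-run limit() mentioned earlier so that each execution stays bounded.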
