I have the following code that currently works. It goes through and finds every file that's newer than a specified date and that matches a regex, then deletes it, as well as the chunks that are pointing to it.
conn = new Mongo("<url>");
db = conn.getDB("<project>");
res = db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/});
while (res.hasNext()) {
    var tmp = res.next();
    db.getCollection('fs.chunks').remove({"files_id" : tmp._id});
    db.fs.files.remove({ "_id" : tmp._id});
}
It's extremely slow, and most of the time, the client I'm running it from just times out.
Also, I know that I'm deleting files from the filesystem, and not from the normal collections. It's a long story, but the code above does exactly what I want it to do.
How can I get this to run faster? It was pointed out to me earlier that I'm running this code on a client, but is it possible to run it on the server side? Earlier I was trying to use the JavaScript driver, which is probably why. I assume using the Mongo shell executes everything on the server.
Any help would be appreciated. So close but so far...
I know that I'm deleting files from the filesystem, and not from the normal collections
GridFS is a specification for storing binary data in MongoDB, so you are actually deleting documents from MongoDB collections rather than files from the filesystem.
It was pointed out to me earlier that I'm running this code on a client, but it's possible to run it on the server side?
The majority of your code (queries and commands) is being executed by your MongoDB server. The client (the mongo shell, in this case) isn't doing any significant processing.
It's extremely slow, and most of the time, the client I'm running it from just times out.
You need to investigate where the time is being spent.
If there is problematic network latency between your mongo shell and your deployment, you could consider running the query from a mongo shell session closer to the deployment (if possible) or using query criteria that match a smaller range of documents.
Another obvious candidate to look into would be server resources. For example, is deleting a large number of documents putting pressure on your I/O or RAM? Reducing the number of documents you delete in each script run may also help in this case.
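One sketch of the "fewer documents per run" approach is to put a limit() on the cursor and simply re-run the script until nothing matches. (The batch size of 500 below is an arbitrary placeholder; tune it to your environment.)

```javascript
// Process at most 500 matching files per run; re-run the script
// until the cursor comes back empty. 500 is an arbitrary choice.
var res = db.fs.files.find({
    uploadDate: { $gte: new ISODate("2017-04-04") },
    filename: /\.png$/
}).limit(500);

while (res.hasNext()) {
    var tmp = res.next();
    db.fs.chunks.remove({ files_id: tmp._id });
    db.fs.files.remove({ _id: tmp._id });
}
```

Each run does a bounded amount of work, which keeps pressure on I/O and RAM (and the chance of a client timeout) down.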
db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/})
This query likely isn't doing what you intended: the filename condition is being provided as the second option to find() (so it is used for projection rather than as search criteria), and the regex matches a filename containing png anywhere (for example: typng.doc).
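The difference in the regex alone is easy to demonstrate: anchoring the pattern with `\.png$` matches only filenames that end in ".png", whereas the original unanchored pattern matches "png" appearing anywhere in the string. (Both conditions also need to go in the first argument to find(), not the projection.)

```javascript
// Anchored pattern: only matches filenames ending in ".png".
var pngPattern = /\.png$/;

// The original unanchored pattern matches "png" anywhere.
var loosePattern = /.*(png)/;

console.log(pngPattern.test("photo.png"));   // true
console.log(pngPattern.test("typng.doc"));   // false
console.log(loosePattern.test("typng.doc")); // true -- probably not intended
```

In the query itself this becomes `db.fs.files.find({ uploadDate: { $gte: new ISODate("2017-04-04") }, filename: /\.png$/ })`.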
I assume using the Mongo shell executes everythin on the server.
That's an incorrect general assumption. The mongo shell can evaluate local functions, so depending on your code there may be aspects that are executed/evaluated in a client context rather than a server context. Your example code runs queries/commands which are processed on the server, but the fs.files documents returned from your find() query are being accessed in the mongo shell in order to construct the query that removes related documents in fs.chunks.
How can I get this to run faster?
In addition to comments noted above, there are a few code changes you can make to improve efficiency. In particular, you are currently removing chunk documents individually. There is a Bulk API in MongoDB 2.6+ which will reduce the round trips required per batch of deletes.
Some additional suggestions to try to improve the speed:
Add an index on { uploadDate: 1, filename: 1 } to support your find() query:
db.fs.files.createIndex({uploadDate:1, filename: 1})
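You can then sanity-check that the index is actually being used by inspecting the query plan (the date here is a placeholder matching the example above):

```javascript
// The winning plan should contain an IXSCAN stage on the new index
// rather than a COLLSCAN over the whole collection.
db.fs.files.find({
    uploadDate: { $gte: new ISODate("2017-04-04") },
    filename: /\.png$/
}).explain("executionStats");
```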
Use the Bulk API to remove matching chunk documents rather than individual removes:
while (res.hasNext()) {
    var tmp = res.next();
    var bulk = db.fs.chunks.initializeUnorderedBulkOp();
    bulk.find( {"files_id" : tmp._id} ).remove();
    bulk.execute();
    db.fs.files.remove({ "_id" : tmp._id});
}
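If your deployment is MongoDB 3.2 or newer, an alternative sketch (assuming the matching _ids comfortably fit in memory) is to collect the file ids first and then delete chunks and files with one deleteMany() per batch of ids, rather than one round trip per file. The toBatches() helper and the batch size of 1000 are illustrative choices, not part of any API:

```javascript
// Split an array into fixed-size batches so a single $in clause
// doesn't grow unboundedly; 1000 ids per delete is arbitrary.
function toBatches(items, size) {
    var batches = [];
    for (var i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// In the mongo shell (requires a live deployment):
// var ids = db.fs.files.find(
//     { uploadDate: { $gte: new ISODate("2017-04-04") }, filename: /\.png$/ },
//     { _id: 1 }
// ).toArray().map(function (doc) { return doc._id; });
//
// toBatches(ids, 1000).forEach(function (batch) {
//     db.fs.chunks.deleteMany({ files_id: { $in: batch } });
//     db.fs.files.deleteMany({ _id: { $in: batch } });
// });
```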
Add a projection to the fs.files query to only include the fields you need:
var res = db.fs.files.find(
    // Query criteria
    {
        uploadDate: { $gte: new ISODate("2017-04-04") },
        // Filenames that end in png
        filename: /\.png$/
    },
    // Only include the _id field
    { _id: 1 }
)
Note: unless you've added a lot of metadata to your GridFS files (or have a lot of files to remove) this may not have a significant impact. The default fs.files documents are ~130 bytes, but the only field you require is _id (a 12-byte ObjectId).
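Putting the suggestions together, a full pass might look like the following sketch (untested against a live deployment; the date is a placeholder from the question):

```javascript
// One-time: index to support the range + regex query.
db.fs.files.createIndex({ uploadDate: 1, filename: 1 });

var res = db.fs.files.find(
    { uploadDate: { $gte: new ISODate("2017-04-04") }, filename: /\.png$/ },
    { _id: 1 }  // projection: only _id is needed
);

while (res.hasNext()) {
    var tmp = res.next();
    // Remove all chunks for this file in one bulk round trip.
    var bulk = db.fs.chunks.initializeUnorderedBulkOp();
    bulk.find({ files_id: tmp._id }).remove();
    bulk.execute();
    db.fs.files.remove({ _id: tmp._id });
}
```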