Multi-threading issues with database

CSchulz

I want to parse a file and transfer its content into a database. To speed things up, the file should be parsed in parallel.
I have a main thread which reads the file line by line and creates Runnables, which are handed to a ThreadPoolExecutor. Each Runnable gets its own Session.

Each line contains a unique identifier for a client, so repetition is possible. The system tries to find the client in the database by that identifier.
If it can't be found, one of the threads wanting the same client needs to create it. I have a "join" point here, where the other threads have to wait for the thread which is allowed to create the client.

// s is the Hibernate Session owned by this Runnable
c = (Client) s.get("Client", identifier);
if (c == null) {
    // Check whether another thread is already creating this client
    CountDownLatch lock = isClientResolutionActive(identifier);

    if (lock != null) {
        // Wait until the creating thread has saved the client, then re-read it
        lock.await();
        LOGGER.info("Lock was released ... " + identifier);
        c = (Client) s.get("Client", identifier);
    }
}

if (c == null) {
    // This thread won the race and is allowed to create the client
    c = createClient(...);

    s.save(c);
    s.flush();
    removeClientResolutionActive(identifier);
}

To synchronize the threads I have created two methods in the caller class: one checks whether someone is already creating the client and returns the shared lock object, and the other removes the entry from the map and notifies all waiting threads.

I have searched the internet a lot and tried to find my problem or a similar issue, without success.
Furthermore I am not sure which concurrency primitive I should use. After some research I decided on a CountDownLatch. It is initialized with 1, because only one thread should create the client. (Perhaps something other than a CountDownLatch would be better, but I don't know what.)

The two methods described above synchronize on a map which maps the client identifier to its CountDownLatch instance.

private CountDownLatch isClientResolutionActive(String identifier) {
    synchronized (activeSearches) {
        if (activeSearches.containsKey(identifier)) {
            // Only create the CountDownLatch if there are multiple threads for
            // that identifier
            if (activeSearches.get(identifier) == null) {
                activeSearches.put(identifier, new CountDownLatch(1));
            }
            return activeSearches.get(identifier);
        } else {
            LOGGER.info("Locked " + identifier);
            activeSearches.put(identifier, null);
            return null;
        }
    }
}

private void removeClientResolutionActive(String identifier) {
    synchronized (activeSearches) {
        CountDownLatch cl = activeSearches.get(identifier);
        activeSearches.remove(identifier);
        if (cl != null) {
            LOGGER.info("Unlock " + identifier);
            cl.countDown();
        }
    }
}

In general it works fine, but sometimes I encounter a problem: when the latch is released and the map entry removed while another thread is still queued on the synchronized block to check whether a search is already in progress, that thread no longer finds the entry and creates the client a second time.

18:02:55,611 [pool-1-thread-2] INFO LogImporter Unlock b42fcae346fbb2b1e3c544fb816de2c5
18:02:55,611 [pool-1-thread-3] INFO LogImporter Locked b42fcae346fbb2b1e3c544fb816de2c5
18:02:55,611 [pool-1-thread-4] INFO LogImporter Lock was released ... b42fcae346fbb2b1e3c544fb816de2c5

I think I have to improve the synchronization, but I have no idea how.

One idea would be to move the client search into the synchronized block, or to check against the database again before locking.
Perhaps create a cache or map which holds all clients already known from the database (rough sketch below).
Or use only one Session for the whole lifecycle of the app?
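
A very rough sketch of the cache idea, in case Java 8 is available (the helper names lookupInDatabase/createAndPersist are only placeholders for the session calls shown above, not code I already have):

    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: computeIfAbsent runs the mapping function at most once per
    // identifier, so only one thread ever creates a given client; all other
    // threads for the same identifier simply reuse the cached instance.
    public abstract class ClientCache {

        private final ConcurrentHashMap<String, Client> knownClients = new ConcurrentHashMap<>();

        public Client findOrCreate(String identifier) {
            return knownClients.computeIfAbsent(identifier, id -> {
                Client existing = lookupInDatabase(id);                    // the s.get(...) part
                return existing != null ? existing : createAndPersist(id); // createClient + save + flush
            });
        }

        // Placeholder for the session.get("Client", identifier) lookup.
        protected abstract Client lookupInDatabase(String identifier);

        // Placeholder for creating, saving and flushing a new client.
        protected abstract Client createAndPersist(String identifier);
    }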

Thanks in advance for any advice and hints.

Quicker

Parsing the same file in parallel threads does NOT add speed but just costs extra resources.

A less problematic and more efficient text2db optimisation consists of:

  • bulk read the file (rather than reading line by line, read 1 MB at once, process it, then read the next MB); see the Java sketch after this list
  • bulk insert into the database - in MySQL it looks like this:

    insert into urtable 
    values
    ('val1','val2'),
    ('val1','val2'); 
    

(example stolen from http://bytes.com/topic/sql-server/answers/585793-insert-into-using-select-values-inserting-multiple-rows - sorry for being too lazy to make one up by myself)

  • try to prevent SQL back and forth (meaning: if select output is required from the database to enrich your dataset, read it upfront instead of on and on while walking through the file)
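
A minimal Java/JDBC sketch of the bulk-read-plus-batch-insert combination (the JDBC URL, the column names of urtable and the field delimiter are made up; rewriteBatchedStatements is a MySQL Connector/J option that collapses the batch into multi-row inserts):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Sketch only: the BufferedReader pulls roughly 1 MB from disk per read,
    // and rows are sent to MySQL in batches instead of one INSERT per line.
    public class BulkImport {

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                         "jdbc:mysql://localhost/mydb?rewriteBatchedStatements=true", "user", "pw");
                 BufferedReader in = new BufferedReader(new FileReader(args[0]), 1024 * 1024);
                 PreparedStatement ins = con.prepareStatement(
                         "insert into urtable (val1, val2) values (?, ?)")) {

                con.setAutoCommit(false);
                int batched = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split(";");   // assumed delimiter
                    ins.setString(1, cols[0]);
                    ins.setString(2, cols[1]);
                    ins.addBatch();
                    if (++batched % 1000 == 0) {       // flush every 1000 rows
                        ins.executeBatch();
                    }
                }
                ins.executeBatch();                    // remaining rows
                con.commit();
            }
        }
    }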

UPDATE ----

From the comments I took it that there might be a need to get data from the database while parsing the file. Well, if you have to, you have to. BUT: try not to have to.

First of all: reading specific data upfront can be seen as caching or not. In a narrow sense, caching just means moving disk data to memory by some heuristic (without knowing what is actually going on). I personally try to avoid that, because heuristics can play against you. In a wider sense, caching is what I described before PLUS putting data from disk into memory that you can pinpoint (e.g. by ID or some filter criteria). So I still do not like the narrow kind, but I do like selecting well-defined data upfront.
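
For illustration, 'selecting well defined data upfront' could look like this (table and column names are invented); the parser then works against the in-memory map instead of going back to the database per line:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: loads the lookup data once before parsing starts.
    public final class ClientPreload {

        public static Map<String, Long> loadClientIds(Connection con) throws SQLException {
            Map<String, Long> idsByIdentifier = new HashMap<>();
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("select identifier, id from client")) {
                while (rs.next()) {
                    idsByIdentifier.put(rs.getString("identifier"), rs.getLong("id"));
                }
            }
            return idsByIdentifier;
        }
    }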

Secondly: my personal experience goes like this: IF you are working on a fully normalized data model, database read operations during file parsing very often cook down to 'give me the primary key(s) of what I dumped into the database before'. This appears to become tricky when you write multiple rows at once. However, especially in MySQL, you can definitely rely on each insert statement (even a multiple-row insert) being atomic; you get the ID of the first inserted row from last_insert_id() and can track it down to all your previously written records. I am pretty sure there are similar 'cook downs' for other database systems.
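
From Java, one way to pick those IDs back up (sketch only; it assumes urtable has an auto_increment primary key plus two named value columns, which is more than my example above shows) is to ask JDBC for the generated keys of the multi-row insert:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch only: after a multi-row insert, getGeneratedKeys() hands back the
    // auto_increment IDs, saving a separate SELECT last_insert_id() round trip.
    public class InsertAndCollectIds {

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                         "jdbc:mysql://localhost/mydb", "user", "pw");
                 Statement st = con.createStatement()) {

                st.executeUpdate(
                        "insert into urtable (val1, val2) values ('a','b'),('c','d')",
                        Statement.RETURN_GENERATED_KEYS);

                try (ResultSet keys = st.getGeneratedKeys()) {
                    while (keys.next()) {
                        System.out.println("generated id: " + keys.getLong(1));
                    }
                }
            }
        }
    }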

Thirdly: parsing LARGE files is something I would try to run as a job, with only ONE technical user triggering it and a guarantee that no more than one of these processes runs in parallel. Otherwise you need to work around all sorts of issues, starting with file locking and going on to session permission read/write management. Running it as a job also classifies it (at least in my personal policies) as allowed to allocate LOTS of RAM, depending on costs and how important speed is. That means I would not even think twice about loading a 100 K row keyword-to-id table into memory upfront.
