Most efficient way to look for these inconsistencies?

Mario

The project consists of a single table – let's call it Table for simplicity – with three columns processed by a C program:

ID is a unique identifier and not really important here.
Source contains the input texts.
Translation contains modified/translated versions of the contents in Source.

Here's an example with made-up contents: Example Table

As you can see, the table follows a certain pattern and the goal is to find inconsistencies according to these rules:

All IDs are unique and there's no association available connecting related entries with each other.
Both Source and Translation contain a majority of entries that do not follow this pattern (omitted above).
If there's a record A with Source set to ABC and a different record B with Source set to Map: ABC (that is, Map: followed by A's Source), then B's Translation must be identical to Karte: followed by A's Translation. In other words, the Translation column is supposed to follow the same pattern as Source.
In the example table above, the result of the query should tell you that ID_34567 and ID_45678 mismatch, since Translation for the latter reads Karte: Project B rather than Karte: Projekt B (as dictated by Translation of ID_34567).
The query (or queries) should be implemented in SQLite, hosted in C code (so it doesn't have to be done entirely in SQL).
The available SQLite functions are extended with custom functions for regular expression matching (PCRE2); for example, rxmatch(rx, text) returns the portion of text matching the regular expression, or 0 if there is no match. This set of custom functions can be expanded or modified as needed.
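For readers who want to experiment outside the original C host, a custom function along the lines of rxmatch could be sketched roughly as follows. This is only an illustration under assumptions: Python's sqlite3 module stands in for the C host, and Python's re module stands in for PCRE2, so pattern syntax may differ in places. Only the name rxmatch and its return convention (matched text, or 0 on no match) are taken from the description above.

```python
import re
import sqlite3


def rxmatch(rx, text):
    """Return the part of `text` matched by `rx`, or 0 if nothing matches.

    Stand-in for the PCRE2-backed function described in the question
    (assumption: Python's `re` is used here instead of PCRE2).
    """
    if text is None:
        return 0
    m = re.search(rx, text)
    return m.group(0) if m else 0


conn = sqlite3.connect(":memory:")
# Register the Python callable as a two-argument SQL function.
conn.create_function("rxmatch", 2, rxmatch)

# One input that carries the "Map: " prefix and one that does not.
hit = conn.execute(
    "SELECT rxmatch('(?<=Map: ).*', 'Map: Project B')"
).fetchone()[0]
miss = conn.execute(
    "SELECT rxmatch('(?<=Map: ).*', 'Project B')"
).fetchone()[0]
print(hit, miss)
```

In a C host, the corresponding registration would go through sqlite3_create_function.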

So far the implementation first uses a query to identify all Map: entries:

select ID, rxmatch('(?<=Map: ).*', Source) as ms, rxmatch('(?<=: ).*', Translation) as mt from `Table` where ms != 0 and mt != 0;

A second query runs for every result row and checks for inconsistencies to return them (it selects concatenated fields from a/b but I'm omitting these for readability). The parameters used are the three columns returned above (id, matched source portion, matched target portion).

select ... as translation from `Table` as a inner join `Table` as b on a.ID = ? and b.Source = ? and not b.Translation = ?;

While this works perfectly fine, it's not the fastest query and I'm wondering if there's a more elegant way to simplify this and speed it up at the same time.

CL.

SELECT ...
FROM MyTable AS A
JOIN MyTable AS B ON 'Map: '   || A.Source      =  B.Source
                 AND 'Karte: ' || A.Translation <> B.Translation;

This requires an index on Source to be efficient (or, even better, a covering index on both Source and Translation, if you have the disk space).
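To see the join and the suggested index working together, here is a small self-contained sketch. It assumes Python's sqlite3 module as a test harness in place of the C host. The rows for ID_34567 and ID_45678 are reconstructed from the IDs and translations mentioned in the question (the Source values and the Translation of ID_34567 are guesses). The ID_1/ID_2 pair, the selected columns, and the index name idx_source_translation are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (ID TEXT PRIMARY KEY, Source TEXT, Translation TEXT)"
)
# Hypothetical sample rows: one consistent pair and one mismatching pair.
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [
        ("ID_1", "Project A", "Projekt A"),
        ("ID_2", "Map: Project A", "Karte: Projekt A"),      # consistent
        ("ID_34567", "Project B", "Projekt B"),
        ("ID_45678", "Map: Project B", "Karte: Project B"),  # mismatch
    ],
)
# Index on both columns, as suggested in the answer (name is an assumption).
conn.execute(
    "CREATE INDEX idx_source_translation ON MyTable (Source, Translation)"
)

# The selected columns are an assumption; the answer leaves them open.
query = """
SELECT A.ID, B.ID, B.Translation
FROM MyTable AS A
JOIN MyTable AS B ON 'Map: '   || A.Source      =  B.Source
                 AND 'Karte: ' || A.Translation <> B.Translation
"""
mismatches = conn.execute(query).fetchall()
print(mismatches)

# Print the query plan to check whether the index is picked up.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row[-1])
```

The same SQL text could be handed to sqlite3_prepare_v2 in the C host. The exact wording of the query plan depends on the SQLite version.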

Edited at 2021-08-13
