Most efficient way to look for these inconsistencies?

Mario

The project consists of a single table – let's call it Table for simplicity – with three columns processed by a C program:

  • ID is a unique ID and not really important here
  • Source includes input texts.
  • Translation includes modified/translated versions of the contents in Source.

Here's an example with made up contents: Example Table

As you can see, the table follows a certain pattern and the goal is to find inconsistencies according to these rules:

  • All IDs are unique and there's no association available connecting related entries with each other.
  • Both Source and Translation contain a majority of entries that do not follow this pattern (omitted above).
  • If there's a record A with Source set to ABC and a different record B with Source set to Map: ABC (it is identical to Map: followed by A's Source), then Translation of B must be identical to Karte: followed by Translation of A. Or in other words: the Translation column is supposed to follow the same pattern as Source.
  • In the example table above, the result of the query should tell you that ID_34567 and ID_45678 mismatch, since Translation for the latter reads Karte: Project B rather than Karte: Projekt B (as dictated by Translation of ID_34567).
  • The query (or queries) are supposed to be implemented in SQLite, hosted in C code (so it doesn't have to be 100% in SQLite only).
  • Available SQLite commands are extended with custom functions for regular expression matching (PCRE2), for example rxmatch(rx, text) returns the portion of text matching the regular expression or 0 in case of no match. This list can be expanded or modified as needed.

So far the implementation first uses a query to identify all Map: entries:

select ID, rxmatch('(?<=Map: ).*', Source) as ms, rxmatch('(?<=: ).*', Translation) as mt from `Table` where ms != 0 and mt != 0;

A second query runs for every result row and checks for inconsistencies to return them (it selects concatenated fields from a/b but I'm omitting these for readability). The parameters used are the three columns returned above (id, matched source portion, matched target portion).

select ... as translation from `Table` as a inner join `Table` as b on a.ID = ? and b.Source = ? and not b.Translation = ?;

While this works perfectly fine, it's not the fastest query and I'm wondering if there's a more elegant way to simplify this and speed it up at the same time.

CL.
SELECT ...
FROM MyTable AS A
JOIN MyTable AS B ON 'Map: '   || A.Source      =  B.Source
                 AND 'Karte: ' || A.Translation <> B.Translation;

This requires an index on Source to be efficient (or, even better, a covering index on both Source and Translation, if you have the disk space).

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

Most efficient way to look for the last digit of a number?

From Dev

Most efficient way to look for value in xml file?

From Dev

Most efficient way to implement this?

From Dev

Most efficient way to execute this

From Dev

most efficient way to look up the same index number in two different lists to compare values

From Dev

Efficient way to look in list of lists?

From Dev

Most efficient way to compute a polynomial

From Dev

query with calculations most efficient way

From Dev

Most efficient way to split sentence

From Dev

Most efficient way to rename an image

From Dev

Most efficient way to output a newline

From Dev

Most efficient way to loop through '...'

From Dev

Most efficient way to determine an intersection

From Dev

Most efficient way to count occurrences?

From Dev

Most efficient way to concatenate Strings

From Dev

Most efficient way to compute a polynomial

From Dev

Most efficient way to store this data

From Dev

Is this the most efficient way to write this method?

From Dev

Most efficient way of MySQL rows?

From Dev

Most efficient way to encrypt files?

From Dev

query with calculations most efficient way

From Dev

Most efficient way to search in a table

From Dev

Most efficient way to write a buffer

From Dev

efficient way to look for all css uses of a color

From Dev

Java: The most efficient way to write pojo with ArrayList

From Dev

Most efficient way to glTexture from NSImage?

From Dev

Most efficient way to check for existence of multiple items

From Dev

Most efficient way to delete from array?

From Dev

Most efficient way to convert this string to DateTime