The project consists of a single table – let's call it Table
for simplicity – with three columns processed by a C program:
ID
is a unique ID and not really important hereSource
includes input texts.Translation
includes modified/translated versions of the contents in Source
.Here's an example with made up contents:
As you can see, the table follows a certain pattern and the goal is to find inconsistencies according to these rules:
Source
and Translation
contain a majority of entries that do not follow this pattern (omitted above).Source
set to ABC
and a different record B with Source
set to Map: ABC
(it is identical to Map:
followed by A's Source
), then Translation
of B must be identical to Karte:
followed by Translation
of A. Or in other words: the Translation
column is supposed to follow the same pattern as Source
.ID_34567
and ID_45678
mismatch, since Translation
for the latter reads Karte: Project B
rather than Karte: Projekt B
(as dictated by Translation
of ID_34567
).rxmatch(rx, text)
returns the portion of text
matching the regular expression or 0
in case of no match. This list can be expanded or modified as needed.So far the implementation first uses a query to identify all Map:
entries:
select ID, rxmatch('(?<=Map: ).*', Source) as ms, rxmatch('(?<=: ).*', Translation) as mt from `Table` where ms != 0 and mt != 0;
A second query runs for every result row and checks for inconsistencies to return them (it selects concatenated fields from a
/b
but I'm omitting these for readability). The parameters used are the three columns returned above (id, matched source portion, matched target portion).
select ... as translation from `Table` as a inner join `Table` as b on a.ID = ? and b.Source = ? and not b.Translation = ?;
While this works perfectly fine, it's not the fastest query and I'm wondering if there's a more elegant way to simplify this and speed it up at the same time.
SELECT ...
FROM MyTable AS A
JOIN MyTable AS B ON 'Map: ' || A.Source = B.Source
AND 'Karte: ' || A.Translation <> B.Translation;
This requires an index on Source
to be efficient (or, even better, a covering index on both Source
and Translation
, if you have the disk space).
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments