我正在尝试将国际专利数据库(PATSTAT)的四个大表(35-200百万行)加入到满足某些要求的引用最多的15个专利的前15名中。
第一个表格(t9
)列出了从一组应用程序(家族)到另一组应用程序的引用。另一个表(t1
)基本上将所有内容链接在一起,因为它既包含族和应用程序ID,又包含归档年份表t2
,tls209_appln_ipc
并用于标识appln_id
要包含的。
我最终得出的代码如下:
SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9
LEFT JOIN
(SELECT
t1.appln_id, t1.docdb_family_id from tls201_appln t1
LEFT JOIN tls204_appln_prior t2 on t1.appln_id=t2.appln_id
WHERE
t1.appln_filing_year BETWEEN 2010 AND 2015
AND
t2.appln_id IS NULL
AND
t1.appln_id IN (SELECT distinct appln_id from tls209_appln_ipc where ipc_subclass_symbol in ("A61K", "C07K", "A61P", "Cl2N", "C07D", "Cl2P", "C07H", "C12Q", "C07J"))) t3 ON t9.cited_docdb_family_id=t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15
问题是,在我的会话超时之前,在PATSTAT的基于Web的在线基于网络界面中运行的查询未收敛。有什么方法可以提高此查询的效率吗?
-编辑
-tls_209_appln_ipc
包含1.95亿行,appln_id
加上ipc_subclass_symbol
。appln_id
在此表中可能出现零次或多次。在我询问我只需要docdb_family_id
■如果任其链接appln_id
s的连接到任何的ipc_subclass_symbol
就是我列出的。
在前面的答案的帮助下,给出了所需结果的最终代码:
SELECT t9.cited_docdb_family_id, t99.cited AS cited, t1.appln_id, t1.appln_nr_epodoc
FROM docdb_family_citation t9
INNER JOIN (SELECT cited_docdb_family_id, count(cited_docdb_family_id) as cited FROM docdb_family_citation GROUP BY cited_docdb_family_id) t99
ON t9.cited_docdb_family_id = t99.cited_docdb_family_id
LEFT JOIN
tls201_appln t1
on t9.cited_docdb_family_id = t1.docdb_family_id
WHERE
t1.appln_filing_year BETWEEN 2010 AND 2015 and
exists (select 1 from tls209_appln_ipc t209
where t209.appln_id = t1.appln_id
and t209.ipc_subclass_symbol in ("A61K", "C07K", "A61P", "Cl2N", "C07D", "Cl2P", "C07H", "C12Q", "C07J")
) and
not exists (select 1 from tls204_appln_prior t2
where t1.appln_id = t2.appln_id
)
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;`
请注意,带有子查询的联接t99
用于获取正确的cited
计数
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句