Neo4j model / query performance / configuration advice needed for optimal performance

DK5

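I am working on a graph database experiment to test whether we can shift our relational data from T-SQL to a graph database (using Neo4j). We are looking at processing large volumes of data that could benefit from being queried as a graph structure. At the moment, query performance is very poor even with a few simple WHERE clauses and aggregation steps. Since Neo4j claims to work with billions of nodes, I would appreciate advice on how to improve performance. Here is everything we have tried.

First, the data. We have customer data about the countries (geographies) and products (SKUs) that customers visited or purchased online. Each time a customer visits the website, their views and purchases are tracked as part of a unique session ID, which changes after 30 minutes. We are trying to accurately count how many times users visited the website by counting distinct session IDs.

We have about 26 million rows of data on the visits and purchases customers made when they came to the website. The data in SQL has the following format: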

---------------------------------------------------------------------------
|    Date|   SessionId|   Geography|   SKU|   OrderId|    Revenue|   Units|
|--------|------------|------------|------|----------|-----------|--------|
|20160101|         111|         USA|     A|      null|          0|       0|
|20160101|         111|         USA|     B|         1|         50|       1|
|20160101|         222|          UK|     A|         2|         10|       1|
---------------------------------------------------------------------------

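Problem: we need to accurately count how many times customers visited the site. A visit is counted as a distinct session ID.

Visit calculation logic: in the sample above, if we look at visits in which people viewed the SKU named "A", the answer is 2: the first view in session 111 and the second in session 222. If we want the number of visits in which users viewed SKU "A" or "B", the answer is also 2. Both products were viewed in session 111, but that is still only one visit: session 111 has two product views and one visit. Counting the other visit from session 222, we again get 2 visits in total.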
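As a quick sanity check of this counting rule, the three sample rows can be fed to Cypher as literal maps, with no stored data involved. The map keys simply mirror the column names above; this is only a sketch of the logic, not the production query.

```cypher
// The three sample rows from the table, as literal maps.
UNWIND [
  {SessionId: 111, Geography: "USA", SKU: "A"},
  {SessionId: 111, Geography: "USA", SKU: "B"},
  {SessionId: 222, Geography: "UK",  SKU: "A"}
] AS row
// Keep only rows for the SKUs of interest.
WITH row
WHERE row.SKU IN ["A", "B"]
// A visit is a distinct session, however many rows it produced.
RETURN count(DISTINCT row.SessionId) AS visits;
```

With the sample rows this returns 2: the two product views in session 111 collapse into a single visit, and session 222 adds the second.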

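Here is the model we built. There is one Facts node for each row in the data. We created about 400 distinct Geography nodes and 4000 distinct product nodes, and each of these has relationships to multiple Facts. Likewise, there are distinct nodes for Dates.

We also created distinct nodes for session IDs and order IDs, both of which point to Facts. So, basically, we have distinct nodes with the following properties: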

1) Geography  {Locale, Country}
2) SKU {SKU, ProductName}
3) Date {Date}
4) Sessions {SessionIds}
5) Orders {OrderIds}
6) Facts {Locale, Country, SKU, ProductName, Date, SessionIds, OrderIds}

The relationship schema is based on matching property values and looks like this:

(:Geography)-[:FactGeo]->(:Facts)
(:SKU)-[:FactSKU]->(:Facts)
(:Date)-[:FactDate]->(:Facts)
(:Sessions)-[:FactSessions]->(:Facts)
(:Orders)-[:FactOrders]->(:Facts)

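A snapshot of the schema was attached here (image not preserved).

I believe some of you have mentioned that missing indexes could be causing the problem, but I have all the indexes I need. I don't think adding extra indexes that I rarely query would degrade performance that significantly.

There are a total of 44 million nodes, most of them Facts and SessionId nodes, and 131 million relationships.

When I run a query to find the distinct visits for people from about 20 countries and about 20 products, it takes about 44 seconds to get an answer. For the same case, SQL takes about 47 seconds without indexes, while Neo4j has indexes created. Since I expect that creating indexes in SQL would improve its performance, this is not the exceptional improvement I was hoping to get from Neo4j.

The query I wrote was like this: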

MATCH (geo: Geography)-[:FactGeo]->(fct: Facts)<-[:FactSKU]-(sku: SKU)
WHERE geo.Country IN ["US", "India", "UK"...]
AND sku.SKU IN ["A","B","C".....]
MATCH (ssn: Sessions)-[:FactSessions]->(fct)
RETURN COUNT(DISTINCT ssn.SessionId);

When I use PROFILE, this results in approx 69M db hits.

Q1) Is there a way I can improve this model to get a better-performing query? For example, I can change the above model by removing the Session nodes and just counting the SessionIds present on the Facts nodes, as in the query below:

MATCH (geo: Geography)-[:FactGeo]->(fct: Facts)<-[:FactSKU]-(sku: SKU)
WHERE geo.Country IN ["US", "India", "UK"...]
AND sku.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT fct.SessionId);

The first query is expensive because of the huge number of nodes and relationships between Facts and Sessions, so it seems I would benefit from having SessionId as a property of the Facts nodes.

When I use PROFILE, this results in approx 50M db hits.

Also, can someone help me understand the tipping point at which it becomes difficult to scan nodes on the basis of properties as I increase the number of properties the nodes have?

Q2) Is there something wrong with my Neo4j configuration, given that the query takes 44 seconds? I have 114GB of RAM for the Java heap, but no SSD. I have not tweaked other configuration settings and would like to know whether those could be the bottleneck here, as I was told that Neo4j can run on billions of nodes.

My machine's total RAM: 140GB
RAM dedicated to Java heap: 114GB (from what I recollect, there was almost negligible performance increase as I moved from 64GB to 114GB)
Page cache size: 4GB
Approximate GraphDB size: 45GB
Neo4j version I am using: 3.0.4 Enterprise Edition

Q3) Is there a better way to formulate the query so that it performs better? I tried the following query:

MATCH (geo: Geography)-[:FactGeo]->(fct: Facts)
WHERE geo.Country IN ["US", "India", "UK"...]
MATCH (sku: SKU)-[:FactSKU]->(fct)
WHERE sku.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT fct.SessionId);

But it gives about the same performance and records the same number of db hits as the slightly improved query in Q1.

When I use PROFILE, this results in approx 50M db hits, exactly the same as the query in Q1.

Q4) If I modify my query from Q3 as below, instead of seeing an improvement I see a major decrease in performance:

MATCH (geo: Geography)
WHERE geo.Country IN ["US", "India", "UK"...]
WITH geo
MATCH (sku: SKU)
WHERE sku.SKU IN ["A","B","C".....]
WITH geo, sku
MATCH (geo)-[:FactGeo]->(fct: Facts)<-[:FactSKU]-(sku)
RETURN COUNT(DISTINCT fct.SessionId);

This appears to create a cross join between the 400 Geography nodes and the 4000 SKU nodes, and then test whether a relationship exists for each of those 1,600,000 possible combinations. Am I understanding this correctly?

I know these are long questions and a very long post. But I have tried tirelessly for more than a week to work these things out on my own, and I have shared some of my findings here. Hopefully the community will be able to guide me on some of these questions. Thanks in advance for even reading the post!

EDIT-01: Tore, Inverse and Frank, thanks a lot for trying to help me out, guys. I hope we can figure out the root cause here.

A) I have added some more details regarding my PROFILE results, my schema, and my machine/Neo4j config stats.

B) I am considering the model that @InverseFalcon suggested, keeping in mind the points about relationships being a better choice and about limiting the number of relationships.

I am tweaking Inverse's model a bit because I think we might be able to reduce it. How is this as a model:
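For intuition about how row counts multiply when two patterns with no connecting relationship are matched one after the other, here is a small sketch that uses plain number ranges as stand-ins for the nodes. It takes the upper bound of 400 and 4000 from the question. With the WHERE filters applied first, as in the Q4 query, the intermediate row count would instead be the product of however many nodes survive each filter.

```cypher
// Stand-ins for nodes: 400 "geographies" and 4000 "SKUs".
UNWIND range(1, 400) AS geo
UNWIND range(1, 4000) AS sku
// Every geo is paired with every sku before any relationship is checked.
RETURN count(*) AS pairs;
```

This returns 1600000, the number of combinations mentioned above.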

(:Session)-[:ON]->(:Date)
(:Session)-[:IN]->(:Geography)
(:Session)-[:Viewed]->(:SKU)
(:Session)-[:Bought]->(:SKU)

(:Session)-[:ON]->(:Date)
(:Session)-[:IN {SKU: "A", HasViewedOrBought: 1}]->(:Geography)

Now both of the models can have advantages. In the first one I maintain SKUs as distinct nodes and have different relationships between them to determine if it was a purchase or a view.

In the second one, I completely remove the SKU nodes and add the SKUs as relationship properties. I understand that this will lead to many relationships, but the total should still be small, since we also discard the SKU nodes and all of their relationships. We will have to test the relationships by comparing SKU strings, which is an intensive operation, and perhaps that could be avoided by keeping only Session and Geography nodes, removing the Date nodes, and adding a Date property to the SKU relationships, as below:

(:Session)-[:ON]->(:Date)
(:Session)-[:IN {Date: "2016-01-01", SKU: "A", HasViewedOrBought: 1}]->(:Geography)

But then I would be testing the relationships between the Geography and SKU nodes on the basis of two properties, both of which are strings. (Arguably the date can be converted to an integer, but I still see a face-off between alternative models.)

C) @Tore, thanks for explaining and confirming my understanding of Q4. But if the GraphDB does a calculation like that, in which it joins and then compares every relationship against that join, isn't it actually working in the same manner as an RDBMS? It is not making use of the graph traversals that it should easily be able to do by finding direct paths between the two sets of Geography and Product nodes. This seems like a bad implementation to me.

InverseFalcon

It seems to me that you're trying to do both graph modeling and RDBMS modeling at the same time, and that at least is adding an additional traversal step in your queries.

While I can't say this will result in a major performance improvement, I would consider removing your :Fact nodes, as they contain redundant information that is already captured in your graph. (assuming that session IDs aren't ever reused)

It's just a matter of wiring up your nodes without a :Fact as the central one tying them together. Sessions and Orders are likely going to be your primary nodes.

So your relationships between your nodes might look like this:

(:Session)-[:From]->(:Geography)
(:Session)-[:Visited]->(:Product)
(:Session)-[:On]->(:Date)
(:Session)-[:Ordered]->(:Order)
(:Order)-[:Of]->(:Product)

We're assuming that since the session time window is small enough, that we can count a session's date as the same as the order or visitation date from that session. If we need something more specific, we can add a relationship between an :Order and a :Date, and add a date property to a :Visited relationship (assuming we don't want to add a :Visit node as an intermediary between a session and a Product).
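As a rough sketch of how the three sample rows from the question might be loaded into this shape, the statement below feeds them in as literal maps. The property names (SessionId, Country, SKU, Date, OrderId) are assumptions carried over from the question, and a real import of 26 million rows would need batching rather than a single statement.

```cypher
UNWIND [
  {Date: 20160101, SessionId: 111, Geography: "USA", SKU: "A", OrderId: null},
  {Date: 20160101, SessionId: 111, Geography: "USA", SKU: "B", OrderId: 1},
  {Date: 20160101, SessionId: 222, Geography: "UK",  SKU: "A", OrderId: 2}
] AS row
// MERGE reuses a node if it already exists, so each session, country,
// product and date ends up as exactly one node.
MERGE (ssn:Session {SessionId: row.SessionId})
MERGE (geo:Geography {Country: row.Geography})
MERGE (sku:Product {SKU: row.SKU})
MERGE (day:Date {Date: row.Date})
MERGE (ssn)-[:From]->(geo)
MERGE (ssn)-[:Visited]->(sku)
MERGE (ssn)-[:On]->(day)
// Rows without an OrderId are plain views: the CASE yields an empty list,
// so the FOREACH body is skipped and no :Order node is created.
FOREACH (ignored IN CASE WHEN row.OrderId IS NULL THEN [] ELSE [1] END |
  MERGE (ord:Order {OrderId: row.OrderId})
  MERGE (ssn)-[:Ordered]->(ord)
  MERGE (ord)-[:Of]->(sku)
);
```

Note that no :Fact node appears anywhere: session 111 becomes a single :Session node with two :Visited relationships, which is what makes counting visits a matter of counting distinct :Session nodes.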

This changes your query to something like:

MATCH (geo:Geography)<-[:From]-(ssn:Session)-[:Ordered]->(:Order)-[:Of]->(sku:Product)
WHERE geo.Country IN ["US", "India", "UK"...]
AND sku.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT ssn);

I'm assuming that :Sessions are unique, with a unique SessionId property, so there should be no need to get the distinct property itself, just use the node.

As Tore noted, indexes and unique constraints are critical here, especially with the size of your data set. Geography.Country, Session.SessionId, Product.SKU, and Order.OrderId should probably all have unique constraints.
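For reference, a sketch of what those constraints could look like in the Neo4j 3.0-era syntax that matches the 3.0.4 install mentioned in the question (newer releases use a different constraint syntax). The labels and property names follow the model suggested above and are assumptions about the final schema.

```cypher
// Run each statement separately; a unique constraint also creates an index.
CREATE CONSTRAINT ON (ssn:Session) ASSERT ssn.SessionId IS UNIQUE;
CREATE CONSTRAINT ON (sku:Product) ASSERT sku.SKU IS UNIQUE;
CREATE CONSTRAINT ON (ord:Order) ASSERT ord.OrderId IS UNIQUE;
// Assumes one :Geography node per country. If several locales share a
// country, a plain index is the safer choice:
//   CREATE INDEX ON :Geography(Country);
CREATE CONSTRAINT ON (geo:Geography) ASSERT geo.Country IS UNIQUE;
```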

Use PROFILE to see where your queries may be running into problems.

And all that said, your use case here probably will not see a significant improvement over an RDBMS, as this kind of data both models well and queries well in a relational db. Are there any questions of your data that you are either unable to get or unable to get quickly in your current db?

EDIT

In response to your edit, it is also helpful to expand (show more detail on) the operations in your PROFILE so you can see not just the operation and the db hits, but also what aspect of your query the operation concerns.

Based upon what we find when we expand those operations, we're likely to see an opportunity to improve the query performance, as I'm guessing there's a massive difference in numbers between those who have bought the products in question, and the total sessions within a country.

One possible area we could improve on is suggesting which index to use in the query, as traversing from the product to the sessions of users who bought it, then to the countries associated with the sessions, ought to be more performant than trying to match from sessions of all users from the given countries.
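A sketch of what such a hint might look like against the model suggested earlier. It assumes an index or unique constraint already exists on :Product(SKU) (the query fails without one), and the literal lists are shortened, illustrative values. Whether the hint actually helps is something only PROFILE on the real data can show.

```cypher
PROFILE
MATCH (sku:Product)<-[:Of]-(:Order)<-[:Ordered]-(ssn:Session)-[:From]->(geo:Geography)
// Ask the planner to start from the product index rather than from countries.
USING INDEX sku:Product(SKU)
WHERE sku.SKU IN ["A", "B", "C"]
  AND geo.Country IN ["US", "India", "UK"]
RETURN count(DISTINCT ssn) AS visits;
```

Comparing the db hits of this plan with the unhinted one should show whether starting from the products is really the cheaper direction.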

It's important to note that the benefits of Neo4j shine when you are querying across smaller subgraphs of data, and not the whole data set or huge chunks of it. The subgraphs you are looking at in your example queries are still quite large, looking at purchase histories of users across entire countries. Those kinds of queries are best done with an RDBMS, and at that scale you are doing millions of graph traversals, which aren't trivial. To find connections between Geography and Product nodes, it still must perform those traversals and use set operations to filter only those that connect. I would imagine, though, when asking queries about data at this scale (products bought by users across many different countries), that this is more of an analysis operation, not an operation servicing a user in realtime, so I'm wondering if performance concerns are critical for these kinds of queries.

You would start to see performance improvements as your queried subgraph shrinks. You may start to see this if your queries narrowed down the countries queried.

Even better if you're asking about individual user purchase history, as then your queried subgraph is local to a user. But then you're modeling that perfectly fine in an RDBMS already, since your row of all the data you need is in a single table.

Remember that Neo4j's strength is in doing traversals vs joins, but in your current RDBMS data model you aren't doing any joins; everything you need is in indexed rows. It seems to me that yours is a use case where the queries you plan to use span huge subgraphs, and the data model is actually more complicated in a graph than it is in an RDBMS, and you're not getting much out of that added complexity with the queries you've provided.

When you're considering graph databases, what should really drive your decision is the queries you plan on making on it, and in general what questions you might ask of the relationships in that data, and if those questions are hard to answer in your current db. It seems to me that if your example queries are representative of those you plan to make regularly, your current db handles this just fine. If you were asking questions that were harder to answer in your current solution, and were more relationship-based (such as product suggestions for a user based upon what other users who have bought or viewed those products have bought), a graph db solution would make more sense, and could either be used for realtime queries, or have query results cached and renewed periodically.

PERFORMANCE IMPROVEMENT EDIT

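Since you have these :Facts nodes, it seems to me you don't actually need to do many traversals. That is essentially the same as what an RDBMS does, though, so this kind of query should get you RDBMS-like performance.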

MATCH (sku: SKU)-[:FactSKU]->(fct: Facts)
WHERE sku.SKU IN ["A","B","C".....]
AND fct.Country IN ["US", "India", "UK"...]
RETURN COUNT(DISTINCT fct.SessionId)

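With this query (assuming sku.SKU is unique or indexed), you are only using the graph to optimize the lookup of the :Facts related to the products, since you get all the related :Facts directly instead of filtering on product. At that point the Country field is already present on the :Facts node, so you have everything you need for filtering, and you filter there.

For fun, you may want to compare this with a purely relational query: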

MATCH (fct: Facts)
WHERE fct.SKU IN ["A","B","C".....]
AND fct.Country IN ["US", "India", "UK"...]
RETURN COUNT(DISTINCT fct.SessionId)


Edited 2021-05-29
