Is there a way to speed up this code?

Maria Isabel Lopez

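I am writing a search engine in Python for a list of recipes, and it is supposed to run within a fixed limit of at most 0.1 seconds per search. I have been struggling to reach that speed with my code; I am averaging 0.4 seconds. I was wondering whether anyone has ideas for making this code faster. I have tried many things, but I know the loops are what slow it down the most. Ideally I would improve it mostly in plain Python, without adding many modules.

Other parts of the code are already fast, averaging 0.005 seconds. This part, however, gets quite slow when there are many recipes.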

from string import digits, punctuation

def countTokens(token):
    token = str(token).lower()

    # replace digits and punctuation with spaces
    tokens = token.translate(token.maketrans(digits + punctuation,
            " " * len(digits + punctuation)))

    return tokens.split(" ")

def normalOrder(recipes, queries):
    for r in recipes:
        k = r.keys()  
        parts, scores = [[],[],[],[]], 0
        parts[0] = countTokens(r["title"])
        parts[1] = countTokens(r["categories"]) if "categories" in k else []
        parts[2] = countTokens(r["ingredients"]) if "ingredients" in k else []
        parts[3] = countTokens(r["directions"]) if "directions" in k else []
        for q in queries:
            scores += 8 * parts[0].count(q) + 4 * parts[1].count(q) + 2 * parts[2].count(q) + 1 * parts[3].count(q)

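        # Python parses this as (scores + r["rating"]) if "rating" in k else 0,
        # so a recipe without a rating always gets a score of 0
        r["score"] = scores + r["rating"] if "rating" in k else 0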
    return recipes

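For context: I need to sum the occurrences of the query words in the four fields above, but only for the fields a recipe actually has. That is why the ifs are there.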

He3lixxx

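I noticed a few things:

Every call to countTokens generates the same translation table again (the maketrans call). I don't think this gets optimized away, so it probably costs you performance.
tokens.split(" ") creates a list of every word in the string. That is fairly expensive, for example when the string has 100,000 words, and you don't need it.
Overall, it looks like you are simply trying to count how often a word appears in a string. With str.count() you can count occurrences with much less overhead.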
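
As a rough illustration of the first point, assuming Python 3: the translation table depends only on digits and punctuation, so it can be built once at import time rather than on every call. The names TRANSLATION_TABLE and tokenize are made up for this sketch and do not appear in the question.

```python
from string import digits, punctuation

# Built once at import time instead of on every call.
_CHARS = digits + punctuation
TRANSLATION_TABLE = str.maketrans(_CHARS, " " * len(_CHARS))

def tokenize(text):
    """Lowercase text, blank out digits and punctuation, split on whitespace."""
    return str(text).lower().translate(TRANSLATION_TABLE).split()

print(tokenize("Honey-Lemon Cake, 2 layers!"))
# ['honey', 'lemon', 'cake', 'layers']
```

Note that split() without an argument drops the empty strings that split(" ") would leave behind wherever several spaces follow each other.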

Applying this, the countTokens function becomes unnecessary, and with a little refactoring we end up with:

def normalOrder(recipes, queries):
    for recipe in recipes:
        recipe["score"] = recipe.get("rating", 0)

        for query in queries:
            recipe["score"] += (
                8 * recipe["title"].lower().count(query)
                + 4 * recipe["categories"].lower().count(query)
                + 2 * recipe["ingredients"].lower().count(query)
                + 1 * recipe["directions"].lower().count(query)
            )

    return recipes

Does this work for you? And is it fast enough?
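
One way to check a version against the 0.1 second target is the standard timeit module. This is only a sketch: average_seconds, dummy_search and the sample data are hypothetical stand-ins that show the call shape, and any function taking (recipes, queries) could be passed in instead.

```python
import timeit

def average_seconds(search, recipes, queries, runs=20):
    """Return the mean wall-clock time of one call to search(recipes, queries)."""
    total = timeit.timeit(lambda: search(recipes, queries), number=runs)
    return total / runs

# Hypothetical stand-in search function and data, only to show the call shape.
def dummy_search(recipes, queries):
    return [r for r in recipes if any(q in r["title"].lower() for q in queries)]

sample = [{"title": "Honey Lemon Cake"}, {"title": "Potato Soup"}] * 1000
print(f"{average_seconds(dummy_search, sample, ['honey']):.6f} s per search")
```

Averaging over several runs smooths out noise from the machine; the printed number depends on the hardware and the data.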

Edit: In your original code, you wrapped the accesses to recipe["title"] and the other strings in an extra str() call. I assume they are already strings? If not, you need to add that here.
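
If some fields can be missing or are not plain strings, a variant along these lines might help. It is a sketch under assumptions: score_recipe and FIELD_WEIGHTS are illustrative names, a missing field is treated as empty text, and str() is applied as in the question's code.

```python
FIELD_WEIGHTS = {"title": 8, "categories": 4, "ingredients": 2, "directions": 1}

def score_recipe(recipe, queries):
    """Score one recipe; fields that are missing count as empty text."""
    score = recipe.get("rating", 0)
    for field, weight in FIELD_WEIGHTS.items():
        # str() covers non-string values such as a list of ingredients.
        text = str(recipe.get(field, "")).lower()
        score += weight * sum(text.count(q) for q in queries)
    return score

recipe = {"title": "Lemon Cake", "ingredients": ["lemon", "flour"], "rating": 4}
print(score_recipe(recipe, ["lemon"]))  # 4 + 8 * 1 + 2 * 1 = 14
```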

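Edit2: You mentioned in the comments that punctuation is a problem. As I said there, I don't think you need to worry about it: the count call only takes punctuation into account when both the query word and the recipe text contain it, and then it only counts occurrences where the surrounding punctuation matches the query. Look at these examples: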

>>> "Some text, that...".count("text")
1
>>> "Some text, that...".count("text.")
0
>>> "Some text, that...".count("text,")
1

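If you want this to behave differently, you can do something like what you did in your original question: create a translation table and apply it. Keep in mind that applying the translation to the recipe text (as you did in the question) does not make much sense, because afterwards a query word containing punctuation will never match. You could get that effect much more simply by ignoring every query word that contains punctuation. Translating the query terms instead means that if someone enters a word with punctuation attached, such as "potato,", occurrences of "potato" are still found. That would look like this: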

from string import digits, punctuation

def normalOrder(recipes, queries):
    translation_table = str.maketrans(digits + punctuation, " " * len(digits + punctuation))
    for recipe in recipes:
        recipe["score"] = recipe.get("rating", 0)

        for query in queries:
            replaced_query = query.translate(translation_table)
            recipe["score"] += (
                8 * recipe["title"].lower().count(replaced_query)
                + 4 * recipe["categories"].lower().count(replaced_query)
                + 2 * recipe["ingredients"].lower().count(replaced_query)
                + 1 * recipe["directions"].lower().count(replaced_query)
            )

    return recipes

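Edit3: In the comments you said that you want a search for ["honey", "lemon"] to match "honey-lemon", but you do not want "butter" to match "butterfingers". For that, your initial approach is probably the best solution, but keep in mind that searching for the singular "potato" will then not match the plural ("potatoes") or other derived forms.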

from string import digits, punctuation

def normalOrder(recipes, queries):
    transtab = str.maketrans(digits + punctuation, " " * len(digits + punctuation))
    for recipe in recipes:
        recipe["score"] = recipe.get("rating", 0)

        title_words = recipe["title"].lower().translate(transtab).split()
        category_words = recipe["categories"].lower().translate(transtab).split()
        ingredient_words = recipe["ingredients"].lower().translate(transtab).split()
        direction_words = recipe["directions"].lower().translate(transtab).split()

        for query in queries:
            recipe["score"] += (
                8 * title_words.count(query)
                + 4 * category_words.count(query)
                + 2 * ingredient_words.count(query)
                + 1 * direction_words.count(query)
            )

    return recipes
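
To make the difference between the two matching styles concrete, here is a small comparison. substring_hits and word_hits are illustrative helper names, and the sample text is invented.

```python
from string import digits, punctuation

TRANSTAB = str.maketrans(digits + punctuation, " " * len(digits + punctuation))

def substring_hits(text, query):
    """Count raw substring occurrences, as str.count does."""
    return text.lower().count(query)

def word_hits(text, query):
    """Count whole-word occurrences after replacing punctuation with spaces."""
    return text.lower().translate(TRANSTAB).split().count(query)

text = "Honey-lemon butterfingers with butter"
for query in ("honey", "lemon", "butter"):
    print(query, substring_hits(text, query), word_hits(text, query))
# honey 1 1
# lemon 1 1
# butter 2 1
```

The substring count picks up the "butter" inside "butterfingers", while the word-based count does not.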

If you call this function repeatedly with the same recipes, you can improve performance by storing the result of .lower().translate().split() in the recipe. Then the lists do not need to be rebuilt on every call.
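
A possible shape for that caching, assuming the recipes are mutable dicts that live across calls. The cache key "_words" and the helper cached_words are hypothetical, and missing fields are treated as empty text.

```python
from string import digits, punctuation

TRANSTAB = str.maketrans(digits + punctuation, " " * len(digits + punctuation))
FIELDS = ("title", "categories", "ingredients", "directions")

def cached_words(recipe):
    """Tokenize each field once and reuse the stored lists on later calls."""
    if "_words" not in recipe:  # "_words" is a hypothetical cache key
        recipe["_words"] = {
            field: str(recipe.get(field, "")).lower().translate(TRANSTAB).split()
            for field in FIELDS
        }
    return recipe["_words"]

recipe = {"title": "Potato Soup", "directions": "Boil the potato."}
first = cached_words(recipe)
second = cached_words(recipe)
print(first is second)       # True: the second call reuses the stored lists
print(first["directions"])   # ['boil', 'the', 'potato']
```

The cached lists go stale if a recipe's text is edited later, so the entry would have to be deleted or rebuilt in that case.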

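Depending on your input data (how many queries there are on average), it can also make sense to go through the split() result just once and tally the count of each word. That makes looking up a single word very fast, and the tally can be kept between calls of the function, but it is more expensive to build.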

from collections import Counter
from string import digits, punctuation

transtab = str.maketrans(digits + punctuation, " " * len(digits + punctuation))

def counterFromString(string):
    words = string.lower().translate(transtab).split()
    return Counter(words)

def normalOrder(recipes, queries):
    for recipe in recipes:
        recipe["score"] = recipe.get("rating", 0)

        title_counter = counterFromString(recipe["title"])
        category_counter = counterFromString(recipe["categories"])
        ingredient_counter = counterFromString(recipe["ingredients"])
        direction_counter = counterFromString(recipe["directions"])

        for query in queries:
            recipe["score"] += (
                8 * title_counter[query]
                + 4 * category_counter[query]
                + 2 * ingredient_counter[query]
                + 1 * direction_counter[query]
            )

    return recipes

Edit4: I replaced the defaultdict with Counter; I was not aware that this class existed.
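
Taking the Counter idea one step further, the four per-field counters could be folded into a single weighted Counter per recipe and built once, before any search runs. This is a sketch of a possible extension rather than part of the answer above: weighted_counts, build_index and score_all are hypothetical names, and missing fields are treated as empty text.

```python
from collections import Counter
from string import digits, punctuation

TRANSTAB = str.maketrans(digits + punctuation, " " * len(digits + punctuation))
WEIGHTS = {"title": 8, "categories": 4, "ingredients": 2, "directions": 1}

def weighted_counts(recipe):
    """Fold the four per-field word counts into one Counter of weighted scores."""
    combined = Counter()
    for field, weight in WEIGHTS.items():
        words = str(recipe.get(field, "")).lower().translate(TRANSTAB).split()
        for word, count in Counter(words).items():
            combined[word] += weight * count
    return combined

def build_index(recipes):
    """Run once, before any search."""
    return [weighted_counts(recipe) for recipe in recipes]

def score_all(recipes, index, queries):
    """One dictionary lookup per query word and recipe."""
    return [
        recipe.get("rating", 0) + sum(counts[q] for q in queries)
        for recipe, counts in zip(recipes, index)
    ]

recipes = [
    {"title": "Honey-Lemon Cake", "ingredients": "honey, lemon, flour", "rating": 4},
    {"title": "Butterfingers", "directions": "Melt the butter."},
]
index = build_index(recipes)
print(score_all(recipes, index, ["honey", "lemon"]))  # [24, 0]
print(score_all(recipes, index, ["butter"]))          # [4, 1]
```

With this layout the per-search cost no longer depends on the length of the recipe texts, at the price of building and storing the index up front.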

Edited 2021-06-12