How can I use regex to search Unicode text and find words that contain repeated letters?

Masoud Masoumi Moghadam

I have a dataset that contains people's comments in Persian and Arabic. Some comments contain words like عاااالی, which is not a real word; the correct word is عالی. It's like writing woooooooow! instead of WoW!.

My goal is to find these words and remove all the extra letters. The only reference I found is the code below, which removes words that contain repeated letters:

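import re
# Matches a word containing a character repeated 4+ times in a row, or a purely numeric word
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n")
# Delete every matching word from each line
print([p.sub("", x).strip() for x in strs])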

I just need to replace each such word with a version that has the extra repeated letters removed. You can use this sentence as a test case:

سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم.

It has to be like this:

سلام چطورین؟ من خیلی گشتم ولی مثل این کیفیت اصلا ندیدم

Please note that more than 3 repeats are not acceptable.

Wiktor Stribiżew

You may use

re.sub(r'([^\W\d_])\1{2,}', r'\1', s)

It will replace chunks of three or more identical consecutive letters with a single occurrence.

Details

  • ([^\W\d_]) - Capturing group 1: any Unicode letter
  • \1{2,} - two or more repetitions of the same letter that is captured in Group 1, so the whole match is three or more identical letters in a row.
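
As a quick check of the character class described above, here is a minimal sketch (the name `letter` and the sample characters are only illustrative) that tests which single characters `[^\W\d_]` accepts. It assumes Python 3, where `str` patterns are Unicode-aware by default.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an underscore",
# which in practice means a letter, in any script.
letter = re.compile(r'[^\W\d_]')

# Latin letter, two Arabic-script letters, ASCII digit, Persian digit,
# underscore, Arabic question mark, space.
for ch in ['a', 'ا', 'م', '5', '۵', '_', '؟', ' ']:
    print(repr(ch), bool(letter.fullmatch(ch)))
```

Only the first three characters should be reported as matches; digits (including Persian digits), the underscore, punctuation and whitespace are excluded.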

The r'\1' replacement will only keep a single letter occurrence in the result.
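
Putting it together, a minimal sketch applied to the test sentence from the question might look like this. The names `REPEATED_LETTER` and `collapse_repeats` are made up for illustration and are not part of the original answer.

```python
import re

# One captured letter followed by two or more copies of it,
# i.e. a run of three or more identical letters.
REPEATED_LETTER = re.compile(r'([^\W\d_])\1{2,}')

def collapse_repeats(text: str) -> str:
    """Replace each run of 3+ identical letters with a single letter."""
    return REPEATED_LETTER.sub(r'\1', text)

sample = "سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم."
print(collapse_repeats(sample))
print(collapse_repeats("woooooooow"))  # -> wow
print(collapse_repeats("good 1111"))   # -> good 1111 (double letters and digits are left alone)
```

One limitation to keep in mind: a run is always reduced to a single letter, so an elongated word whose correct spelling has a doubled letter (for example "coooool") comes out as "col" rather than "cool". Handling that case would likely need a dictionary lookup on top of the regex.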
