How can I use regex to search unicode texts and find words that contain repeated alphabets?

Masoud Masoumi Moghadam

I have a dataset that contains people's comments in Persian and Arabic. Some comments contain words like عاااالی, which is not a real word; the correct word is عالی. It's like writing woooooooow! instead of WoW!.

My goal is to find these words and remove all the extra letters. The only reference I found is the code below, which removes words that contain repeated letters:

import re
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n")                   
print([p.sub("", x).strip() for x in strs])

I just need to replace each such word with a version that has the extra repeated letters removed. You can use this sentence as a test case:

سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم.

It has to be like this:

سلام چطورین؟ من خیلی گشتم ولی مثل این کیفیت اصلا ندیدم

Please note that more than 3 repetitions of a letter are not acceptable.

Wiktor Stribiżew

You may use

re.sub(r'([^\W\d_])\1{2,}', r'\1', s)

It will replace each chunk of three or more identical consecutive letters with a single occurrence of that letter.

Details

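([^\W\d_]) - Capturing group 1: any Unicode letter (a word character that is neither a digit nor an underscore)
\1{2,} - two or more repetitions of the same letter that is captured in Group 1, so the whole match is three or more identical letters in a row.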

The r'\1' replacement will only keep a single letter occurrence in the result.
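
As a minimal sketch of how this could be applied to the test sentence from the question (the helper name squeeze_repeats and the constant REPEATED_LETTER are made up for this illustration, and Python 3 is assumed, where str patterns are Unicode-aware by default):

```python
import re

# A letter followed by two or more copies of itself, i.e. a run of 3+.
# In Python 3, \W, \d and _ are Unicode-aware for str patterns, so
# [^\W\d_] also matches Persian and Arabic letters.
REPEATED_LETTER = re.compile(r'([^\W\d_])\1{2,}')


def squeeze_repeats(text):
    """Collapse every run of 3+ identical letters to a single letter."""
    return REPEATED_LETTER.sub(r'\1', text)


s = "سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم."
print(squeeze_repeats(s))

print(squeeze_repeats("woooooooow"))  # wow
print(squeeze_repeats("1111 good"))   # digits and doubled letters are left alone
```

Punctuation is not touched, so the final period of the test sentence stays. Note that a run is always reduced to one letter, so a stretched word whose correct spelling has a doubled letter (for example "goood") would come out with a single letter ("god"); if that matters, replacing with r'\1\1' and spell-checking afterwards might be an option.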

Edited 2021-06-12