Parsing text with a regex to get the relevant words and characters

debugcn Published at Dev

Scareactor

I want to parse a file that contains source code in some programming language and get a list of all the words and symbols in it.

I tried a few patterns, and this one has been the most successful so far:

pattern = r"\b(\w+|\W+)\b"

Using this on my text, which looks something like:

import re

string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)

gives me roughly the output I need, but with some characters I don't want and some unwanted grouping:

['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', '(', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']

My list contains whitespace that I would like to get rid of, and some combined symbols, like )., that I would like to have as single characters. I assume I have to modify the \W+ to get this done, but I need a little help.

The other problem is that my regex doesn't match the trailing ];, which I also need.

bobble bubble

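Why use \W+, which matches one or more characters, if you want single non-word characters in the output? Match one character at a time instead, and exclude whitespace by using a negated class. It also seems you can drop the word boundaries: they are what keeps the trailing ]; from matching, because \b only matches where a word character sits on exactly one side.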

re.findall(r"\w+|[^\w\s]", string)

This matches:

\w+ one or more word characters
|[^\w\s] or a single character that is neither a word character nor whitespace
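To make the behaviour concrete, here is a minimal sketch that applies this pattern to the sample string from the question. The names TOKEN_PATTERN and tokenize are made up for illustration and are not part of the question or of the re module.

```python
import re

# Hypothetical name: the pattern from the answer, compiled once.
# \w+      -> a run of one or more word characters
# [^\w\s]  -> exactly one character that is neither word nor whitespace
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Hypothetical helper: return words and single symbols, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)


string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
tokens = tokenize(string)
print(tokens)
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```

Each symbol now comes out as its own list item, whitespace is skipped, and the trailing ] and ; are included because nothing in the pattern depends on a word boundary.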



edited at 2021-07-19
