I want to parse a file that contains source code in some programming language, and get a list of all the symbols, etc.
I tried a few patterns, and this is the most successful one so far:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, which looks something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
gives roughly the output I need, but with some characters I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', '(', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of, and some double symbols, like ").", that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other problem is that my regex doesn't match the ending ];, which I also need.
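Why use \W+ (one or more) if you want single non-word characters in the output? Additionally, exclude whitespace by using a negated class. It also seems you could drop the word boundaries: the trailing \b is what stops the final ]; from matching, since there is no word boundary between ; and the end of the string.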
re.findall(r"\w+|[^\w\s]", string)
This matches either
\w+ (one or more word characters), or
[^\w\s] (one character that is neither a word character nor whitespace).
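As a quick check, here is a minimal sketch that runs the suggested pattern on the sample string from the question. The names TOKEN_PATTERN and tokenize are only illustrative and not part of the original post.

```python
import re

# Pattern from the answer: a run of word characters, or a single
# character that is neither a word character nor whitespace.
# TOKEN_PATTERN and tokenize are hypothetical names used for illustration.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return words as whole tokens and every other symbol as a single char."""
    return TOKEN_PATTERN.findall(text)


string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
tokens = tokenize(string)
print(tokens)
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```

The whitespace is gone, ")." is split into two tokens, and the trailing "];" is now included. Note that multi-character operators such as += would also be split into single characters, which may or may not be what you want for a real language.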