Remove Script tag and on attributes from HTML

user3752226

I have the following HTML and I need to remove the script tags and any script related attributes in the HTML. By script related attributes I mean any attribute that starts with on.

<body>
<script src="...">

    </script>
<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" id="oReportDiv" style="overflow:auto;WIDTH:100%">

<script type="text/javascript" language="javascript">

//&lt;![CDATA[

function CreateFixedHeaders() {}//]]&gt;
</script>
<script>

            var ClientReportfb64a4706a3749c484169e...
        </script>
</body>

My first thought was to use BeautifulSoup to remove the tags and attributes. Unfortunately, I am unable to use BeautifulSoup. Seeing that BeautifulSoup is off the table I can see two options for doing this. The first option I see is splitting the strings and parsing based on index. This seems like a bad solution to me.

The other option is to use Regular Expressions. However, we know that isn't a good solution either (Cthulhu Parsing).

Now with that in mind, I personally feel it is alright to use regular expressions to strip the attributes. After all, with those it is still simple string manipulation.

So for removing the attributes I have:

script_attribute_regex = r'\son[a-zA-Z]+="[a-zA-Z0-9\.;\(\)_]+"'
result = re.sub(script_attribute_regex, "", page_source)

As I've said, I personally think the above is a perfectly acceptable use of Regular Expressions with HTML. Still, I would like to get some opinions on this usage.

Then there is the question of the script tags. I'm very tempted to go with Regular Expressions for this because I know them and I know what I need is pretty simple. Something like:

<script(.*)</script>

The above would start to get me close to what I need. And yes I realize the above RegEx will grab everything starting at the first opening script tag until the last closing script tag, but it's a starting example.

I'm very tempted to use Regular Expressions as I'm familiar with them (more so than Python) and I know that is the quickest way to achieve the results I want, at least for me it is.

So I need help to go against my nature and not be evil. I want to be evil and use RegEx, so somebody please show me the light and guide me to the promised land of non-Regular Expressions.

Thanks

Update:

It looks like I wasn't very clear about what my question actually is; I apologize for that. My question is: how can I parse the HTML using pure Python without Regular Expressions?

<script(.*)</script>

As for the above code example, it's wrong. I know it is wrong; I was using it only as an example of a starting point.

I hope this clears up my question somewhat.
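For the pure-Python route asked about here, one standard-library option is html.parser.HTMLParser. The following is a minimal sketch under some assumptions, not a hardened sanitizer: the names ScriptStripper and strip_scripts are made up for illustration, tag and attribute names come back lowercased, attribute values are re-quoted with double quotes, and things like javascript: URLs are not touched.

```python
import html
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML, dropping <script> elements and on* attributes."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &lt; separately,
        # so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def _emit(self, tag, attrs, close):
        kept = [tag]
        for name, value in attrs:
            if name.startswith("on"):
                continue  # event-handler attribute: drop it
            if value is None:
                kept.append(name)  # bare attribute such as "disabled"
            else:
                # The parser unescapes attribute values, so escape them again.
                kept.append('{}="{}"'.format(name, html.escape(value, quote=True)))
        self.parts.append("<" + " ".join(kept) + close)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self._emit(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        if tag != "script" and not self.in_script:
            self._emit(tag, attrs, " />")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.parts.append("&{};".format(name))

    def handle_charref(self, name):
        if not self.in_script:
            self.parts.append("&#{};".format(name))

    def handle_comment(self, data):
        if not self.in_script:
            self.parts.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.parts.append("<!{}>".format(decl))


def strip_scripts(page_source: str) -> str:
    """Return page_source without script elements or on* attributes."""
    parser = ScriptStripper()
    parser.feed(page_source)
    parser.close()
    return "".join(parser.parts)
```

Calling strip_scripts(page_source) on the saved page returns the cleaned markup as a string. Because the parser tracks when it is inside a script element, script bodies that contain characters such as < or > are skipped as a whole rather than matched textually.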

Update 2

I just wanted to add a few more notes about what I am doing.

I am crawling a web site to get the data I need.

Once we have the page that contains the data we need it is saved to the database.

Then the saved web page is displayed to the user.

The issue I am trying to solve happens here. When you attempt to interact with the page, the application throws a script error that forces the user to click on a confirmation box. The application is not a web browser, but it uses the web browser DLL in Windows (I cannot remember the name at the moment).

The error in question only happens in this one page for this one web site.

Update 3

After adding the update I realized I was overthinking the problem: I was looking for a more generic solution, but in this case that isn't what is needed.

The page is dynamically generated, but the script tags stay static. With that in mind the solution becomes much simpler: I no longer need to treat the page as HTML and can treat the script tags as static strings.

So the solution I'm looking at is

import re


def strip_script_tags(page_source: str) -> str:
    pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
    result = re.sub(pattern, "", page_source) 
    pattern2 = re.compile(r'<script[\s\S]+?/script>')
    result = re.sub(pattern2, "", result)
    return result

I would like to avoid regular expressions; however, since I'm limited to the standard library, regular expressions seem like the best solution in this case. Which means @skamazin's answer is correct.
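One caveat is worth checking before relying on the pattern \s?on\w+="[^"]+"\s? as written. Because the leading \s? is optional, a match can begin in the middle of an attribute name such as content, and because whitespace is consumed on both sides, the tag name can end up glued to the next attribute. The sketch below compares it with a stricter variant. It assumes attribute values are always double-quoted, as in the sample page; ON_ATTRIBUTE and strip_on_attributes are hypothetical names used only for illustration.

```python
import re

# The pattern discussed in this thread, kept for comparison.
LOOSE = re.compile(r'\s?on\w+="[^"]+"\s?')

# Stricter variant (an assumption, not from the thread): require whitespace
# before the attribute name, allow an empty value, and leave the whitespace
# after the attribute in place.
ON_ATTRIBUTE = re.compile(r'\son\w+="[^"]*"')


def strip_on_attributes(page_source: str) -> str:
    """Remove double-quoted on* attributes that follow whitespace."""
    return ON_ATTRIBUTE.sub("", page_source)


sample = '<meta content="text/html"><div onscroll="CreateFixedHeaders()" id="oReportDiv">'

print(LOOSE.sub("", sample))
# <meta c><divid="oReportDiv">

print(strip_on_attributes(sample))
# <meta content="text/html"><div id="oReportDiv">
```

The stricter variant is still plain text matching: single-quoted or unquoted handlers are not covered, and page text that happens to look like an on attribute would be removed as well.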

skamazin

To remove all the attributes that start with on, you can use the regex:

\s?on\w+="[^"]+"\s?

And substitutes with the empty string (deletion). So in Python it should be:

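import re

# Plain raw string: the ur'' prefix is Python 2 only and is a syntax error in Python 3.
pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
# page_source is the HTML text to clean.
result = pattern.sub("", page_source)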

If you are trying to match a script tag together with everything inside it, try:

<script[\s\S]+?/script>


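The problem with your regex is that the dot (.) doesn't match newline characters by default. A class that combines a set with its complement, such as [\s\S], matches every possible character, newlines included. Also make sure to use the ? in [\s\S]+? so that the match is lazy instead of greedy; otherwise it runs from the first opening script tag to the last closing one.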
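The difference between these behaviours is easy to see on a small input. The snippet below is only an illustration; the page string is made up and is not the page from the question.

```python
import re

page = (
    "<p>a</p><script>\nvar x = 1;\n</script>"
    "<p>b</p><script>var y = 2;</script><p>c</p>"
)

# By default '.' does not match '\n', so the multi-line script block is
# never matched; only the single-line one is removed.
dot_default = re.sub(r"<script(.*)</script>", "", page)

# With re.DOTALL '.' matches newlines too, but the greedy '*' now runs from
# the first "<script" to the last "</script>" and takes <p>b</p> with it.
dot_greedy = re.sub(r"<script(.*)</script>", "", page, flags=re.DOTALL)

# [\s\S] matches any character, and the lazy '+?' stops at the first
# closing tag, so each script block is removed separately.
lazy = re.sub(r"<script[\s\S]+?/script>", "", page)

# Same effect, spelled with a lazy dot and re.DOTALL.
dotall_lazy = re.sub(r"<script.*?</script>", "", page, flags=re.DOTALL)

print(repr(dot_default))
# '<p>a</p><script>\nvar x = 1;\n</script><p>b</p><p>c</p>'
print(repr(dot_greedy))
# '<p>a</p><p>c</p>'
print(repr(lazy))
# '<p>a</p><p>b</p><p>c</p>'
```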
