Django - Tutorial 038. Use BeatifulSoup 4 to clean up the published content from unwanted html tags

BeaftifulSoup4, Django, Python, HTML, Python 3

When developing a web site that adds the ability to write comments or publish articles that allow html layout, the mechanism for clearing unwanted html tags, in particular script and style tags, is important, since malicious scripts on a quality resource definitely should not be present. It will also be good to be able to clean up the style of the text, especially if the resource implies a uniform style. The discordance of screaming fonts is not needed by anyone, and adds problems with the layout.

To implement this mechanism, I use the Python package Beautiful Soup 4 and finally wrote one class, which essentially does everything I need. Removes unnecessary tags, adds necessary classes to tags, saves classes in tags, if you need to leave them during stripping, this is important for classes that are added at the stage of writing a comment, for example, when inserting a YouTube video or adding program code when the user selects which programming language should be represented in the program code block.

Install BeautifulSoup 4

pip install beautifulsoup4

Program code

This example is presented in the form of a class, so that using the inheritance and redefinition of the cleaning method to form the necessary logic, and the program code of the module for cleaning html does not turn into a collection of heterogeneous, inconsistent functions.

# -*- coding: utf-8 -*-

import re

from bs4 import BeautifulSoup
from YourDjangoApp import settings


class ESoup:
    # initialization of the text clear object,
    # can use immediately receive additional tags for deletion so as not to override the class
    def __init__(self, text, tags_for_extracting=()):
        self.soup = BeautifulSoup(text, "lxml") if text else None
        self.tags_for_extracting = ('script', 'style',) + tags_for_extracting

    # Method to remove specified tags
    def __extract_tags(self, soup, tags=()):
        for tag in tags:
            for current_tag in soup.find_all(tag):
                current_tag.extract()
        return soup

    # Method for deleting attributes of all tags
    def __remove_attrs(self, soup):
        for tag in soup.find_all(True):
            tag.attrs = {}
        return soup

    # Method for deleting attributes of all tags except those listed in whitelist_tags
    def __remove_all_attrs_except(self, soup, whitelist_tags=()):
        for tag in soup.find_all(True):
            if tag.name not in whitelist_tags:
                tag.attrs = {}
        return soup

    # Remove all attributes from all tags except those listed in whitelist_tags
    # If the tag is in whitelist_tags, it will only delete those attributes that are not listed in whitelist_attrs
    # Also, this method saves to the tag the classes listed in whitelist_classes
    # regardless of whether it was listed in whitelist_tags or in whitelist_attrs.
    # I just have classes with a special position for tags
    def __remove_all_attrs_except_saving(self, soup, whitelist_tags=(), whitelist_attrs=(), whitelist_classes=()):
        for tag in soup.find_all(True):
            saved_classes = []
            if tag.has_attr('class'):
                classes = tag['class']
                for class_str in whitelist_classes:
                    if class_str in classes:
                        saved_classes.append(class_str)

            if tag.name not in whitelist_tags:
                tag.attrs = {}
            else:
                attrs = dict(tag.attrs)
                for attr in attrs:
                    if attr not in whitelist_attrs:
                        del tag.attrs[attr]

            if len(saved_classes) > 0:
                tag['class'] = ' '.join(saved_classes)

        return soup

    # Adds a nofollow relationship to the tag, checking the url of the src or img attribute
    # If the link leads to the internal pages of your site, then nofollow will not be added.
    def __add_rel_attr(self, soup, tag, attr):
        for tag in soup.find_all(tag):
            attr_content = tag.get(attr)
            if not attr_content.startswith(settings.SITE_URL) and not attr_content.startswith('/'):
                tag['rel'] = ['nofollow']
        return soup

    # Adds new classes to the tag, preserving those classes that already existed.
    def __add_class_attr(self, soup, tag, classes=()):
        for tag in soup.find_all(tag):
            saved_classes = []
            if tag.has_attr('class'):
                saved_classes.append(tag['class'])
            saved_classes.extend(list(classes))
            tag['class'] = ' '.join(saved_classes)
        return soup

    # The method that performs the cleaning, I propose to override it, if you need to change the logic for cleaning up the html code
    def clean(self):
        # if BeautifulSoup was created during initialization, then you can perform the cleanup
        if self.soup:
            # Remove all tags that we don’t like.
            soup = self.__extract_tags(soup=self.soup, tags=self.tags_for_extracting)
            # Remove all attributes from all tags except
            # src and href for tags img and a,
            # and also leave prettyprint class
            soup = self.__remove_all_attrs_except_saving(
                soup=soup,
                whitelist_tags=('img', 'a'),
                whitelist_attrs=('src', 'href',),
                whitelist_classes=('prettyprint',)
            )
            # add rel="nofollow" for external links
            soup = self.__add_rel_attr(soup=soup, tag='a', attr='href')
            soup = self.__add_rel_attr(soup=soup, tag='img', attr='src')
            # improve the appearance of images using the img-fluid class
            soup = self.__add_class_attr(soup=soup, tag='img', classes=('img-fluid',))
            # add the linenums class for pre tags
            soup = self.__add_class_attr(soup=soup, tag='pre', classes=('linenums',))
            # returning useful content, the fact is that BeautifulSoup 4 adds more html and body tags,
            # which I, for example, do not need
            return re.sub('<body>|</body>', '', soup.body.prettify())
        return ''

    # Static class method, something like Shortcut
    @staticmethod
    def clean_text(text, tags_for_extracting=()):
        soup = ESoup(text=text, tags_for_extracting=tags_for_extracting)
        return soup.clean()

Using

So

soup = ESoup(text=text, tags_for_extracting=tags_for_extracting)
soup.clean()

Or so

ESoup.clean_text(text=text, tags_for_extracting=tags_for_extracting)





We recommend hosting TIMEWEB
We recommend hosting TIMEWEB
Stable hosting, on which the social network EVILEG is located. For projects on Django we recommend VDS hosting.
Support the author Donate

я думаю, что последний


    @staticmethod
    def clean_text(text, tags_for_extracting=()):
        soup = ESoup(text=text, tags_for_extracting=tags_for_extracting)
        return soup.clean()

есть смысл заменить на classmethod (при наследовании, старый вариант сломается, а с классом - нет):

    @classmethod
    def clean_text(cls, text, tags_for_extracting=()):
        soup = cls(text=text, tags_for_extracting=tags_for_extracting)
        return soup.clean()

Спасибо за информацию, не думал об этом.

Надо будет проверить на кошках.

Comments

Only authorized users can post comments.
Please, Log in or Sign up
How to become an author?

Contribute to the evolution of the EVILEG community.

Learn how to become a site author.

Learn it
Donate

Good day, Dear Users!!!

I am Evgenii Legotckoi, developer of EVILEG. And it is my hobby project, which helps to learn programming another programmers and developers

If the site helped you, and you want also support the development of the site, than you can donate by following ways

PayPalYandex.Money
Timeweb

Let me recommend you the excellent hosting on which EVILEG is located.

For many years, Timeweb has been proving his stability.

For projects on Django I recommend VDS hosting

View Hosting Timeweb
MN
May 25, 2020, 11:33 a.m.
Mitja Nagibin

C ++ - Test 004. Pointers, Arrays and Loops

  • Result:50points,
  • Rating points-4
f
May 25, 2020, 5:05 a.m.
falcon

C++ - Test 001. The first program and data types

  • Result:66points,
  • Rating points-1
jm
May 25, 2020, 3:30 a.m.
just maks

C ++ - Test 004. Pointers, Arrays and Loops

  • Result:80points,
  • Rating points4
Last comments
May 26, 2020, 6:51 a.m.
Evgenij Legotskoj

Qt/C++ - Lesson 004. QSqlTableModel – How to present the table from database?

У вас база данных не открылась Исправьте путь к базе данных на свой корректный в следующих методах void DataBase::connectToDataBase() bool DataBase::openDataBase()
T1
T1
May 26, 2020, 6:22 a.m.
Tima 1

Qt/C++ - Lesson 004. QSqlTableModel – How to present the table from database?

полностью повторил структору проекта. В форму дабавил tableView. Но при запуске получаю форму только с пустым tableView. Можете подсказать в чем пробелма?
May 26, 2020, 6:02 a.m.
Evgenij Legotskoj

Qt/C++ - Lesson 004. QSqlTableModel – How to present the table from database?

Потому что это файл который нужно создать, а не библиотека. В статье есть содержание этого файла. Добавляйте в проект. Копируйте содержимое из статьи.
T1
May 26, 2020, 6 a.m.
Tima 1

Qt/C++ - Lesson 004. QSqlTableModel – How to present the table from database?

не удается подключиить библеотеку include "database.h" выдает ошибку. Можете помочь?
Now discuss on the forum
May 26, 2020, 5:16 a.m.
BlinCT

Отсутствие драйвера SQLite в пакете Qt 4 на Linux

Вот честно непонимаю почему до сих пор используют qt4, там же столько всего отсутствует, много фишек и возможностей нету там. То есть используя такое старье приходится много писать самому а не и…
DK
May 26, 2020, 2:24 a.m.
Dzhon Kofi

Disable autoscroll

такие естественные решения все перепробовал. Получилось вчера так: const int maximumScroll = ui->_samples->verticalScrollBar()->maximum();const int sliderPos = ui->_samp…
May 26, 2020, 12:43 a.m.
Ruslan Polupan

Посоветуйте новичку (базы данных и Qt, что учить)

Без БД сейчас практически никуда. Поэтому SQL надо знать. SQLite самы простой вариант, но имхо лучще начать с бд клиент-сервер. Настроить сервер. Подключаться клиентом. Просто это помогает понят…
EJ
May 25, 2020, 2:42 p.m.
Esteban José María

Компиляция пустого проекта Qt Android

qt 5.12.8 BUILD SUCCESSFUL in 42s 28 actionable tasks: 28 executed Android package built successfully in 68.251 ms. Ну, буду разбираться по-тихоньку. :)
s
May 25, 2020, 1:24 p.m.
sander-007

Использование файлов в памяти (memory file mapping)

Добрый вечер, проблемы работы с файлом Exel нет вообще. Весь смысл в том чтобы не создавать на диске физический файл (требования безопасности), дабы потом не чистить. А так вопрос только в этом …
About
Services
© EVILEG 2015-2020
Recommend hosting TIMEWEB