Evgenii Legotckoi
Oct. 24, 2018, 12:38 p.m.

Django - Tutorial 038. Using Beautiful Soup 4 to clean published content of unwanted HTML tags

When developing a website that lets users write comments or publish articles with HTML markup, a mechanism for removing unwanted HTML tags is important, in particular script and style tags: malicious scripts must not end up on a quality resource. It is also useful to be able to strip styling from the text, especially if the resource is meant to have a uniform style. Nobody needs a clash of garish fonts, and it causes layout problems as well.

To implement this mechanism I use the Python package Beautiful Soup 4, and I ended up writing a single class that does essentially everything I need. It removes unwanted tags, adds the required classes to tags, and preserves selected classes that must survive the stripping. The last point matters for classes that are added while a comment is being written, for example when a YouTube video is inserted, or when program code is added and the user selects the programming language of the code block.


Install BeautifulSoup 4

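    pip install beautifulsoup4

The class below creates the parser with "lxml", so the lxml package has to be installed as well:

    pip install lxml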
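
Once the package is installed, the core idea can be tried in a few lines before looking at the full class. This is only a minimal sketch: the sample comment is made up, and it assumes the built-in "html.parser" backend instead of the "lxml" parser that the class in this article uses.

```python
from bs4 import BeautifulSoup

# Hypothetical user-submitted comment containing unwanted markup
raw_comment = (
    '<p style="font-size: 40px">Hello</p>'
    '<script>alert("xss")</script>'
    '<style>p { color: red; }</style>'
)

# Assumption: "html.parser" ships with Python, so no extra parser is needed here
soup = BeautifulSoup(raw_comment, "html.parser")

# Drop script and style elements together with their contents
for unwanted in soup.find_all(["script", "style"]):
    unwanted.decompose()

# Strip every attribute from the remaining tags, including inline styles
for tag in soup.find_all(True):
    tag.attrs = {}

cleaned = str(soup)
print(cleaned)
```

With "html.parser" the fragment is not wrapped in html and body tags, so nothing has to be stripped from the result afterwards.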

Program code

The example is written as a class so that the required logic can be built through inheritance and by overriding the cleaning method, and so that the HTML cleaning module does not turn into a collection of heterogeneous, inconsistent functions.

    # -*- coding: utf-8 -*-

    import re

    from bs4 import BeautifulSoup
    from YourDjangoApp import settings  # project settings module; must define SITE_URL


    class ESoup:
        # Helper methods use a single leading underscore: double-underscore names are
        # name-mangled per class, which makes them awkward to call from a subclass
        # that overrides clean().

        # Initialize the cleaner with the text to process.
        # Extra tags to delete can be passed right here, so the class does not have to be overridden for that.
        def __init__(self, text, tags_for_extracting=()):
            self.soup = BeautifulSoup(text, "lxml") if text else None
            self.tags_for_extracting = ('script', 'style') + tuple(tags_for_extracting)

        # Remove the specified tags together with their contents
        def _extract_tags(self, soup, tags=()):
            for tag in tags:
                for current_tag in soup.find_all(tag):
                    current_tag.extract()
            return soup

        # Delete the attributes of all tags
        def _remove_attrs(self, soup):
            for tag in soup.find_all(True):
                tag.attrs = {}
            return soup

        # Delete the attributes of all tags except the tags listed in whitelist_tags
        def _remove_all_attrs_except(self, soup, whitelist_tags=()):
            for tag in soup.find_all(True):
                if tag.name not in whitelist_tags:
                    tag.attrs = {}
            return soup

        # Remove all attributes from all tags except the tags listed in whitelist_tags.
        # For a tag in whitelist_tags, only attributes that are not listed in whitelist_attrs are deleted.
        # The classes listed in whitelist_classes are kept on every tag,
        # regardless of whether the tag is in whitelist_tags or 'class' is in whitelist_attrs.
        # These are classes that have a special meaning for the tags on my site.
        def _remove_all_attrs_except_saving(self, soup, whitelist_tags=(), whitelist_attrs=(), whitelist_classes=()):
            for tag in soup.find_all(True):
                # Beautiful Soup parses the class attribute into a list of class names
                classes = tag.get('class', [])
                saved_classes = [class_str for class_str in whitelist_classes if class_str in classes]

                if tag.name not in whitelist_tags:
                    tag.attrs = {}
                else:
                    # Iterate over a copy of the keys, because the dict is modified in the loop
                    for attr in list(tag.attrs):
                        if attr not in whitelist_attrs:
                            del tag.attrs[attr]

                if saved_classes:
                    tag['class'] = saved_classes

            return soup

        # Add rel="nofollow" to the tag after checking the URL in its href or src attribute.
        # If the link leads to the internal pages of your site, nofollow is not added.
        def _add_rel_attr(self, soup, tag, attr):
            for current_tag in soup.find_all(tag):
                attr_content = current_tag.get(attr)
                # Skip tags that do not have the attribute at all, e.g. <a> without href
                if not attr_content:
                    continue
                # Protocol-relative URLs ("//host/path") start with a slash but lead to another host
                is_internal = attr_content.startswith(settings.SITE_URL) or (
                    attr_content.startswith('/') and not attr_content.startswith('//')
                )
                if not is_internal:
                    current_tag['rel'] = ['nofollow']
            return soup

        # Add new classes to the tag, preserving the classes that already exist
        def _add_class_attr(self, soup, tag, classes=()):
            for current_tag in soup.find_all(tag):
                existing = current_tag.get('class', [])
                # The value is a list when parsed, but may be a string if it was assigned manually
                if isinstance(existing, str):
                    existing = existing.split()
                current_tag['class'] = list(existing) + [c for c in classes if c not in existing]
            return soup

        # The method that performs the cleaning.
        # Override it if you need to change the logic for cleaning up the HTML code.
        def clean(self):
            # The cleanup is possible only if BeautifulSoup was created during initialization
            if self.soup:
                # Remove all tags that we don't like
                soup = self._extract_tags(soup=self.soup, tags=self.tags_for_extracting)
                # Remove all attributes from all tags except
                # src and href for the img and a tags,
                # and also keep the prettyprint class
                soup = self._remove_all_attrs_except_saving(
                    soup=soup,
                    whitelist_tags=('img', 'a'),
                    whitelist_attrs=('src', 'href'),
                    whitelist_classes=('prettyprint',)
                )
                # Add rel="nofollow" for external links
                soup = self._add_rel_attr(soup=soup, tag='a', attr='href')
                soup = self._add_rel_attr(soup=soup, tag='img', attr='src')
                # Improve the appearance of images using the img-fluid class
                soup = self._add_class_attr(soup=soup, tag='img', classes=('img-fluid',))
                # Add the linenums class for pre tags
                soup = self._add_class_attr(soup=soup, tag='pre', classes=('linenums',))
                # Return only the useful content: with the lxml parser, Beautiful Soup 4
                # wraps the text in html and body tags, which I do not need
                if soup.body is None:
                    return ''
                return re.sub('<body>|</body>', '', soup.body.prettify())
            return ''

        # Static method of the class, something like a shortcut
        @staticmethod
        def clean_text(text, tags_for_extracting=()):
            soup = ESoup(text=text, tags_for_extracting=tags_for_extracting)
            return soup.clean()
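
The nofollow decision in the class is a plain prefix check against settings.SITE_URL. A prefix check also accepts a host such as example.com.evil.org when the site URL is https://example.com. The sketch below shows a stricter variant that compares host names with urllib.parse from the standard library. It is not the logic used in the class above: the is_external function and the SITE_URL value are hypothetical names introduced only for this illustration.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for settings.SITE_URL
SITE_URL = "https://example.com"


def is_external(url, site_url=SITE_URL):
    """Return True when url names a host other than the host of site_url."""
    host = urlparse(url).netloc
    # Relative links such as "/articles/1" or "page.html" have no host part
    if not host:
        return False
    return host != urlparse(site_url).netloc


# Sample URLs, made up for the illustration
for url in (
    "/articles/1",
    "https://example.com/about",
    "https://other.org/page",
    "//cdn.other.org/image.png",
    "https://example.com.evil.org/",
):
    print(url, "->", "nofollow" if is_external(url) else "no rel")
```

Whether links without a leading slash, such as "page.html", should count as internal is a design choice. This sketch treats them as internal, while the prefix check in the class marks them with nofollow.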

Usage

Like this:

    soup = ESoup(text=text, tags_for_extracting=tags_for_extracting)
    cleaned_text = soup.clean()

Or like this:

    cleaned_text = ESoup.clean_text(text=text, tags_for_extracting=tags_for_extracting)
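
Since the class is meant to be extended, here is a rough sketch of overriding the cleaning method. To keep it self-contained it does not use ESoup itself. MiniCleaner is a reduced, hypothetical stand-in that only removes tags, PlainCleaner is an equally hypothetical subclass, and "html.parser" is assumed instead of "lxml".

```python
from bs4 import BeautifulSoup


class MiniCleaner:
    """Reduced, hypothetical stand-in for ESoup: it only removes tags."""

    def __init__(self, text, tags_for_extracting=()):
        # Assumption: "html.parser" ships with Python; the article uses "lxml"
        self.soup = BeautifulSoup(text, "html.parser") if text else None
        self.tags_for_extracting = ("script", "style") + tuple(tags_for_extracting)

    def clean(self):
        if self.soup is None:
            return ""
        for name in self.tags_for_extracting:
            for tag in self.soup.find_all(name):
                tag.extract()
        return str(self.soup)


class PlainCleaner(MiniCleaner):
    """Subclass with stricter rules: every attribute is dropped as well."""

    def clean(self):
        if self.soup is None:
            return ""
        for tag in self.soup.find_all(True):
            tag.attrs = {}
        # Reuse the tag removal implemented in the base class
        return super().clean()


# Made-up input for the illustration
html = '<p class="big">Hi</p><iframe src="https://example.org"></iframe><script>x()</script>'

base_result = MiniCleaner(html).clean()
strict_result = PlainCleaner(html, tags_for_extracting=("iframe",)).clean()

print(base_result)
print(strict_result)
```

Passing extra tags through tags_for_extracting covers the simple cases, and a subclass is only needed when the cleaning rules themselves change.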
Илья Чичак
  • Dec. 5, 2018, 8:37 a.m.

I think the last method,

    @staticmethod
    def clean_text(text, tags_for_extracting=()):
        soup = ESoup(text=text, tags_for_extracting=tags_for_extracting)
        return soup.clean()

is worth replacing with a classmethod (with inheritance the old version breaks, while the version that receives the class does not):

    @classmethod
    def clean_text(cls, text, tags_for_extracting=()):
        soup = cls(text=text, tags_for_extracting=tags_for_extracting)
        return soup.clean()
Evgenii Legotckoi
  • Dec. 5, 2018, 3:34 p.m.

Thanks for the information, I had not thought about that.

I will have to try it out first.
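
The difference discussed in this thread can be checked without Beautiful Soup at all. In the sketch below, Cleaner and StrictCleaner are hypothetical stand-ins for ESoup and a subclass of it, and clean() only prefixes the text so that it is visible which implementation ran.

```python
class Cleaner:
    """Hypothetical stand-in for ESoup; clean() only marks the text."""

    def __init__(self, text):
        self.text = text

    def clean(self):
        return "base:" + self.text

    @staticmethod
    def clean_text_static(text):
        # The class name is hard-coded, so a subclass is ignored
        return Cleaner(text).clean()

    @classmethod
    def clean_text_class(cls, text):
        # cls is the class the method was called on
        return cls(text).clean()


class StrictCleaner(Cleaner):
    def clean(self):
        return "strict:" + self.text


static_result = StrictCleaner.clean_text_static("x")
class_result = StrictCleaner.clean_text_class("x")

print(static_result)  # base:x
print(class_result)   # strict:x
```

Called on the subclass, the static shortcut still runs the base class logic, while the classmethod picks up the overridden clean().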
