Convert HTML to text #4

remram44 · 2014-01-26T16:56:52Z

Some emails might be HTML; we need to convert those to a readable plain-text version.

remram44 · 2014-01-26T16:58:39Z

BeautifulSoup is surprisingly bad at this. Any ideas?

html = '<p>T<i>e</i>st <b>haha</b></p><p>Other\nline</p>'

from bs4 import BeautifulSoup
BeautifulSoup(html).get_text()
# 'Test hahaOther\nline'
BeautifulSoup(html).get_text(' ')
# 'T e st  haha Other\nline'
BeautifulSoup(html).get_text('\n')
# 'T\ne\nst \nhaha\nOther\nline'
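
The outputs above suggest the underlying problem: `get_text()` joins every text node with the same separator, so it cannot tell inline tags (`<i>`, `<b>`) from block tags (`<p>`). One possible direction, as a rough sketch only, is to walk the markup with the standard library's `html.parser` and break paragraphs only at block-level tags. The names `BLOCK_TAGS`, `TextExtractor` and `html_to_text` are hypothetical, and the list of block tags is an assumption, not a complete one.

```python
import re
from html.parser import HTMLParser

# Assumption: tags that should start or end a paragraph. Not exhaustive.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects text nodes, grouped into paragraphs at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = [[]]  # list of paragraphs, each a list of text chunks

    def _break(self):
        # Start a new paragraph unless the current one is still empty.
        if self.paragraphs[-1]:
            self.paragraphs.append([])

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_data(self, data):
        # Inline tags add no separator, so 'T', 'e', 'st ' join back into 'Test '.
        self.paragraphs[-1].append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    result = []
    for chunks in parser.paragraphs:
        # Collapse runs of whitespace (including newlines) inside a paragraph.
        text = re.sub(r"\s+", " ", "".join(chunks)).strip()
        if text:
            result.append(text)
    return "\n\n".join(result)


html = '<p>T<i>e</i>st <b>haha</b></p><p>Other\nline</p>'
print(repr(html_to_text(html)))
# 'Test haha\n\nOther line'
```

This sketch ignores links, lists, tables and the contents of `<script>` and `<style>`, which real emails would need handled, so it illustrates the inline/block distinction rather than a finished converter.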

remram44 · 2014-01-26T17:18:24Z

Aaron Swartz's html2text seems close enough.

from html2text import HTML2Text
HTML2Text().handle(html)
# 'T_e_st **haha**\n\nOther line\n\n'

remram44 closed this as completed Jan 26, 2014
