XHTMLImporterImpl.convert() can't handle converting custom data attributes for br tags to xhtml #101

Open
JamaicanFriedChicken opened this issue Dec 7, 2023 · 1 comment

JamaicanFriedChicken commented Dec 7, 2023

I have a data attribute on a <br> tag, for example <br data-suggestion="ef0oraskdmd">. When I try to convert the HTML to XHTML, I get the error below:

ERROR org.docx4j.convert.in.xhtml.XHTMLImporterImpl - org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 55991; The element type "br" must be terminated by the matching end-tag "</br>"."

When I remove the data-* attribute, XHTMLImporterImpl converts the content to XHTML without error. How can I work around this issue? Is there a temporary fix I can implement?

docx4j-ImportXHTML - 11.4.6
Java 11
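
For context, the message looks like an XML well-formedness error rather than something specific to data-* attributes: an XML parser treats <br ...> as an open element and then fails at the next closing tag. Below is a minimal sketch that reproduces this outside docx4j, assuming the JDK's built-in parser behaves like the one used during import. The class name `BrWellFormedCheck` and the method `isWellFormed` are hypothetical names for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BrWellFormedCheck {

    // Returns true if the fragment parses as well-formed XML, false otherwise.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Includes SAXParseException, e.g. an unterminated <br> element.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unclosed <br> with an attribute: not well-formed XML.
        System.out.println(isWellFormed(
                "<p>one<br data-suggestion=\"ef0oraskdmd\">two</p>")); // false
        // Self-closed <br /> with the same attribute: well-formed.
        System.out.println(isWellFormed(
                "<p>one<br data-suggestion=\"ef0oraskdmd\" />two</p>")); // true
    }
}
```

If this matches what happens in the real input, the attribute itself is not the problem. The tag just needs to be self-closed before it reaches the importer.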

@jiubafangxing

You can try something like this:

  import java.util.HashMap;
  import java.util.Map;

  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Element;
  import org.jsoup.select.Elements;

  private static String preprocessHtml(String htmlContent) {
        Document doc = Jsoup.parse(htmlContent);
        Elements codeElements = doc.select("code");
        Map<String, String> codeReplacements = new HashMap<>();

        // Replace new lines with <br /> in <code> elements
        for (Element codeElement : codeElements) {
            String codeText = codeElement.html().replace("\n", "<br />");
            codeReplacements.put(codeElement.html(), codeText);
        }

        String processedHtml = doc.html();
        for (Map.Entry<String, String> entry : codeReplacements.entrySet()) {
            processedHtml = processedHtml.replace(entry.getKey(), entry.getValue());
        }

        // Self-close void elements so the result is well-formed XML. The pattern
        // also matches tags that carry attributes, such as
        // <br data-suggestion="...">, and leaves already self-closed tags intact.
        processedHtml = processedHtml.replaceAll("<(img|br|hr)\\b([^>]*?)\\s*/?>", "<$1$2 />");
        return processedHtml;
    }
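
The void-tag step can also be tried on its own, without jsoup. Here is a minimal sketch using only `java.util.regex` that self-closes `br`, `hr` and `img` start tags whether or not they carry attributes such as `data-suggestion`. The class name `VoidTagCloser` and the method `selfCloseVoidTags` are hypothetical, and the pattern assumes attribute values never contain a literal `>`.

```java
import java.util.regex.Pattern;

public class VoidTagCloser {

    // Matches <br>, <hr> and <img> start tags, with or without attributes,
    // whether or not they are already self-closed.
    // Assumption: attribute values do not contain a literal '>'.
    private static final Pattern VOID_TAG =
            Pattern.compile("<(br|hr|img)\\b([^>]*?)\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Rewrites every matched tag as <name attrs />, keeping the attributes.
    public static String selfCloseVoidTags(String html) {
        return VOID_TAG.matcher(html).replaceAll("<$1$2 />");
    }

    public static void main(String[] args) {
        System.out.println(selfCloseVoidTags("a<br data-suggestion=\"ef0oraskdmd\">b"));
        // prints: a<br data-suggestion="ef0oraskdmd" />b
    }
}
```

Running it twice on the same input gives the same result, so it should be safe to apply to HTML that is already partly self-closed. As an alternative to a regex, jsoup can serialize with XML syntax via `doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml)`, which may be worth trying first.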
