Feat: xml validation #660

Draft
wants to merge 14 commits into
base: master

Conversation

Rossi-Luciano
Collaborator

What does this PR do?

This pull request adds XML file validation and CSV report generation. Specifically, it includes:

  • get_xml_tree function: reads the content of an XML file and returns its XML tree.
  • get_data function: retrieves data from JSON files, used as a reference for validations.
  • create_report function: creates a CSV validation report from an XML file.
  • save_csv function: saves the validation results to a CSV file.
  • validate_xml_content function: runs several validations on an XML file, grouped by category, such as article attributes, languages, and article types.
  • Specific validation functions: detailed validations of affiliations, visual abstracts, article languages, and article attributes, among others.

These additions make it easier to check and analyze XML files against predefined criteria, improving the quality and conformity of the processed data.
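
As a rough illustration of the flow described above (parse the XML, run validations that yield result rows, save the rows as CSV), here is a minimal sketch. It uses the standard library's xml.etree.ElementTree, which may differ from the parser used in this PR. The function bodies, the `validate_dtd_version` helper, and the reduced column list are hypothetical stand-ins, not the code under review.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical subset of the report columns; the real report has more.
FIELDNAMES = ["title", "item", "sub_item", "response", "expected_value", "got_value"]


def get_xml_tree(content):
    """Parse XML text and return its root element (stand-in for the PR's function)."""
    return ET.fromstring(content)


def validate_dtd_version(xmltree, expected=("1.1",)):
    """Hypothetical validation: yield one result row for the @dtd-version attribute."""
    got = xmltree.get("dtd-version")
    yield {
        "title": "Article element dtd-version attribute validation",
        "item": "article",
        "sub_item": "@dtd-version",
        "response": "OK" if got in expected else "CRITICAL",
        "expected_value": list(expected),
        "got_value": got,
    }


def save_csv(rows, stream):
    """Write validation rows to an open text stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


xml = '<article dtd-version="1.1" article-type="research-article"/>'
buffer = io.StringIO()
save_csv(validate_dtd_version(get_xml_tree(xml)), buffer)
report = buffer.getvalue()
```

Writing to a stream instead of a path keeps the sketch testable without touching the file system; a real `save_csv` would presumably open the output file itself.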

Where could the review start?

By commit.

How could this be tested manually?

  1. Preparation:
  • Place XML files in the input folder.
  • Create an output folder to store the generated CSV files.
  2. Execution:
    Run the script with the following command:
    python3 packtools/xml_validation.py -i packtools/xml_validation/xmls/ -o packtools/xml_validation/reports/

  3. Verification:
    Check the CSV files generated in the output folder to make sure the validation results are correct and complete.
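
For readers who want to picture how the `-i` and `-o` options in the command above might be handled, here is a minimal sketch using argparse. The option destinations, the help texts, the `plan_reports` helper, and the one-CSV-per-XML naming rule are all assumptions for illustration; the script in this PR may be organized differently.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the -i/-o options used in the manual test command."""
    parser = argparse.ArgumentParser(description="Validate XML files and write CSV reports")
    parser.add_argument("-i", dest="input_dir", required=True, help="folder with XML files")
    parser.add_argument("-o", dest="output_dir", required=True, help="folder for CSV reports")
    return parser.parse_args(argv)


def plan_reports(input_dir, output_dir):
    """Map each XML file in input_dir to the CSV path its report would use.

    Assumption: one report per XML file, named after the XML file's stem.
    """
    output = Path(output_dir)
    return {
        xml_path: output / f"{xml_path.stem}.csv"
        for xml_path in sorted(Path(input_dir).glob("*.xml"))
    }


# Parse an explicit argument list so the sketch runs without a real command line.
args = parse_args(["-i", "xmls/", "-o", "reports/"])
```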

Example of the generated CSV:

title,parent,parent_id,parent_article_type,parent_lang,item,sub_item,validation_type,response,expected_value,got_value,message,advice,data,group,exception,exc_traceback,function,sps_pkg_name
Journal acronym element validation,article,,research-article,en,journal-meta,"journal-id[@journal-id-type=""publisher-id""]",value,CRITICAL,aaa,rlae,"Got rlae, expected aaa",Provide an acronym value as expected: aaa,{'acronym': 'rlae'},journal,,,,
Publisher name element validation,article,,research-article,en,publisher,publisher-name,value,CRITICAL,aaa,Escola de Enfermagem de Ribeirão Preto / Universidade de São Paulo,"Got Escola de Enfermagem de Ribeirão Preto / Universidade de São Paulo, expected aaa",Provide the expected publisher name: aaa,,journal,,,,
Publisher name element validation,article,,research-article,en,publisher,publisher-name,value,CRITICAL,"['aaa', 'bbb']",['Escola de Enfermagem de Ribeirão Preto / Universidade de São Paulo'],"Got ['Escola de Enfermagem de Ribeirão Preto / Universidade de São Paulo'], expected ['aaa', 'bbb']",Complete the following items in the XML: bbb,,journal,,,,
Journal ID element validation,article,,research-article,en,journal-meta,journal-id,value,CRITICAL,ccc,Rev Lat Am Enfermagem,"Got Rev Lat Am Enfermagem, expected ccc",Provide an nlm-ta value as expected: ccc,,journal,,,,
Article element dtd-version attribute validation,article,,research-article,en,article,@dtd-version,value in list,OK,['1.1'],1.1,"Got 1.1, expected ['1.1']",,"{'parent': 'article', 'lang': 'en', 'article_type': 'research-article', 'article_id': None, 'line_number': 1, 'subject': 'Original Article', 'specific_use': 'sps-1.9', 'dtd_version': '1.1'}",article attributes,,,,
Article element specific-use attribute validation,article,,research-article,en,article,@specific-use,value in list,OK,"['sps-1.1', 'sps-1.2', 'sps-1.3', 'sps-1.4', 'sps-1.5', 'sps-1.6', 'sps-1.7', 'sps-1.8', 'sps-1.9', 'sps-1.10']",sps-1.9,"Got sps-1.9, expected ['sps-1.1', 'sps-1.2', 'sps-1.3', 'sps-1.4', 'sps-1.5', 'sps-1.6', 'sps-1.7', 'sps-1.8', 'sps-1.9', 'sps-1.10']",,"{'parent': 'article', 'lang': 'en', 'article_type': 'research-article', 'article_id': None, 'line_number': 1, 'subject': 'Original Article', 'specific_use': 'sps-1.9', 'dtd_version': '1.1'}",article attributes,,,,
Article element lang attribute validation,article,,research-article,en,article,@xml:lang,value in list,OK,"['aa', 'ab', 'ae', 'af', 'ak', 'am', 'an', 'ar', 'as', 'av', 'ay', 'az', 'ba', 'be', 'bg', 'bh', 'bi', 'bm', 'bn', 'bo', 'br', 'bs', 'ca', 'ce', 'ch', 'co', 'cr', 'cs', 'cu', 'cv', 'cy', 'da', 'de', 'dv', 'dz', 'ee', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'ff', 'fi', 'fj', 'fo', 'fr', 'fy', 'ga', 'gd', 'gl', 'gn', 'gu', 'gv', 'ha', 'he', 'hi', 'ho', 'hr', 'ht', 'hu', 'hy', 'hz', 'ia', 'id', 'ie', 'ig', 'ii', 'ik', 'io', 'is', 'it', 'iu', 'ja', 'jv', 'ka', 'kg', 'ki', 'kj', 'kk', 'kl', 'km', 'kn', 'ko', 'kr', 'ks', 'ku', 'kv', 'kw', 'ky', 'la', 'lb', 'lg', 'li', 'ln', 'lo', 'lt', 'lu', 'lv', 'mg', 'mh', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'na', 'nb', 'nd', 'ne', 'ng', 'nl', 'nn', 'no', 'nr', 'nv', 'ny', 'oc', 'oj', 'om', 'or', 'os', 'pa', 'pi', 'pl', 'ps', 'pt', 'qu', 'rm', 'rn', 'ro', 'ru', 'rw', 'sa', 'sc', 'sd', 'se', 'sg', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'ss', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'ti', 'tk', 'tl', 'tn', 'to', 'tr', 'ts', 'tt', 'tw', 'ty', 'ug', 'uk', 'ur', 'uz', 've', 'vi', 'vo', 'wa', 'wo', 'xh', 'yi', 'yo', 'za', 'zh', 'zu']",en,"Got en, expected ['aa', 'ab', 'ae', 'af', 'ak', 'am', 'an', 'ar', 'as', 'av', 'ay', 'az', 'ba', 'be', 'bg', 'bh', 'bi', 'bm', 'bn', 'bo', 'br', 'bs', 'ca', 'ce', 'ch', 'co', 'cr', 'cs', 'cu', 'cv', 'cy', 'da', 'de', 'dv', 'dz', 'ee', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'ff', 'fi', 'fj', 'fo', 'fr', 'fy', 'ga', 'gd', 'gl', 'gn', 'gu', 'gv', 'ha', 'he', 'hi', 'ho', 'hr', 'ht', 'hu', 'hy', 'hz', 'ia', 'id', 'ie', 'ig', 'ii', 'ik', 'io', 'is', 'it', 'iu', 'ja', 'jv', 'ka', 'kg', 'ki', 'kj', 'kk', 'kl', 'km', 'kn', 'ko', 'kr', 'ks', 'ku', 'kv', 'kw', 'ky', 'la', 'lb', 'lg', 'li', 'ln', 'lo', 'lt', 'lu', 'lv', 'mg', 'mh', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'na', 'nb', 'nd', 'ne', 'ng', 'nl', 'nn', 'no', 'nr', 'nv', 'ny', 'oc', 'oj', 'om', 'or', 'os', 'pa', 
'pi', 'pl', 'ps', 'pt', 'qu', 'rm', 'rn', 'ro', 'ru', 'rw', 'sa', 'sc', 'sd', 'se', 'sg', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'ss', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'ti', 'tk', 'tl', 'tn', 'to', 'tr', 'ts', 'tt', 'tw', 'ty', 'ug', 'uk', 'ur', 'uz', 've', 'vi', 'vo', 'wa', 'wo', 'xh', 'yi', 'yo', 'za', 'zh', 'zu']",,"{'parent': 'article', 'lang': 'en', 'article_type': 'research-article', 'article_id': None, 'line_number': 1, 'subject': 'Original Article'}",article attributes,,,,
Article element lang attribute validation,sub-article,s1,translation,pt,sub-article,@xml:lang,value in list,OK,"['aa', 'ab', 'ae', 'af', 'ak', 'am', 'an', 'ar', 'as', 'av', 'ay', 'az', 'ba', 'be', 'bg', 'bh', 'bi', 'bm', 'bn', 'bo', 'br', 'bs', 'ca', 'ce', 'ch', 'co', 'cr', 'cs', 'cu', 'cv', 'cy', 'da', 'de', 'dv', 'dz', 'ee', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'ff', 'fi', 'fj', 'fo', 'fr', 'fy', 'ga', 'gd', 'gl', 'gn', 'gu', 'gv', 'ha', 'he', 'hi', 'ho', 'hr', 'ht', 'hu', 'hy', 'hz', 'ia', 'id', 'ie', 'ig', 'ii', 'ik', 'io', 'is', 'it', 'iu', 'ja', 'jv', 'ka', 'kg', 'ki', 'kj', 'kk', 'kl', 'km', 'kn', 'ko', 'kr', 'ks', 'ku', 'kv', 'kw', 'ky', 'la', 'lb', 'lg', 'li', 'ln', 'lo', 'lt', 'lu', 'lv', 'mg', 'mh', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'na', 'nb', 'nd', 'ne', 'ng', 'nl', 'nn', 'no', 'nr', 'nv', 'ny', 'oc', 'oj', 'om', 'or', 'os', 'pa', 'pi', 'pl', 'ps', 'pt', 'qu', 'rm', 'rn', 'ro', 'ru', 'rw', 'sa', 'sc', 'sd', 'se', 'sg', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'ss', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'ti', 'tk', 'tl', 'tn', 'to', 'tr', 'ts', 'tt', 'tw', 'ty', 'ug', 'uk', 'ur', 'uz', 've', 'vi', 'vo', 'wa', 'wo', 'xh', 'yi', 'yo', 'za', 'zh', 'zu']",pt,"Got pt, expected ['aa', 'ab', 'ae', 'af', 'ak', 'am', 'an', 'ar', 'as', 'av', 'ay', 'az', 'ba', 'be', 'bg', 'bh', 'bi', 'bm', 'bn', 'bo', 'br', 'bs', 'ca', 'ce', 'ch', 'co', 'cr', 'cs', 'cu', 'cv', 'cy', 'da', 'de', 'dv', 'dz', 'ee', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'ff', 'fi', 'fj', 'fo', 'fr', 'fy', 'ga', 'gd', 'gl', 'gn', 'gu', 'gv', 'ha', 'he', 'hi', 'ho', 'hr', 'ht', 'hu', 'hy', 'hz', 'ia', 'id', 'ie', 'ig', 'ii', 'ik', 'io', 'is', 'it', 'iu', 'ja', 'jv', 'ka', 'kg', 'ki', 'kj', 'kk', 'kl', 'km', 'kn', 'ko', 'kr', 'ks', 'ku', 'kv', 'kw', 'ky', 'la', 'lb', 'lg', 'li', 'ln', 'lo', 'lt', 'lu', 'lv', 'mg', 'mh', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'na', 'nb', 'nd', 'ne', 'ng', 'nl', 'nn', 'no', 'nr', 'nv', 'ny', 'oc', 'oj', 'om', 'or', 'os', 
'pa', 'pi', 'pl', 'ps', 'pt', 'qu', 'rm', 'rn', 'ro', 'ru', 'rw', 'sa', 'sc', 'sd', 'se', 'sg', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'ss', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'ti', 'tk', 'tl', 'tn', 'to', 'tr', 'ts', 'tt', 'tw', 'ty', 'ug', 'uk', 'ur', 'uz', 've', 'vi', 'vo', 'wa', 'wo', 'xh', 'yi', 'yo', 'za', 'zh', 'zu']",,"{'parent': 'sub-article', 'lang': 'pt', 'article_type': 'translation', 'article_id': 's1', 'line_number': 1306, 'subject': 'Artigo Original'}",article attributes,,,,
Article element lang attribute validation,sub-article,s2,translation,es,sub-article,@xml:lang,value in list,OK,"['aa', 'ab', 'ae', 'af', 'ak', 'am', 'an', 'ar', 'as', 'av', 'ay', 'az', 'ba', 'be', 'bg', 'bh', 'bi', 'bm', 'bn', 'bo', 'br', 'bs', 'ca', 'ce', 'ch', 'co', 'cr', 'cs', 'cu', 'cv', 'cy', 'da', 'de', 'dv', 'dz', 'ee', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'ff', 'fi', 'fj', 'fo', 'fr', 'fy', 'ga', 'gd', 'gl', 'gn', 'gu', 'gv', 'ha', 'he', 'hi', 'ho', 'hr', 'ht', 'hu', 'hy', 'hz', 'ia', 'id', 'ie', 'ig', 'ii', 'ik', 'io', 'is', 'it', 'iu', 'ja', 'jv', 'ka', 'kg', 'ki', 'kj', 'kk', 'kl', 'km', 'kn', 'ko', 'kr', 'ks', 'ku', 'kv', 'kw', 'ky', 'la', 'lb', 'lg', 'li', 'ln', 'lo', 'lt', 'lu', 'lv', 'mg', 'mh', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'na', 'nb', 'nd', 'ne', 'ng', 'nl', 'nn', 'no', 'nr', 'nv', 'ny', 'oc', 'oj', 'om', 'or', 'os', 'pa', 'pi', 'pl', 'ps', 'pt', 'qu', 'rm', 'rn', 'ro', 'ru', 'rw', 'sa', 'sc', 'sd', 'se', 'sg', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'ss', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'ti', 'tk', 'tl', 'tn', 'to', 'tr', 'ts', 'tt', 'tw', 'ty', 'ug', 'uk', 'ur', 'uz', 've', 'vi', 'vo', 'wa', 'wo', 'xh', 'yi', 'yo', 'za', 'zh', 'zu']",es,"Got es, expected ['aa', 'ab', 'ae', 'af', 'ak', 'am', 'an', 'ar', 'as', 'av', 'ay', 'az', 'ba', 'be', 'bg', 'bh', 'bi', 'bm', 'bn', 'bo', 'br', 'bs', 'ca', 'ce', 'ch', 'co', 'cr', 'cs', 'cu', 'cv', 'cy', 'da', 'de', 'dv', 'dz', 'ee', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'ff', 'fi', 'fj', 'fo', 'fr', 'fy', 'ga', 'gd', 'gl', 'gn', 'gu', 'gv', 'ha', 'he', 'hi', 'ho', 'hr', 'ht', 'hu', 'hy', 'hz', 'ia', 'id', 'ie', 'ig', 'ii', 'ik', 'io', 'is', 'it', 'iu', 'ja', 'jv', 'ka', 'kg', 'ki', 'kj', 'kk', 'kl', 'km', 'kn', 'ko', 'kr', 'ks', 'ku', 'kv', 'kw', 'ky', 'la', 'lb', 'lg', 'li', 'ln', 'lo', 'lt', 'lu', 'lv', 'mg', 'mh', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'na', 'nb', 'nd', 'ne', 'ng', 'nl', 'nn', 'no', 'nr', 'nv', 'ny', 'oc', 'oj', 'om', 'or', 'os', 
'pa', 'pi', 'pl', 'ps', 'pt', 'qu', 'rm', 'rn', 'ro', 'ru', 'rw', 'sa', 'sc', 'sd', 'se', 'sg', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'ss', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'ti', 'tk', 'tl', 'tn', 'to', 'tr', 'ts', 'tt', 'tw', 'ty', 'ug', 'uk', 'ur', 'uz', 've', 'vi', 'vo', 'wa', 'wo', 'xh', 'yi', 'yo', 'za', 'zh', 'zu']",,"{'parent': 'sub-article', 'lang': 'es', 'article_type': 'translation', 'article_id': 's2', 'line_number': 1526, 'subject': 'Artículo Original'}",article attributes,,,,
Article type validation,article,,research-article,en,article,@article-type,value in list,OK,"['article-commentary', 'book-review', 'brief-report', 'case-report', 'correction', 'editorial', 'in-brief', 'letter', 'other', 'partial-retraction', 'rapid-communication', 'reply', 'research-article', 'retraction', 'review-article', 'data-article']",research-article,"Got research-article, expected ['article-commentary', 'book-review', 'brief-report', 'case-report', 'correction', 'editorial', 'in-brief', 'letter', 'other', 'partial-retraction', 'rapid-communication', 'reply', 'research-article', 'retraction', 'review-article', 'data-article']",,"{'parent': 'article', 'lang': 'en', 'article_type': 'research-article', 'article_id': None, 'line_number': 1, 'subject': 'Original Article', 'specific_use': 'sps-1.9', 'dtd_version': '1.1'}",article attributes,,,,
Article type validation,sub-article,s1,translation,pt,article,@article-type,value in list,OK,"['translation', 'editorial', 'announcement', 'correction', 'retraction', 'letter', 'brief-report', 'addendum', 'reply', 'discussion', 'case-report', 'obituary', 'book-review', 'in-memoriam', 'news', 'other']",translation,"Got translation, expected ['translation', 'editorial', 'announcement', 'correction', 'retraction', 'letter', 'brief-report', 'addendum', 'reply', 'discussion', 'case-report', 'obituary', 'book-review', 'in-memoriam', 'news', 'other']",,"{'parent': 'sub-article', 'lang': 'pt', 'article_type': 'translation', 'article_id': 's1', 'line_number': 1306, 'subject': 'Artigo Original', 'specific_use': 'sps-1.9', 'dtd_version': '1.1'}",article attributes,,,,
Article type validation,sub-article,s2,translation,es,article,@article-type,value in list,OK,"['translation', 'editorial', 'announcement', 'correction', 'retraction', 'letter', 'brief-report', 'addendum', 'reply', 'discussion', 'case-report', 'obituary', 'book-review', 'in-memoriam', 'news', 'other']",translation,"Got translation, expected ['translation', 'editorial', 'announcement', 'correction', 'retraction', 'letter', 'brief-report', 'addendum', 'reply', 'discussion', 'case-report', 'obituary', 'book-review', 'in-memoriam', 'news', 'other']",,"{'parent': 'sub-article', 'lang': 'es', 'article_type': 'translation', 'article_id': 's2', 'line_number': 1526, 'subject': 'Artículo Original', 'specific_use': 'sps-1.9', 'dtd_version': '1.1'}",article attributes,,,,
,,,,,,,,,,,,,,,Function requires list of subjects,<traceback object at 0x7a5195316740>,article attributes,article-abstract-en-sub-articles-pt-es.xml
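
A report like the one above can be checked quickly by counting rows per `response` value and listing the rows that are not OK. The sketch below is one possible way to do that with the standard csv module. The `summarize` helper and the two abridged sample rows are illustrative only; they reuse column names from the header above but are not part of this PR.

```python
import csv
import io
from collections import Counter

# Two abridged rows using a subset of the report's columns.
SAMPLE = '''title,item,sub_item,response,expected_value,got_value
Journal ID element validation,journal-meta,journal-id,CRITICAL,ccc,Rev Lat Am Enfermagem
Article type validation,article,@article-type,OK,"['research-article', 'review-article']",research-article
'''


def summarize(stream):
    """Count rows per response level and collect the rows that are not OK."""
    counts = Counter()
    problems = []
    for row in csv.DictReader(stream):
        counts[row["response"]] += 1
        if row["response"] != "OK":
            problems.append(row)
    return counts, problems


counts, problems = summarize(io.StringIO(SAMPLE))
```

When reading a real report from disk, opening the file with `newline=""` lets the csv module handle quoted fields that span more than one line. Rows that record an exception have an empty `response` and would also be collected as problems by this sketch.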

Screenshots

NA

What are the relevant tickets?

NA

References

NA

@Rossi-Luciano Rossi-Luciano marked this pull request as draft July 30, 2024 23:20
def get_data(filename, key, sps_version=None):
    sps_version = sps_version or "default"
    # Reads contents with UTF-8 encoding and returns str.
    content = files(f'packtools.sps.sps_versions.{sps_version}').joinpath(f"{filename}.json").read_text()
Member

@Rossi-Luciano

def get_data(filename, key, sps_version=None):
    sps_version = sps_version or "default"
    # Read the JSON resource from the package as UTF-8 text.
    content = (
        files("packtools.sps.sps_versions")
        .joinpath(sps_version)
        .joinpath(f"{filename}.json")
        .read_text(encoding="utf-8")
    )
    # Collapse whitespace so trailing commas always appear as ", ]" or ", }".
    normalized = " ".join(content.split())
    fixed = normalized.replace(", ]", "]").replace(", }", "}")
    data = json.loads(fixed)
    return data[key]
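
The whitespace normalization and trailing-comma cleanup in the suggestion above can be tried in isolation. The following is a minimal sketch of that technique; `load_lenient_json` and the sample JSON text are hypothetical names and data chosen for illustration.

```python
import json


def load_lenient_json(text):
    """Parse JSON text that may contain trailing commas before ] or }."""
    # Collapse every run of whitespace (including newlines) into one space, so
    # a trailing comma followed by whitespace always appears as ", ]" or ", }".
    normalized = " ".join(text.split())
    fixed = normalized.replace(", ]", "]").replace(", }", "}")
    return json.loads(fixed)


raw = """
{
    "dtd_version_list": [
        "1.1",
    ],
}
"""
data = load_lenient_json(raw)
```

Two limits of this approach are worth keeping in mind: it also collapses whitespace inside string values, and it does not catch a trailing comma written with no whitespace before the closing bracket (for example `["1.1",]`).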

@@ -0,0 +1,41 @@
{
Member

@Rossi-Luciano move all the JSON files back to sps-1.9 and sps-1.10

parent_id=None,
parent_article_type=self.xmltree.get("article-type"),
parent_lang=self.xmltree.get("{http://www.w3.org/XML/1998/namespace}lang"),
item="journal-meta",
Member

@robertatakenaka robertatakenaka Jul 30, 2024

@Rossi-Luciano

item="journal-id"
sub_item="@journal-id-type='publisher-id'"

@@ -359,7 +364,7 @@ def validate(self, expected_values):
 nlm_ta = JournalIdValidation(self.xmltree)
 
 resp_journal_meta = list(issn.validate_issn(expected_values['issns'])) + \
-acronym.acronym_validation(expected_values['acronym']) + \
+list(acronym.acronym_validation(expected_values['acronym'])) + \
Member

@Rossi-Luciano replace this snippet with yield or yield from. That might even make this method unnecessary.
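
To make the `yield from` suggestion concrete, here is a minimal sketch of chaining generator-based validators without converting each one to a list and concatenating with `+`. The validator functions, their row contents, and the placeholder expected values are hypothetical; only the `issns` and `acronym` keys come from the diff above.

```python
def validate_issn(expected):
    """Hypothetical validator that yields result rows."""
    yield {"title": "ISSN validation", "expected_value": expected}


def validate_acronym(expected):
    """Hypothetical validator that yields result rows."""
    yield {"title": "Journal acronym element validation", "expected_value": expected}


def validate(expected_values):
    """Chain validators lazily instead of concatenating lists with +."""
    yield from validate_issn(expected_values["issns"])
    yield from validate_acronym(expected_values["acronym"])


# Placeholder expected values, for illustration only.
results = list(validate({"issns": {"ppub": "0000-0000"}, "acronym": "aaa"}))
```

With this shape, callers that need a list can still wrap the call in `list(...)`, while callers that stream rows into a CSV writer can consume the generator directly.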

Comment on lines 134 to 135
sps_version = sps_version.replace("-", "_")
sps_version = sps_version.replace(".", "_")
Member

@Rossi-Luciano remove:

    sps_version = sps_version.replace("-", "_")
    sps_version = sps_version.replace(".", "_")

Member

@robertatakenaka robertatakenaka left a comment

@Rossi-Luciano please check the comments
