
Nodes with single value in multivalued field triggering validation errors #354

Open
kevinschaper opened this issue Nov 2, 2021 · 11 comments

@kevinschaper
Collaborator

I noticed that I was getting a lot of these validation errors:

[ERROR][INVALID_NODE_PROPERTY_VALUE_TYPE] FB:FBrf0013765 - Multi-valued node property 'xref' expected to be of type '<class 'list'>'
[ERROR][INVALID_NODE_PROPERTY_VALUE_TYPE] FB:FBrf0013765 - Multi-valued node property 'authors' expected to be of type '<class 'list'>'
[ERROR][INVALID_NODE_PROPERTY_VALUE_TYPE] FB:FBrf0000035 - Multi-valued node property 'xref' expected to be of type '<class 'list'>'

When I looked at the file, I found that I had single values in multi-valued fields (these were publication nodes, so the fields were xref and authors). For example, one row, laid out here as column: value pairs:

id:            …
category:      biolink:Publication|biolink:NamedThing
name:          Centrosome inheritance in the male germ line of Drosophila requires hu-li tai-shao function.
xref:          FB:FBrf0187243
provided_by:
authors:       P.G. Wilson
creation_date: 2005
keywords:
mesh_terms:
summary:       ....abstract text...
type:          IAO:0000013

It looks like FB:FBrf0187243 and P.G. Wilson aren't recognized as lists simply because they only have a single value. Am I right that the format for a multivalued field doesn't use any brackets to denote a list, just the pipe separator?

I'm not sure what the right approach to fixing this is, but my guess is that the TSV reader should be biolink-aware enough to wrap single values in a list if the field is multivalued? Something roughly like the sketch below.
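
A minimal sketch of the kind of coercion I have in mind, assuming a hard-coded set of multivalued columns and the "|" delimiter (both are placeholders here, not KGX's actual internals):

```python
# Hypothetical helper, not KGX code: the column set and delimiter are assumptions.
MULTIVALUED_COLUMNS = {"category", "xref", "authors", "keywords", "mesh_terms"}
LIST_DELIMITER = "|"

def coerce_multivalued(row: dict) -> dict:
    """Turn pipe-delimited strings (or lone values) into lists for multivalued columns."""
    for column, value in row.items():
        if column in MULTIVALUED_COLUMNS and isinstance(value, str):
            # "a|b|c" -> ["a", "b", "c"]; a lone value "a" -> ["a"]
            row[column] = value.split(LIST_DELIMITER) if value else []
    return row

# e.g. coerce_multivalued({"id": "FB:FBrf0187243", "authors": "P.G. Wilson"})
#      -> {"id": "FB:FBrf0187243", "authors": ["P.G. Wilson"]}
```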

@kevinschaper kevinschaper added the bug Something isn't working label Nov 2, 2021
@deepakunni3
Member

@kevinschaper This is definitely a bug in KGX. It should be aware that xref, authors (and other fields) are multi-valued and parse them as such.

The fix should be straightforward. I'll have a PR for this soon. I think the way the multivalued properties are parsed is a little brittle (my bad!). This can be a good opportunity to tidy a few things up :)
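
Roughly, the idea is to ask the Biolink Model which properties are multivalued rather than hard-coding them. A hedged sketch of that idea using the Biolink Model Toolkit (this is not the actual KGX code, and the helper names are placeholders):

```python
# Sketch only, not the KGX implementation; assumes bmt slot elements expose a
# `multivalued` flag (treat that as an assumption to verify).
from bmt import Toolkit

toolkit = Toolkit()

def is_multivalued(slot_name: str) -> bool:
    """Ask the Biolink Model whether a node/edge property is multivalued."""
    element = toolkit.get_element(slot_name)
    return bool(element is not None and getattr(element, "multivalued", False))

def normalize_value(slot_name: str, value):
    """Split pipe-delimited strings, and wrap lone values, for multivalued slots."""
    if is_multivalued(slot_name) and isinstance(value, str):
        return value.split("|") if value else []
    return value
```

Whatever the final shape, doing this in one place in the TSV reader would keep the multivalued handling consistent across node and edge properties.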

@sierra-moxon
Member

@kevinschaper - is this impacting the pydantic validation you're doing downstream?

@kevinschaper
Collaborator Author

Whoops, I should have replied a year ago. I was just looking at whether it would be practical to put kgx validate back into my pipeline and I noticed this problem again.

I'm not sure if I made a separate issue for it at the time, but the biggest challenge is definitely the number of repeats for each type of error. It would be much more useful if it reported only the first N occurrences (5, 10, 30?) of each error rather than an exhaustive list of every record that failed.

@RichardBruskiewich
Collaborator

> I'm not sure if I made a separate issue for it at the time, but the biggest challenge is definitely the number of repeats for each type of error. It would be much more useful if it reported only the first N occurrences (5, 10, 30?) of each error rather than an exhaustive list of every record that failed.

Hmm... didn't we fix the KGX reporting some time ago, to be less verbose?

@goodb

goodb commented Nov 2, 2022

Hi folks. I've been playing with kgx a bit and ran into the same problem. I added the snippet below to my validation test to make the output a little more tractable, and it seems to work for me in a validate/fix/validate dev cycle.

```python
# n_error_examples caps how many offending records are printed per error type
n_error_examples = 10

validator.validate(the_graph)
if len(validator.errors) > 0:
    for e in validator.errors:
        error_dict = validator.errors[e]
        for k in error_dict.keys():
            print("root error type:", k, "specific errors:", list(error_dict[k].keys()))
            for ek in error_dict[k].keys():
                print(ek, "element count", len(error_dict[k][ek]))
                # print only the first n_error_examples offending elements
                for n, element in enumerate(error_dict[k][ek], start=1):
                    print(element)
                    if n >= n_error_examples:
                        break
        print("error")  # separator between top-level error groups
```

@RichardBruskiewich
Collaborator

Resolved by #415 (it will be in master once the PR is processed; not sure when the patch will show up in an official release...)

@goodb

goodb commented Nov 2, 2022

@RichardBruskiewich the problem with reporting very large numbers of errors isn't limited to the case here. I suppose that is a separate issue, more related to the interface with the validation system.

@RichardBruskiewich
Collaborator

Hi @goodb,
As I noted earlier in this issue, I seem to recall that we reworked the error reporting some time ago to be somewhat more concise and possibly hierarchical. That said, I haven't looked at that part of the code base in well over a year.

I'll take another look over the next few days to confirm the status of things. If the validation reporting is still problematic, we may want to open a separate ticket.

I'm also going to compare this with our latest Translator validation code (the reasoner-validator package on pypi.org), which validates Translator web service data (including knowledge graphs), to see whether there is enough overlap in the validation processes to apply a DRY upgrade to the KGX validation.

Cheers
Richard

@kevinschaper kevinschaper reopened this Mar 16, 2023
@kevinschaper
Collaborator Author

It looks like #415 never made it in. I went looking for this issue because I just hit the error again with kgx 1.7.2. I wasn't able to run 2.0 from the CLI, but I'll try running it as a module.

@RichardBruskiewich
Collaborator

Hi @kevinschaper, I've just pinged Sierra to inquire about this since I'm not that active on the KGX front right now.

@sierra-moxon
Member

sierra-moxon commented Mar 21, 2023

I went ahead and fixed up the changes in #415 to satisfy the tests (see #439); I don't think they solve this issue entirely, so I'm keeping it open pending further changes.
