Duplication #73

Open
ericaVoss opened this issue Aug 31, 2015 · 3 comments

Comments

@ericaVoss
Contributor

Today, while searching for an example for my paper, I noticed a lot of duplication inside LAERTES.

  1. I made a comment here: Bad SemMed counts and linkouts - accidentally creating a partial crossproduct instead of a selective join #63
  2. I would have assumed there were only two rows here:
    SELECT *
    FROM DRUG_HOI_EVIDENCE
    WHERE DRUG_HOI_RELATIONSHIP = '19059547-4066289'
  3. In this one MESH looks okay, but there is still duplication on AERS and SEMMEDDB:
    SELECT *
    FROM DRUG_HOI_EVIDENCE
    WHERE DRUG_HOI_RELATIONSHIP = '1309944-4066289'
  4. Here is one more: there is no duplication in AERS or MESH, but there is in SEMMEDDB:
    SELECT *
    FROM DRUG_HOI_EVIDENCE
    WHERE DRUG_HOI_RELATIONSHIP = '914335-444070'
    ORDER BY 3,4
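
A quick way to spot this kind of duplication without eyeballing every row is to group the evidence rows for one relationship and keep only the groups with more than one row. The sketch below does that in Python against an in-memory SQLite table. Only the table name DRUG_HOI_EVIDENCE and the column DRUG_HOI_RELATIONSHIP come from the queries above; the `evidence_type` and `statistic_value` columns, the ids, and the rows are hypothetical stand-ins, so adjust them to the real schema.

```python
import sqlite3


def duplicated_evidence(conn, relationship):
    """Return (evidence_type, row_count) for evidence types listed more than once."""
    sql = """
        SELECT evidence_type, COUNT(*) AS n
        FROM drug_hoi_evidence
        WHERE drug_hoi_relationship = ?
        GROUP BY evidence_type
        HAVING COUNT(*) > 1
        ORDER BY evidence_type
    """
    return conn.execute(sql, (relationship,)).fetchall()


# Toy data only: column names other than drug_hoi_relationship are assumed,
# and the ids and evidence types are placeholders, not real LAERTES values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE drug_hoi_evidence ("
    "drug_hoi_relationship TEXT, evidence_type TEXT, statistic_value REAL)"
)
conn.executemany(
    "INSERT INTO drug_hoi_evidence VALUES (?, ?, ?)",
    [
        ("111-222", "SOURCE_A", 5),
        ("111-222", "SOURCE_B", 2),
        ("111-222", "SOURCE_B", 2),
        ("333-444", "SOURCE_B", 7),
    ],
)

dupes = duplicated_evidence(conn, "111-222")
print(dupes)
```

An empty result for a relationship would mean each evidence type appears at most once for it.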
@ericaVoss ericaVoss added this to the Fall 2015 LAERTES Update milestone Aug 31, 2015
@cgreich

cgreich commented Aug 31, 2015

Erica. Is this a vocabulary problem?

@ericaVoss
Contributor Author

@cgreich - I don't think so - but we have to double check still.

@rkboyce
Contributor

rkboyce commented Oct 16, 2015

Ben also noticed this issue:

Looking here:

http://api.ohdsi.org/WebAPI/CS1/evidence/drug/19016237

I see the same concept code (436666) twice for three evidence sources:

{"EVIDENCE":"MEDLINE_SemMedDB_CR","MODALITY":true,"LINKOUT":"http://dbmi-icode-01.dbmi.pitt.edu/l/index.php?id=pm-semmed-26728","STATISTIC_TYPE":"COUNT","HOI":"436666","COUNT":2}

{"EVIDENCE":"MEDLINE_SemMedDB_Other","MODALITY":true,"LINKOUT":"http://dbmi-icode-01.dbmi.pitt.edu/l/index.php?id=pm-semmed-26774","STATISTIC_TYPE":"COUNT","HOI":"436666","COUNT":47}

{"EVIDENCE":"MEDLINE_SemMedDB_ClinTrial","MODALITY":true,"LINKOUT":"http://dbmi-icode-01.dbmi.pitt.edu/l/index.php?id=pm-semmed-30508","STATISTIC_TYPE":"COUNT","HOI":"436666","COUNT":10}

{"EVIDENCE":"MEDLINE_SemMedDB_CR","MODALITY":true,"LINKOUT":"http://dbmi-icode-01.dbmi.pitt.edu/l/index.php?id=pm-semmed-88519","STATISTIC_TYPE":"COUNT","HOI":"436666","COUNT":2}

{"EVIDENCE":"MEDLINE_SemMedDB_Other","MODALITY":true,"LINKOUT":"http://dbmi-icode-01.dbmi.pitt.edu/l/index.php?id=pm-semmed-88565","STATISTIC_TYPE":"COUNT","HOI":"436666","COUNT":46}

{"EVIDENCE":"MEDLINE_SemMedDB_ClinTrial","MODALITY":true,"LINKOUT":"http://dbmi-icode-01.dbmi.pitt.edu/l/index.php?id=pm-semmed-99657","STATISTIC_TYPE":"COUNT","HOI":"436666","COUNT":10}

The counts are mismatched for one pair as well. The linkout URLs differ. Any thoughts?
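
To make the pattern in the records above explicit, here is a small Python sketch that groups them by evidence source and HOI and reports every group that occurs more than once. The field values are copied from the JSON above; the `LINKOUT_ID` key is a shorthand I introduce for the `id` parameter of each linkout URL, and the helper name is hypothetical.

```python
from collections import defaultdict

# (EVIDENCE, linkout id, COUNT) taken from the WebAPI response quoted above.
raw = [
    ("MEDLINE_SemMedDB_CR", "pm-semmed-26728", 2),
    ("MEDLINE_SemMedDB_Other", "pm-semmed-26774", 47),
    ("MEDLINE_SemMedDB_ClinTrial", "pm-semmed-30508", 10),
    ("MEDLINE_SemMedDB_CR", "pm-semmed-88519", 2),
    ("MEDLINE_SemMedDB_Other", "pm-semmed-88565", 46),
    ("MEDLINE_SemMedDB_ClinTrial", "pm-semmed-99657", 10),
]
records = [
    {"EVIDENCE": ev, "HOI": "436666", "LINKOUT_ID": link, "COUNT": n}
    for ev, link, n in raw
]


def find_duplicates(records):
    """Group records by (EVIDENCE, HOI) and keep groups with more than one entry."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["EVIDENCE"], rec["HOI"])].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


duplicates = find_duplicates(records)
for (evidence, hoi), recs in duplicates.items():
    counts = [r["COUNT"] for r in recs]
    links = [r["LINKOUT_ID"] for r in recs]
    print(evidence, hoi, counts, links)
```

All three evidence sources come back as duplicated, and only the MEDLINE_SemMedDB_Other pair disagrees on the count.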

I did some checking and found that the reason for the duplications is as follows:

  • SemMed codes all named entities using UMLS concepts
  • Many UMLS concepts can be mapped to terms in multiple vocabularies. This leads to multiple ways to get to an RxNorm drug and a SNOMED HOI. E.g., in the example below, the UMLS concept is C0033953 (sexual dysfunction), which is mapped to:
---------------------
concept_id |     concept_name      | domain_id | vocabulary_id | concept_class_id | standard_concept | concept_code | valid_start_date | valid_end_date | invalid_reason
------------+-----------------------+-----------+---------------+------------------+------------------+--------------+------------------+----------------+----------------
   36919181 | Psychosexual disorder | Condition | MedDRA        | PT               | C                | 10037222     | 1970-01-01       | 2099-12-31     |


concept_id |            concept_name            | domain_id | vocabulary_id | concept_class_id | standard_concept | concept_code | valid_start_date | valid_end_date | invalid_reason
------------+------------------------------------+-----------+---------------+------------------+------------------+--------------+------------------+----------------+----------------
   45611093 | Sexual Dysfunctions, Psychological | Condition | MeSH          | Main Heading     |                  | D020018      | 1970-01-01       | 2099-12-31     |
-----------------
  • Thus, the SPARQL query that pulls the count data should apply a distinct over the drug, HOI, modality, and study type. It currently does not, because I include the oa id (?an), which could result in differences based on how the UNION operates.
---------------------
PREFIX ohdsi:<http://purl.org/net/ohdsi#>
PREFIX oa:<http://www.w3.org/ns/oa#>
PREFIX meddra:<http://purl.bioontology.org/ontology/MEDDRA/>
PREFIX ncbit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX poc: <http://purl.org/net/nlprepository/ohdsi-pubmed-semmed-poc#>

SELECT count(distinct ?an) ?drug ?hoi ?modality ?studyType
FROM <http://purl.org/net/nlprepository/ohdsi-pubmed-semmed-poc>
WHERE {  
  ?an a ohdsi:SemMedDrugHOIAnnotation;        
    oa:hasTarget ?target;    
    oa:hasBody ?body.   

  ?target ohdsi:MeshStudyType ?studyType.   
  ?body poc:modality ?modality.   

  {?body ohdsi:ImedsDrug ?drug.}
    UNION    {     ?body ohdsi:adeAgents ?agents.     ?agents ohdsi:ImedsDrug ?drug.   }
  {?body ohdsi:ImedsHoi ?hoi.}
    UNION    {     ?body ohdsi:adeEffects ?effects.     ?effects ohdsi:ImedsHoi ?hoi.   }
 }
------------

So, I will have to fix this, but there is no time before the F2F. Until I can run this corrected query and reload the SemMed data, I will remove duplicates at the WebAPI level:

-----------------------
PREFIX ohdsi:<http://purl.org/net/ohdsi#>
PREFIX oa:<http://www.w3.org/ns/oa#>
PREFIX meddra:<http://purl.bioontology.org/ontology/MEDDRA/>
PREFIX ncbit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX poc: <http://purl.org/net/nlprepository/ohdsi-pubmed-semmed-poc#>

SELECT count(distinct ?drug ?hoi ?modality ?studyType)
FROM <http://purl.org/net/nlprepository/ohdsi-pubmed-semmed-poc>
WHERE {  
  ?an a ohdsi:SemMedDrugHOIAnnotation;        
    oa:hasTarget ?target;    
    oa:hasBody ?body.   

  ?target ohdsi:MeshStudyType ?studyType.   
  ?body poc:modality ?modality.   

  {?body ohdsi:ImedsDrug ?drug.}
    UNION    {     ?body ohdsi:adeAgents ?agents.     ?agents ohdsi:ImedsDrug ?drug.   }
  {?body ohdsi:ImedsHoi ?hoi.}
    UNION    {     ?body ohdsi:adeEffects ?effects.     ?effects ohdsi:ImedsHoi ?hoi.   }
 }
------------
  • Btw, the reason for the difference in counts is that there is one article tagged with UMLS C0020594 (hypoactive sexual desire disorder) that gets mapped to MeSH D020018. Right now it gets picked up by one oa but not the other. This will be merged in the distinct set when the query is corrected.
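
For the interim fix, one possible shape of the duplicate removal is sketched below in Python. This is not the WebAPI code, just an illustration of the idea: collapse records that share EVIDENCE, HOI, and MODALITY into one. Keeping the record with the larger COUNT is my assumption for the sketch. The true distinct count cannot be recovered from the two counts alone, so the kept value is only a lower bound until the SemMed data is reloaded. The function name is hypothetical.

```python
def collapse_duplicates(records):
    """Keep one record per (EVIDENCE, HOI, MODALITY), preferring the larger COUNT.

    Assumption: the larger count is the better approximation of the merged set.
    """
    best = {}
    for rec in records:
        key = (rec["EVIDENCE"], rec["HOI"], rec["MODALITY"])
        if key not in best or rec["COUNT"] > best[key]["COUNT"]:
            best[key] = rec
    # Dicts preserve insertion order, so output follows first appearance of each key.
    return list(best.values())


# The mismatched pair from the example response quoted earlier in this comment.
sample = [
    {"EVIDENCE": "MEDLINE_SemMedDB_Other", "MODALITY": True,
     "HOI": "436666", "COUNT": 47},
    {"EVIDENCE": "MEDLINE_SemMedDB_Other", "MODALITY": True,
     "HOI": "436666", "COUNT": 46},
]

collapsed = collapse_duplicates(sample)
print(collapsed)
```

Note that the surviving record keeps whichever linkout it arrived with, so the linkout may not cover every article until the corrected query is run.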
