another rework ...
summary:
- removed latest access, added modified and version
- more robust parsing using rapper
- removed all EasyRdf Graph related stuff, because it threw exception
  if data was invalid (e.g. value = 1.0.1 and datatype = decimal)
  - replaced it with a simplified Graph implementation
- and much more
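
The invalid-data case named in the summary (value `1.0.1` typed as decimal) can be caught before a literal reaches a strict parser. A minimal sketch, assuming the XSD lexical form for `xsd:decimal`; `isValidXsdDecimal` is a hypothetical helper for illustration and not part of this commit:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not from the commit): returns true if $value matches
 * the xsd:decimal lexical form, i.e. an optional sign followed by digits
 * with at most one decimal point.
 */
function isValidXsdDecimal(string $value): bool
{
    // The D modifier stops "$" from matching before a trailing newline.
    return 1 === preg_match('/^[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)$/D', $value);
}

var_dump(isValidXsdDecimal('1.0'));   // bool(true)
var_dump(isValidXsdDecimal('1.0.1')); // bool(false)
```

A check like this lets a caller skip or downgrade a malformed literal instead of aborting the whole import.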
k00ni committed Apr 17, 2024
1 parent 88cd488 commit c8f84cc
Showing 19 changed files with 3,946 additions and 2,953 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -69,7 +69,7 @@ As long as the portal is online everything is fine but as soon as it goes offlin

https://data.bioontology.org/documentation

-Related script: [scripts/bin/read-bioportal](scripts/bin/read-bioportal)
+Related script: [scripts/src/Extractor/BioPortal.php](scripts/src/Extractor/BioPortal.php)

**Notes:**
* The script first tries the frontend view of an ontology to get an RDF file; if that fails, it falls back to `ontology.links.download` (such as https://data.bioontology.org/ontologies/INFRARISK/download), which is tried second because those files can come in non-RDF formats (e.g. obo)
@@ -79,7 +79,7 @@ Related script: [scripts/bin/read-bioportal](scripts/bin/read-bioportal)

https://archivo.dbpedia.org/list

-Related script: [scripts/bin/read-dbpedia-archivo](scripts/bin/read-dbpedia-archivo)
+Related script: [scripts/src/Extractor/DBpediaArchivo.php](scripts/src/Extractor/DBpediaArchivo.php)

**Notes:**
* Used value of "Latest Timestamp" for latest access, "2020.06.10-175249" is interpreted as "2020-06-10 00:00:00"
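
The date-only interpretation described in the note can be sketched as follows. This is an illustration under stated assumptions, not the extractor's actual code: `archivoTimestampToDate` is a hypothetical name, and the input is assumed to always start with a `YYYY.MM.DD` prefix.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: keeps only the date part of an Archivo
 * "Latest Timestamp" value and sets the time to midnight.
 */
function archivoTimestampToDate(string $timestamp): string
{
    // The leading "!" resets all fields that are not parsed, so the time is 00:00:00.
    $date = DateTimeImmutable::createFromFormat('!Y.m.d', substr($timestamp, 0, 10));

    if (false === $date) {
        throw new InvalidArgumentException('Unexpected timestamp format: '.$timestamp);
    }

    return $date->format('Y-m-d H:i:s');
}

echo archivoTimestampToDate('2020.06.10-175249'), PHP_EOL; // 2020-06-10 00:00:00
```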
@@ -88,7 +88,7 @@ Related script: [scripts/bin/read-dbpedia-archivo](scripts/bin/read-dbpedia-arch

https://lov.linkeddata.es/dataset/lov/

-Related script: [scripts/bin/read-linked-open-vocabularies](scripts/bin/read-linked-open-vocabularies)
+Related script: [scripts/src/Extractor/LinkedOpenVocabularies.php](scripts/src/Extractor/LinkedOpenVocabularies.php)

**Notes:**
* Used value of `dct:modified` for latest access; because the field only contains the date, the time is always set to `00:00:00`.
@@ -97,11 +97,11 @@ Related script: [scripts/bin/read-linked-open-vocabularies](scripts/bin/read-lin

https://www.ebi.ac.uk/ols4/

-Related script: [scripts/bin/read-linked-open-vocabularies](scripts/bin/read-ontology-lookup-service)
+Related script: [scripts/src/Extractor/OntologyLookupService.php](scripts/src/Extractor/OntologyLookupService.php)

**Notes:**
* Warning: Currently ignoring all ontologies with no `fileLocation` field set in ontology configuration
-* ontology.uploaded is used for latest access
+* Field `ontology.uploaded` is used for latest access

## FAQ

5,530 changes: 3,018 additions & 2,512 deletions index.csv

Large diffs are not rendered by default.

510 changes: 255 additions & 255 deletions manually-maintained-metadata-about-ontologies.csv

Large diffs are not rendered by default.

5 changes: 3 additions & 2 deletions scripts/bin/bootstrap.php
@@ -7,12 +7,13 @@
// CSV
define(
'INDEX_CSV_HEAD_STRING',
-'"ontology title","ontology iri","summary","authors","contributors","license information","project page","source page","latest json-ld file","latest n3 file","latest ntriples file","latest rdf/xml file","latest turtle file","latest access","source title","source url"'
+'"ontology title","ontology iri","summary","authors","contributors","license information","project page","source page","latest json-ld file","latest n3 file","latest ntriples file","latest rdf/xml file","latest turtle file","modified","version","source title","source url"'
);

define('MANUALLY_MAINTAINED_METADATA_ABOUT_ONTOLOGIES_CSV', 'manually-maintained-metadata-about-ontologies.csv');

-define('SQLITE_FILE_PATH', SCRIPTS_DIR_PATH.'var'.DIRECTORY_SEPARATOR.'temporary-index.db');
+define('VAR_FOLDER_PATH', SCRIPTS_DIR_PATH.'var'.DIRECTORY_SEPARATOR);
+define('SQLITE_FILE_PATH', VAR_FOLDER_PATH.'index.db');

// properties usually used to determine a title
$titleProperties = [
2 changes: 1 addition & 1 deletion scripts/bin/renew_index.php
@@ -4,6 +4,7 @@

use App\Cache;
use App\Command\MergeInManuallyMaintainedMetadata;
+use App\Extractor\BioPortal;
use App\Extractor\DBpediaArchivo;
use App\Extractor\LinkedOpenVocabularies;
use App\Extractor\OntologyLookupService;
@@ -22,7 +23,6 @@
(new DBpediaArchivo($cache, $dataFactory, $temporaryIndex))->run();
(new OntologyLookupService($cache, $dataFactory, $temporaryIndex))->run();
(new BioPortal($cache, $dataFactory, $temporaryIndex))->run();
-return;

// finalize temporary index and write index.csv
(new MergeInManuallyMaintainedMetadata($cache, $dataFactory, $temporaryIndex))->run();
1 change: 0 additions & 1 deletion scripts/composer.json
@@ -13,7 +13,6 @@
"sweetrdf/in-memory-store-sqlite": "^1.1.0",
"sweetrdf/quick-rdf": "^2.0",
"sweetrdf/quick-rdf-io": "^1.0",
-"sweetrdf/rdfinterface2easyrdf": "^0.3.1",
"symfony/cache": "^7"
},
"require-dev": {
12 changes: 8 additions & 4 deletions scripts/src/Cache.php
@@ -29,19 +29,23 @@ private function getCacheInstance(string $namespace): AbstractAdapter

private function createSimplifiedFilename(string $fileUrl): string
{
-return preg_replace('/[^a-z0-9\-_]/ism', '_', $fileUrl);
+return (string) preg_replace('/[^a-z0-9\-_]/ism', '_', $fileUrl);
}

/**
+* @return non-empty-string
+*
* @throws \Exception
*/
public function getCachedFilePathForFileUrl(string $fileUrl): string
{
$fileRes = $this->getLocalFileResourceForFileUrl($fileUrl);

if (is_resource($fileRes)) {
// generate simplified filename for local storage
-return $this->filesFolder.$this->createSimplifiedFilename($fileUrl);
+/** @var non-empty-string */
+$result = $this->filesFolder.$this->createSimplifiedFilename($fileUrl);
+return $result;
} else {
throw new Exception('Got no file resource for '.$fileUrl);
}
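
To illustrate what the pattern in `createSimplifiedFilename` does to a URL, and why the added `(string)` cast matters, here is a stand-alone sketch. `simplifyFilename` is a hypothetical free function used only for this example; the pattern itself is copied from the hunk.

```php
<?php

declare(strict_types=1);

// Stand-alone copy of the replacement pattern from the hunk; the function
// name is hypothetical and exists only for this illustration.
function simplifyFilename(string $fileUrl): string
{
    // preg_replace() returns null on a regex error, so the cast guarantees
    // a string return value (an empty string in the error case).
    return (string) preg_replace('/[^a-z0-9\-_]/ism', '_', $fileUrl);
}

echo simplifyFilename('https://data.bioontology.org/ontologies/INFRARISK/download'), PHP_EOL;
// https___data_bioontology_org_ontologies_INFRARISK_download
```

Because of the `i` modifier, upper-case letters are kept; every other character outside `a-z`, `0-9`, `-` and `_` becomes an underscore.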
@@ -66,7 +70,7 @@ public function getLocalFileResourceForFileUrl(string $fileUrl)
// timeout until connected
$curl->setConnectTimeout(5);
// time of curl to execute (seconds)
-$curl->setTimeout(300);
+$curl->setTimeout(3000);

$curl->setMaximumRedirects(10);
$curl->setOpt(CURLOPT_FOLLOWLOCATION, true); // follow redirects
@@ -108,7 +112,7 @@ public function sendCachedRequest(string $url, string $namespace): string
// timeout until connected
$curl->setConnectTimeout(5);
// time of curl to execute
-$curl->setTimeout(300);
+$curl->setTimeout(3000);

$curl->setMaximumRedirects(10);
$curl->setOpt(CURLOPT_FOLLOWLOCATION, true); // follow redirects
3 changes: 2 additions & 1 deletion scripts/src/Command/MergeInManuallyMaintainedMetadata.php
@@ -69,7 +69,8 @@ public function run(): void
$entry->setLatestRdfXmlFile($row[11]);
$entry->setLatestTurtleFile($row[12]);

-$entry->setLatestAccess($row[13]);
+$entry->setModified($row[13]);
+$entry->setVersion($row[14]);

$this->temporaryIndex->storeEntries([$entry]);
} elseif (is_array($entryData) && 'Manually maintained' === $entryData['source_title']) {