Target sample use cases #1
Comments
@yarikoptic, thanks for starting this thread. datalad/datalad#2257 seems to contain some nice examples of queries that might be relevant to users, and I would love to hear more. From our conversation and from what I can tell, there are likely two types of searches
As I mentioned, I am experimenting with a CouchDB/NoSQL database to approach dataset searchability. With the provided APIs, the second need appears to be feasible (with some limitations), although the search condition interface seems a bit more complex than a non-technical user can handle. Your example of
Regardless of what types of search we have to handle, metadata extraction and distillation are necessary to make these searches practical: this process narrows down the searchable information to a small memory footprint so it can be processed quickly. Here are some of my examples using CouchDB + JSON-encoded metadata to perform the
Approach 1. I can use the
cat << 'EOF' | curl -X POST -H "Content-Type: application/json" --data-binary @- https://neurojson.io:7777/openneuro/_find
{
  "selector": {
    "dataset_description\\.json.Authors": {
      "$elemMatch": {
        "$regex": "[hH]axby"
      }
    }
  },
  "fields": [
    "_id",
    "dataset_description\\.json.Authors"
  ],
  "limit": 5,
  "skip": 0
}
EOF
This query needs to search all documents in the openneuro database (~1 GB of searchable metadata content), so it takes about 5-6 s per query. It is also programmable, so you can combine it with other conditions. But as I said, the user interface is a bit complex; I need to think of a front-end for this query.
Approach 2. Perform metadata preprocessing/distillation and search in the distilled data (much smaller). In my example, CouchDB uses a mechanism called a design document that lets database managers extract common query-relevant data into a small document (called a view), and searching the views can be very fast. In my openneuro CouchDB database, I created a view with some simple aggregated dataset-level metadata, such as
curl -s -X GET https://neurojson.io:7777/openneuro/_design/qq/_view/dbinfo | jq '.rows | .[] | select(.value.info.Authors | .[]? | match("Haxby";"i") )'
the curl command downloads the entire
Approach 3. Coarse-grained full-text search can also be done in the design document views, given that they are relatively small and in machine-readable JSON form.
curl -s -X GET https://neurojson.io:7777/openneuro/_design/qq/_view/dbinfo -o openneuro_design.json
cat openneuro_design.json | jq -c '.rows | .[] | {(.id): .value}' | grep -i 'haxby'
here, the
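For anyone who would rather script these searches than chain curl and jq, here is a minimal Python sketch of Approaches 1 and 2. It is only an illustration, not the code behind the server: `build_find_body` and `rows_matching_author` are hypothetical helper names, and `SAMPLE_VIEW` is made-up data whose shape (`rows[].id`, `rows[].value.info.Authors`) is inferred from the jq filter above. Nothing here touches the network.

```python
import json
import re


def build_find_body(field, pattern, limit=5, skip=0):
    """Build a request body shaped like the Approach 1 query, for any array field."""
    return {
        "selector": {field: {"$elemMatch": {"$regex": pattern}}},
        "fields": ["_id", field],
        "limit": limit,
        "skip": skip,
    }


def rows_matching_author(view, name):
    """Approach 2 on the client side: ids of rows whose Authors list matches `name`."""
    needle = re.compile(re.escape(name), re.IGNORECASE)
    hits = []
    for row in view.get("rows", []):
        info = (row.get("value") or {}).get("info") or {}
        authors = info.get("Authors")
        # Like jq's `.[]?`, skip anything that is not a list of author names.
        if not isinstance(authors, list):
            continue
        if any(needle.search(str(author)) for author in authors):
            hits.append(row["id"])
    return hits


# Hypothetical stand-in for the downloaded view (assumption, not real data).
SAMPLE_VIEW = {
    "rows": [
        {"id": "ds-a", "value": {"info": {"Authors": ["J. V. Haxby", "A. Author"]}}},
        {"id": "ds-b", "value": {"info": {"Authors": ["B. Author"]}}},
        {"id": "ds-c", "value": {"info": {}}},
    ]
}

if __name__ == "__main__":
    body = build_find_body("dataset_description\\.json.Authors", "[hH]axby")
    print(json.dumps(body, indent=2))
    print(rows_matching_author(SAMPLE_VIEW, "haxby"))
```

The printed body can be sent to `_find` in place of the heredoc, and `rows_matching_author` can be pointed at the parsed contents of `openneuro_design.json`. Unlike the jq `select`, it reports each dataset once even if several authors match.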
@yarikoptic, I was testing a metadata search prototype over the weekend. You can browse it from
Using this interface, I could search for some of the questions you asked in datalad/datalad#2257, for example
I just searched for the word "visual" in the task descriptors
The backend of this search is a SQLite database with aggregated subject-level metadata extracted from my CouchDB databases (currently I have only 3 databases: openneuro, abide-1 and abide-2), covering about 37000 subjects. An example of the subject-level metadata can be seen at https://neurojson.io:7777/openneuro/_design/qq/_view/subjects and these records are automatically extracted by CouchDB through a design document, then merged/converted to an SQL database by an hourly cron job. The total metadata this search tool processes is only about 10 MB for the 37000 subjects. This is very scalable, as CouchDB can handle many databases and everything is set up to run automatically. The only thing that needs to be refined is the design document: what type of information should be extracted and is most relevant to the search need. I tried to add the
Feel free to try it. It is fairly fast because the data size is currently quite small.
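To make the backend idea concrete, here is a rough Python sketch of loading subject-level records into SQLite and running a keyword search over task descriptions. The table layout, column names, and `SUBJECTS` records are all assumptions made up for illustration; the actual schema used by the prototype is not shown in this thread.

```python
import sqlite3

# Hypothetical subject-level records: (database, dataset, subject, task description).
SUBJECTS = [
    ("openneuro", "ds-a", "sub-01", "visual object recognition"),
    ("openneuro", "ds-b", "sub-01", "resting state"),
    ("abide-1", "site-x", "sub-07", "rest"),
]


def build_index(records):
    """Load subject-level records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subjects (db TEXT, dataset TEXT, subject TEXT, task TEXT)"
    )
    conn.executemany("INSERT INTO subjects VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn


def search_tasks(conn, keyword):
    """Return (db, dataset, subject) rows whose task text contains `keyword`."""
    cur = conn.execute(
        "SELECT db, dataset, subject FROM subjects "
        "WHERE task LIKE ? ORDER BY db, dataset, subject",
        (f"%{keyword}%",),
    )
    return cur.fetchall()


if __name__ == "__main__":
    index = build_index(SUBJECTS)
    print(search_tasks(index, "visual"))
```

SQLite's `LIKE` is case-insensitive for ASCII by default, so "Visual" and "visual" behave the same. Note that `%` and `_` inside the keyword act as wildcards in this sketch, since it does not escape them.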
Obviously, no search engine is complete if it can't handle trivial keyword/name matching, so my first test is always to search for
Haxby
e.g. here are results on ;-)
But at the other end of the spectrum there could be very specific and detailed use cases, which might require either elaborate metadata schemas or use-case-specific processing/derivatives computation/extraction/harmonization.
Might be of interest (as it evolves) for @fangq