Add functionality for MR metadata reading from SAV #313

Merged

@slobodan-ilic slobodan-ilic commented Apr 24, 2024

This PR adds functionality for reading multiple response metadata from SAV files. It's been tested on a simple file that we use for a PoC in Crunch.io. It's a work in progress, and I'm available for any updates and changes it needs.

UPDATE: Running this function with a test file succeeds on only a small portion of the tries. On most tries it segfaults, and I can't track down why, so any guidance on that is more than welcome as well.

Here's an example of how I tried testing it:

#include <stdio.h>
#include <stdlib.h>
#include "readstat.h"

typedef struct
{
    const mr_set_t *sets;
    int count;
} mr_sets_context_t;

int handle_metadata(readstat_metadata_t *metadata, void *ctx)
{
    mr_sets_context_t *mr_ctx = (mr_sets_context_t *)ctx; // Recover the typed context from the opaque pointer
    mr_ctx->count = readstat_get_multiple_response_sets_length(metadata);
    mr_ctx->sets = readstat_get_mr_sets(metadata);
    return READSTAT_HANDLER_OK;
}

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_metadata_handler(parser, &handle_metadata);

    // Processing
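    mr_sets_context_t *mr_ctx = calloc(1, sizeof(mr_sets_context_t)); // zeroed, so count stays 0 if the handler never runs
    if (mr_ctx == NULL)
    {
        readstat_parser_free(parser);
        return 1;
    }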
    error = readstat_parse_sav(parser, argv[1], mr_ctx);
    printf("Found %d records\n", mr_ctx->count);
    for (int i = 0; i < mr_ctx->count; i++)
    {
        printf("MR set %d name: %s\n", i + 1, mr_ctx->sets[i].name);
        printf("type: %c\n", mr_ctx->sets[i].type);
        printf("is dichotomy: %d\n", mr_ctx->sets[i].is_dichotomy);
    }

    // Cleanup
    free(mr_ctx);
    readstat_parser_free(parser);
    if (error != READSTAT_OK)
    {
        printf("Error processing %s: %d\n", argv[1], error);
        return 1;
    }
    return 0;
}

And here's the example file:
simple_alltypes.sav.zip

This is the output from when it succeeds:

➜  ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label: 

Final counted value is: 1
count: 24
label: My multiple response set
Found 2 records
MR set 1 name: categorical_array
type: C
is dichotomy: 0
MR set 2 name: mymrset
type: D
is dichotomy: 1

And this is the output when it fails (which happens more often):

➜  ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label: 

Final counted value is: 1
count: 24
label: My multiple response set
[1]    86961 segmentation fault  DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
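
A note on the test program above: it keeps the pointer returned by readstat_get_mr_sets and dereferences it after readstat_parse_sav has returned. If the parser owns that array and releases it when parsing ends (an assumption, nothing in this thread confirms it), those reads would be unsafe regardless of any bug in the parser itself. A minimal sketch of copying the data while it is known to be valid, using a hypothetical demo_set_t stand-in rather than the PR's actual mr_set_t:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the PR's mr_set_t. */
typedef struct {
    char *name;
    char type;
    int is_dichotomy;
} demo_set_t;

/* Free `count` copied sets and the array itself. */
void demo_free_sets(demo_set_t *copies, int count)
{
    if (copies == NULL) {
        return;
    }
    for (int i = 0; i < count; i++) {
        free(copies[i].name);
    }
    free(copies);
}

/* Deep-copy `count` sets so the result stays valid after the source
 * array has been released. Returns NULL on bad input or allocation
 * failure; the caller releases the result with demo_free_sets. */
demo_set_t *demo_copy_sets(const demo_set_t *sets, int count)
{
    if (sets == NULL || count <= 0) {
        return NULL;
    }
    demo_set_t *copies = calloc((size_t)count, sizeof(*copies));
    if (copies == NULL) {
        return NULL;
    }
    for (int i = 0; i < count; i++) {
        size_t len = strlen(sets[i].name) + 1;
        copies[i].name = malloc(len);
        if (copies[i].name == NULL) {
            demo_free_sets(copies, i); /* frees only the names copied so far */
            return NULL;
        }
        memcpy(copies[i].name, sets[i].name, len);
        copies[i].type = sets[i].type;
        copies[i].is_dichotomy = sets[i].is_dichotomy;
    }
    return copies;
}
```

In a metadata handler, the copy would be made inside the callback and stored in the user context, so nothing read after parsing points into parser-owned memory.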

@evanmiller
Contributor

See failing builds also

src/spss/readstat_sav_read.c: In function ‘parse_mr_line’:
src/spss/readstat_sav_read.c:176:51: error: implicit declaration of function ‘isdigit’ [-Wimplicit-function-declaration]
  176 |             for (int i = 0; i < internal_count && isdigit(*next_part); i++) {
      |                                                   ^~~~~~~
src/spss/readstat_sav_read.c:26:1: note: include ‘<ctype.h>’ or provide a declaration of ‘isdigit’
   25 | #include "readstat_zsav_read.h"
  +++ |+#include <ctype.h>
   26 | #endif
src/spss/readstat_sav_read.c: In function ‘readstat_parse_sav’:
src/spss/readstat_sav_read.c:1882:40: error: implicit declaration of function ‘toupper’ [-Wimplicit-function-declaration]
 1882 |                     sv_name_upper[c] = toupper((unsigned char) mr.subvariables[j][c]);
      |                                        ^~~~~~~
src/spss/readstat_sav_read.c:1882:40: note: include ‘<ctype.h>’ or provide a declaration of ‘toupper’
make: *** [Makefile:2419: src/spss/libreadstat_la-readstat_sav_read.lo] Error 1

@slobodan-ilic
Contributor Author

@evanmiller I think it's fixed now.

@evanmiller
Contributor

@slobodan-ilic Thanks for addressing the build issues. However, it looks like CI Fuzzer uncovered a segfault. From a cursory read of the code, it appears that strtol performs an unprotected memory read. There is also a Windows build issue that will need to be addressed.
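
For context on the strtol remark: strtol expects a NUL-terminated string, so calling it on a slice of a record buffer that is not terminated can read past the end of the data. One way to avoid that is to parse digits against an explicit end pointer. The sketch below is illustrative only; parse_bounded_uint is a hypothetical helper, not a function from ReadStat or from this PR.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Parse a non-negative decimal integer from the half-open byte range
 * [*cursor, end). Unlike strtol, this never reads at or past `end`,
 * so the input does not need to be NUL-terminated.
 * On success, stores the value in *out, advances *cursor past the
 * digits, and returns true. Returns false if there are no digits or
 * the value would overflow int; *cursor and *out are then unchanged. */
bool parse_bounded_uint(const char **cursor, const char *end, int *out)
{
    const char *p = *cursor;
    int value = 0;

    if (p >= end || *p < '0' || *p > '9') {
        return false;
    }
    while (p < end && *p >= '0' && *p <= '9') {
        int digit = *p - '0';
        if (value > (INT_MAX - digit) / 10) {
            return false; /* value * 10 + digit would overflow int */
        }
        value = value * 10 + digit;
        p++;
    }
    *cursor = p;
    *out = value;
    return true;
}
```

The bounds check comes before every dereference, which is the property a fuzzer with AddressSanitizer tends to exercise first.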

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 019f4a6 to 850f0df Compare May 6, 2024 10:36
@slobodan-ilic
Contributor Author

Hi @evanmiller, thanks for all the input. We've done a couple of iterations and a bunch of testing on real-life survey data. All of the small bugs are taken care of, no more nasty segfaults, etc. Are you available to do one more round of review and provide some guidance?

@slobodan-ilic
Contributor Author

Another short update: I managed to run the fuzzers locally (even though following the documentation wasn't straightforward). After reproducing a crash locally, I identified and fixed the bug (which seemed obvious once discovered). It should be in much better shape now.

@evanmiller
Contributor

Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from c074f51 to ec778f5 Compare June 13, 2024 16:49
@slobodan-ilic
Contributor Author

slobodan-ilic commented Jun 13, 2024

> Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).

I'd just detected the other fuzz failure as you were writing... It should be fixed now. About Windows, though: are you referring to the errors with readstat_sav_date, the ones that mostly have VS17 in the file paths? If so, I've tried reverting my PR back to the dev branch, but these errors are still present in CI. I thought I'd avoid trying to get VS up and running, since I'm on a Mac.

Maybe I should try opening the PR against master?

update: Well, I just tried with master too (as a separate commit, which I later deleted). It was the same error about sav date, all red in the AppVeyor run, but it said the tests passed. Unknown land to me :)

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 430b30f to ec778f5 Compare June 13, 2024 17:24
@slobodan-ilic
Contributor Author

Found another issue with fuzzer, this time it's an OOM. On it, will ping when done.
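
The thread does not say what triggered the OOM. A common pattern with fuzzed input is a count field read from the file being used directly as an allocation size. A minimal sketch of one guard, assuming each item needs at least one byte of remaining input (alloc_subvariable_slots is a hypothetical name, not part of the PR):

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate `count` zero-initialized string slots, but only if `count`
 * is plausible for the input still unread. Every item is assumed to
 * occupy at least one byte, so a count larger than the remaining byte
 * total cannot be genuine and is rejected before any allocation.
 * Returns NULL when the count is rejected or the allocation fails. */
char **alloc_subvariable_slots(size_t count, size_t bytes_remaining)
{
    if (count == 0 || count > bytes_remaining) {
        return NULL;
    }
    /* calloc also rejects count * sizeof(char *) overflow. */
    return calloc(count, sizeof(char *));
}
```

Tying the limit to the remaining input keeps the check independent of any fixed maximum, so large but genuine files are not rejected.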

@evanmiller
Contributor

For Windows I am referring to the failed CI

In file included from src/spss/readstat_sav_read.c:11:
src/spss/readstat_sav_read.c: In function 'parse_mr_counted_value':
src/spss/readstat_sav_read.c:167:55: error: array subscript has type 'char' [-Werror=char-subscripts]
  167 |         for (int i = 0; i < internal_count && isdigit(*(*next_part)); i++) {
      |                                                       ^~~~~~~~~~~~~
src/spss/readstat_sav_read.c: In function 'parse_mr_line':
src/spss/readstat_sav_read.c:210:20: error: array subscript has type 'char' [-Werror=char-subscripts]
  210 |     while (isdigit(*next_part)) {
      |                    ^~~~~~~~~~
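
Both sets of build errors quoted in this thread concern the <ctype.h> classification functions: the earlier ones come from a missing include, and the Windows ones from passing a plain char. These functions take an int that must be representable as unsigned char (or equal EOF), so the usual fix is to include the header and cast the argument. A minimal sketch with hypothetical helper names, not the PR's actual code:

```c
#include <ctype.h>
#include <stddef.h>

/* Count leading ASCII digits in a NUL-terminated string.
 * The cast to unsigned char keeps the argument in the range that
 * the <ctype.h> functions accept, even where plain char is signed. */
size_t count_leading_digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) {
        n++;
    }
    return n;
}

/* Upper-case a string in place, applying the same cast to toupper. */
void upcase_in_place(char *s)
{
    for (; *s != '\0'; s++) {
        *s = (char)toupper((unsigned char)*s);
    }
}
```

The cast matters for bytes above 0x7F, which become negative on platforms where char is signed.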

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch 12 times, most recently from 815d40d to de0551c Compare June 16, 2024 09:17
@evanmiller
Contributor

I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it!

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 55ffacc to 5fceabb Compare September 19, 2024 17:55
@evanmiller
Contributor

It looks like the fuzzer detected a memory leak

https://github.com/WizardMac/ReadStat/actions/runs/10946324658/job/30395085028?pr=313
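
The thread does not identify where the leak was. For nested data such as an MR set with a list of subvariable names, leaks typically come from error paths that free the outer array but not the strings inside it. A sketch of one cleanup routine that is safe on partially built sets, using a hypothetical demo_mr_set_t rather than ReadStat's actual struct:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parsed MR set; not ReadStat's struct. */
typedef struct {
    char *name;
    char **subvariables;
    size_t num_subvars;
} demo_mr_set_t;

/* Release everything the set owns and reset it, so this is safe to
 * call on a partially built set and safe to call twice. */
void demo_mr_set_free(demo_mr_set_t *set)
{
    for (size_t i = 0; i < set->num_subvars; i++) {
        free(set->subvariables[i]);
    }
    free(set->subvariables);
    free(set->name);
    *set = (demo_mr_set_t){0};
}

/* Append a copy of `name`. Returns 0 on success, -1 on allocation
 * failure. On failure the set is left unchanged, and the caller is
 * expected to call demo_mr_set_free on its error path. */
int demo_mr_set_add_subvariable(demo_mr_set_t *set, const char *name)
{
    size_t len = strlen(name) + 1;
    char *copy = malloc(len);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, name, len);

    char **grown = realloc(set->subvariables,
                           (set->num_subvars + 1) * sizeof(char *));
    if (grown == NULL) {
        free(copy); /* do not leak the copy when the array cannot grow */
        return -1;
    }
    grown[set->num_subvars] = copy;
    set->subvariables = grown;
    set->num_subvars++;
    return 0;
}
```

Routing every exit, including early returns on malformed input, through the same free function is what keeps a leak checker quiet.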

@slobodan-ilic
Contributor Author

Thx, will check over the weekend...

@evanmiller evanmiller merged commit ba4392e into WizardMac:dev Sep 22, 2024
11 of 12 checks passed
@evanmiller
Contributor

Merged!

@slobodan-ilic
Contributor Author

wow!!! Thanks @evanmiller ❤️
