Calculate XPT number_rows using metadata and final chunk #261

gerrycampion · 2024-04-30T15:37:36Z

Describe the issue
According to the documentation for xpt metadata, number_rows cannot be determined unless the entire dataset is read. I understand that number_rows cannot be extracted from the metadata alone, but I think it can be calculated using only the metadata and final 80-byte chunk.

Expected behavior

1. Read the header information to find variable_storage_widths and the start of the record data.
2. Calculate record_storage_width as the sum of variable_storage_widths.
3. Read the last 80-byte chunk of data to find out how much trailing ASCII blank padding there is.
4. Calculate the number of records using:
   (total_file_size - start - padding) / record_storage_width
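
The steps above can be sketched in Python roughly as follows. This is only an illustration, not anything pyreadstat or ReadStat provides: the function name estimate_xpt_rows and its parameters are hypothetical, and data_start and record_storage_width are assumed to have been obtained from the header already. One deviation from the formula as written: the sketch uses ceiling division, since trailing blanks that belong to the last record's own character values cannot be told apart from padding and would otherwise leave a remainder. It also assumes the last record is not entirely blank.

```python
import os


def estimate_xpt_rows(path, data_start, record_storage_width, chunk_size=80):
    """Estimate the number of records in an XPT file without reading all rows.

    Hypothetical helper: data_start (byte offset of the first record) and
    record_storage_width (sum of the variable storage widths) are assumed
    to come from a separate header parse that is not shown here.
    """
    if record_storage_width <= 0:
        raise ValueError("record_storage_width must be positive")

    total_file_size = os.path.getsize(path)
    data_bytes = total_file_size - data_start
    if data_bytes <= 0:
        return 0  # no record data after the headers

    # Read only the final chunk (never reaching back into the headers).
    tail_len = min(chunk_size, data_bytes)
    with open(path, "rb") as f:
        f.seek(total_file_size - tail_len)
        tail = f.read(tail_len)

    # Trailing ASCII blanks: the padding, plus possibly the blank tail
    # of the last record's character values.
    trailing_blanks = len(tail) - len(tail.rstrip(b" "))
    used_bytes = data_bytes - trailing_blanks

    # Ceiling division: a partially "used" final record still counts as one.
    # Assumption: the last record contains at least one non-blank byte.
    return -(-used_bytes // record_storage_width)
```

If the records near the end of the file consist only of blank character values, the count cannot be recovered this way, because those records look identical to padding.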


ofajardo · 2024-04-30T18:39:01Z

Thanks for the interesting suggestion. Pyreadstat is a wrapper around the C library ReadStat, so new functionality has to be implemented there before I can expose it here. I do not think that ReadStat has functions to return the start of the data or the padding, so the calculation cannot be done right now. You can suggest it over there, and once it is implemented I can wrap it and provide it in Pyreadstat.

gerrycampion · 2024-04-30T21:09:28Z

WizardMac/ReadStat#315

measiala · 2024-05-01T15:44:35Z

I believe that the number of rows is available in v8 XPORT files, at least those created by SAS v9.0401M8. This prevents readstat-created v8 XPORT files from being read by this version of SAS, because readstat does not provide the observation count but SAS expects it.

Unfortunately, this revised layout does not appear to be documented in the official v8/v9 XPORT layout documentation released by SAS in Oct 2021.

I am currently trying to test the changes necessary to the readstat code to, first of all, write the file. Then there could be some optional code to read in that metadata from the XPORT observation header.

I'll try to get this posted to the readstat site as a new issue (and ideally a PR) soonish once I finish testing "in my spare time". :)

-- Edit: This has been posted as issue #316. I included a blurb about reading in the observation count when available. This could be a partial solution to your issue.

gerrycampion changed the title ~~Calculate number_rows using metadata and final chunk~~ Calculate XPT number_rows using metadata and final chunk Apr 30, 2024

ofajardo added the requires changes in Readstat waiting for changes in the C library Readstat label May 21, 2024
