File start/end offset issue for VB file #647

Open
D3v3sh5ingh opened this issue Nov 30, 2023 · 5 comments
Labels
question Further information is requested

Comments

@D3v3sh5ingh

Hi @yruslan

Related issue: 643

The file_start_offset and file_end_offset options are not working for VB files and throw the same error as the one posted in issue 643.
I have a file with both RDW and BDW (record format VB). The file also has a header and a footer.
I want to skip the first few bytes (the header) and the last few bytes (the footer).
I am using the file_start_offset and file_end_offset options for that, but I get an error similar to the one in issue 643.

D3v3sh5ingh added the "bug" label (Something isn't working) on Nov 30, 2023
@yruslan
Collaborator

yruslan commented Nov 30, 2023

Hi @D3v3sh5ingh, what's your high level offset layout?

For example:
0 - 19 Headers (to be ignored)
20 - 23 BDW
24 - 27 RDW
28 - 99 Payload
100 - 103 RDW
...
32000 Payload
32093 Footer (to be ignored)
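A minimal sketch of that kind of layout, using only the Python standard library rather than the Spark reader itself. It assumes big-endian 2-byte lengths that include the 4-byte descriptor, and the 10-byte footer size and the helper names are made up for illustration; real files may use a different convention.

```python
import struct

def make_record(payload):
    # RDW: 2-byte big-endian length plus 2 reserved bytes.
    # Assumption: the length counts the 4-byte RDW itself.
    return struct.pack(">HH", len(payload) + 4, 0) + payload

def make_block(records):
    # BDW: same shape as the RDW, but covering the whole block.
    body = b"".join(records)
    return struct.pack(">HH", len(body) + 4, 0) + body

file_header = b"H" * 20   # bytes 0-19, outside any block
file_footer = b"F" * 10   # hypothetical footer size
block = make_block([make_record(b"A" * 72), make_record(b"B" * 72)])
data = file_header + block + file_footer

# A file-level start/end offset trims raw bytes before any BDW or RDW is read.
start_offset, end_offset = 20, 10
trimmed = data[start_offset:len(data) - end_offset]

block_len = struct.unpack(">H", trimmed[0:2])[0]
first_rdw_len = struct.unpack(">H", trimmed[4:6])[0]
first_payload = trimmed[8:4 + first_rdw_len]
```

With these sizes the BDW sits at bytes 20-23 of the original file, the first RDW at 24-27, the first payload at 28-99, and the next RDW starts at byte 100, which matches the example offsets.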

@D3v3sh5ingh
Author

Hi @yruslan
My high level layout looks like below:
BDW { RDW 45 bytes , RDW 1000 bytes, RDW 1000 bytes , RDW 1000 bytes ....}
BDW { RDW 1000 bytes .....}
......
BDW { RDW 1000 bytes...., RDW 45 bytes}

45 bytes of header and trailer are inside the BDW as shown above.
We want to remove these 45 bytes of header and trailer present in the file.

@yruslan
Collaborator

yruslan commented Nov 30, 2023

file_start_offset and file_end_offset work at the file level, e.g. in cases like:
HEADER {45 bytes} BDW { RDW 1000 bytes, RDW 1000 bytes, RDW 1000 bytes , RDW 1000 bytes ....}

Since your 45-byte header and trailer are part of the record payload, you can't remove them using these options. What you can do is add the header as a redefine segment in your copybook and then filter it out after you get the dataframe.

The copybook will look like this:

01   RECORD.
   05  HEADER.
        10 CONTENT PIC X(45).
   05 PAYLOAD REDEFINES HEADER.
   ... your payload goes at level 10 here
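To illustrate what a redefine means here, a rough Python sketch: the same record bytes are read through two overlapping layouts that both start at byte 0. The ID and NAME fields, their widths, the sample header text, and the cp037 (EBCDIC) encoding are assumptions for illustration only; the real payload fields come from your copybook.

```python
def decode_record(payload):
    # View 1: HEADER, a single 45-character text field.
    header = {"CONTENT": payload[:45].decode("cp037")}
    # View 2: PAYLOAD REDEFINES HEADER, so it overlays the same bytes from offset 0.
    # ID and NAME are hypothetical fields, not taken from the real copybook.
    body = {
        "ID": payload[0:4].decode("cp037"),
        "NAME": payload[4:24].decode("cp037"),
    }
    return {"HEADER": header, "PAYLOAD": body}

# A made-up 45-byte header record, encoded as EBCDIC text.
record = "HDR20231130".ljust(45).encode("cp037")
row = decode_record(record)
```

Both views are populated for every record; which one is meaningful depends on the record type, which is why the unwanted rows still have to be filtered out afterwards.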

@D3v3sh5ingh
Author

Hi,
This is a sample output for my file. The 45 bytes that I want to skip are only at the start and at the end of the file, not in each record.
If I don't use file_start_offset and file_end_offset, I get this dataframe as output, but with two extra records (header and trailer).
But if I use these options with 45 bytes, I get an error (length of BDW block is too big).

[Image: IMG-20231130-WA0007, screenshot of the sample output]

@yruslan
Collaborator

yruslan commented Dec 1, 2023

Options 'file_start_offset' and 'file_end_offset' only drop bytes from the beginning or the end of a file, not from the payload. This is the expected behavior.
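A small stdlib-only sketch of why a 45-byte start offset goes wrong when the header is itself a record inside the first block. It assumes big-endian lengths that include the 4-byte descriptor and a header filled with EBCDIC spaces (0x40); it does not reproduce the reader's actual validation, only the misalignment.

```python
import struct

EBCDIC_SPACE = b"\x40"

def make_record(payload):
    # RDW: 2-byte big-endian length (assumed to include the RDW) plus 2 reserved bytes.
    return struct.pack(">HH", len(payload) + 4, 0) + payload

# First block: BDW { RDW + 45-byte header, RDW + 1000 bytes, RDW + 1000 bytes }
header_record = make_record(EBCDIC_SPACE * 45)
data_records = [make_record(b"\x00" * 1000) for _ in range(2)]
body = header_record + b"".join(data_records)
data = struct.pack(">HH", len(body) + 4, 0) + body

# Without an offset, the file starts with a valid BDW.
real_block_len = struct.unpack(">H", data[0:2])[0]

# Skipping 45 raw bytes lands inside the header payload (bytes 8-52),
# so header text is now interpreted as a BDW.
skipped = data[45:]
bogus_block_len = struct.unpack(">H", skipped[0:2])[0]
```

Here the real block length is 2061, while the bytes read after the skip decode to 16448 (0x4040), far more than the 2016 bytes left in the file. That is consistent with an error about the BDW block being too big.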

There are no options that allow dropping bytes from inside records, so possible solutions are:

  • If you need to keep these special 45-byte records, you can use the modified copybook solution above.
  • (probably your case) If you want to ignore these special 45-byte records, just remove these records in post-processing, e.g. df.filter(col("COL1").isNotNull)
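The second option can be sketched without Spark as follows: read every record, including the 45-byte ones, then drop the rows whose field could not be populated. The position of COL1 (payload bytes 100-109) and the helper names are hypothetical, and the sketch assumes, as the suggested filter implies, that the field comes back empty for the short records.

```python
import struct

def make_record(payload):
    # RDW length assumed big-endian and inclusive of the 4-byte RDW.
    return struct.pack(">HH", len(payload) + 4, 0) + payload

def make_block(records):
    body = b"".join(records)
    return struct.pack(">HH", len(body) + 4, 0) + body

def iter_payloads(data):
    # Walk block by block, then record by record inside each block.
    pos = 0
    while pos < len(data):
        block_end = pos + struct.unpack(">H", data[pos:pos + 2])[0]
        pos += 4
        while pos < block_end:
            rec_len = struct.unpack(">H", data[pos:pos + 2])[0]
            yield data[pos + 4:pos + rec_len]
            pos += rec_len

def parse(payload):
    # Hypothetical field: COL1 occupies payload bytes 100-109.
    # Records too short to contain it get None, like a null column.
    col1 = payload[100:110] if len(payload) >= 110 else None
    return {"COL1": col1}

data = (
    make_block([make_record(b"H" * 45), make_record(b"1" * 1000)])
    + make_block([make_record(b"2" * 1000), make_record(b"T" * 45)])
)
rows = [parse(p) for p in iter_payloads(data)]

# Post-processing step: keep only rows where COL1 is present.
kept = [r for r in rows if r["COL1"] is not None]
```

The header and trailer are still read as records, so the block and record lengths stay consistent; they are only discarded once the rows exist.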

yruslan added the "question" label (Further information is requested) and removed the "bug" label (Something isn't working) on Dec 1, 2023