Default encoding is UTF-8? #64

Open
InvncibiltyCloak opened this issue Jan 17, 2024 · 4 comments

Comments

@InvncibiltyCloak

First off, thanks for the great Dewesoft reader library.
I was recently using it for my datafiles which are DXD and are created on a Windows x64, en-US machine.

The units contained some Unicode characters, such as the degree symbol and the ohm sign. When I imported the file with this library, the units showed the classic Å symbol, which is the giveaway of UTF-8 bytes being decoded as if they were in a single-byte Windows code page (it looks like ISO-8859-1 is the default chosen here).
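The symptom can be reproduced in plain Python without the library. This is a minimal sketch, assuming the unit strings are stored as UTF-8 bytes; the helper name `misdecode` and the sample units are illustrative and not taken from a real DXD file. For these samples the stray characters are Â and Î; the exact character depends on the original symbol.

```python
def misdecode(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1 (the mismatch described above)."""
    return text.encode("utf-8").decode("iso-8859-1")


# Illustrative unit strings: degree sign + C, and k + Greek capital omega.
for unit in ("\u00b0C", "k\u03a9"):
    raw = unit.encode("utf-8")
    # ascii() keeps the output printable on consoles that cannot show non-ASCII text.
    print(raw, "read as ISO-8859-1:", ascii(misdecode(unit)),
          "read as UTF-8:", ascii(raw.decode("utf-8")))
```

Because ISO-8859-1 maps every byte to exactly one character, the damage is reversible: re-encoding the garbled string as ISO-8859-1 and decoding it as UTF-8 recovers the original text.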

A quick peek into the Python code showed that this is extremely easy to fix in this library: just set dwdatareader.encoding = 'utf-8' and the strings are decoded correctly.

I just wanted to file an issue to point out that DewesoftX appears to encode strings in UTF-8, so perhaps this library should change its default encoding to match?

Unfortunately, I am only a sample size of one and have not tested other locales or versions of Dewesoft, so I am not sure whether this default encoding applies everywhere. Thanks for your time!

@costerwi
Owner

Thanks for your comments! I'm glad you found it easy to override the encoding.

I cannot find the encoding documented anywhere. The default was set to ISO-8859-1 a long time ago, probably due to an observation like yours. The encoding may have changed since then. The fact that your Windows machine seems to be recording in UTF-8 is a good reason to change the assumed default to UTF-8.

@fleimgruber
Contributor

Thanks @InvncibiltyCloak for bringing this up. Changing the default encoding to UTF-8 seems reasonable. One consideration, though: should users get an option to set encodings explicitly, to maintain backwards compatibility with other encodings (e.g. ISO-8859-1) in older files and with older DEWE stacks?
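One way to keep older files readable while preferring UTF-8 is to try a list of candidate encodings in order. The sketch below is only an illustration of that idea; `decode_text` and its `encodings` parameter are hypothetical names and not part of the library.

```python
def decode_text(raw: bytes, encodings=("utf-8", "iso-8859-1")) -> str:
    """Return the first strict decode that succeeds among the candidate encodings."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Reached only if every candidate failed; ISO-8859-1 itself never fails
    # because it maps every possible byte value to a character.
    return raw.decode(encodings[-1], errors="replace")


print(ascii(decode_text(b"\xc2\xb0C")))  # valid UTF-8 bytes for degree sign + C
print(ascii(decode_text(b"\xb0C")))      # lone 0xB0 is invalid UTF-8, so ISO-8859-1 is used
```

The trade-off is that an ISO-8859-1 string whose bytes happen to form valid UTF-8 would be decoded as UTF-8, so an explicit user setting is still the more predictable option.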

@costerwi
Owner

costerwi commented Nov 3, 2024

I never had a good example to test the encoding so it is intentionally very easy for the user to specify:

import dwdatareader as dw
dw.encoding = 'utf-8'

Unfortunately, the Dewesoft library sometimes appends junk characters to the end of strings, which cause UTF-8 decoding errors in Python and fail the tests. If we change the default to UTF-8, then we need to either ask Dewesoft to fix their library or have Python ignore these decoding errors.
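Having Python ignore the errors could look roughly like the following. This is a sketch with simulated junk bytes (the actual bytes the Dewesoft library appends are not shown in this thread); it only uses the standard `errors` argument of `bytes.decode`.

```python
# Simulated string from the DLL: valid UTF-8 for "k" + omega, followed by junk bytes.
# 0xFF and 0xFE can never appear in valid UTF-8, so a strict decode fails.
raw = "k\u03a9".encode("utf-8") + b"\xff\xfe"

try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decode failed:", exc.reason)

lenient = raw.decode("utf-8", errors="ignore")   # silently drops undecodable bytes
marked = raw.decode("utf-8", errors="replace")   # substitutes U+FFFD so the damage stays visible

print(ascii(lenient), ascii(marked))
```

`errors="ignore"` gives clean strings but hides real corruption; `errors="replace"` keeps a visible marker, which may be preferable while the cause of the junk characters is still unknown.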

@fleimgruber
Contributor

Ah, I should have been more specific. I saw this global option, but wondered whether the 10 or so usages of it should all use the same encoding, e.g. when opening the file in

stat = DLL.DWOpenDataFile(self.name.encode(encoding=encoding), ctypes.byref(self.info))

vs decoding text values e.g. in

return self._unit.decode(encoding=encoding)

But that was only a guess on my part, without any evidence that different encodings actually occur.
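If the two uses ever did need different encodings, the single global could be split into two settings. The sketch below is hypothetical throughout: the `Encodings` class and the helper names are invented for illustration and do not exist in the library, and nothing in this thread shows that the DLL actually expects different encodings for paths and text.

```python
from dataclasses import dataclass


@dataclass
class Encodings:
    """Hypothetical settings object separating the two uses of the encoding."""
    path: str = "utf-8"  # for encoding the file name handed to the DLL
    text: str = "utf-8"  # for decoding strings returned by the DLL (units, names, ...)


def encode_path(name: str, enc: Encodings) -> bytes:
    """Encode a file name with the path encoding."""
    return name.encode(enc.path)


def decode_value(raw: bytes, enc: Encodings) -> str:
    """Decode a text value with the text encoding."""
    return raw.decode(enc.text)


# Example: UTF-8 file names combined with ISO-8859-1 text values from an older file.
enc = Encodings(path="utf-8", text="iso-8859-1")
print(encode_path("donn\u00e9es.dxd", enc), ascii(decode_value(b"\xb0C", enc)))
```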

> Unfortunately, the Dewesoft library sometimes appends junk characters to the end of strings which cause utf-8 decoding errors in python and fail the tests. If we change the default to utf-8 then we need to either ask Dewesoft fix their library or have python ignore these decoding errors.

That sounds annoying. I would guess that the junk characters are a result of the C lib interpreting parts of the memory as strings when it should not, i.e. a string length mismatch at that level?
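If that guess is right and the junk sits after a NUL terminator in a fixed-size buffer, cutting the bytes at the first NUL before decoding would avoid the error. This is a sketch under that assumption only; the buffer contents are simulated and `decode_c_string` is a hypothetical helper, not a library function.

```python
def decode_c_string(buf: bytes, encoding: str = "utf-8") -> str:
    """Decode only the bytes before the first NUL terminator, ignoring whatever follows."""
    terminated, _, _ = buf.partition(b"\x00")
    return terminated.decode(encoding)


# Simulated fixed-size buffer: "k" + omega in UTF-8, a NUL terminator, then leftover memory.
buffer = b"k\xce\xa9\x00\xff\xfeXY"

print(ascii(decode_c_string(buffer)))
```

This would not help if the junk bytes appear before the terminator, in which case the length reported by the C library would need to be checked instead.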
