Encoding/Collation problems with 1.5.0 SQL server #834

Closed
soetang opened this issue Aug 23, 2024 · 10 comments · Fixed by #845

soetang commented Aug 23, 2024

We have been experiencing encoding problems with odbc 1.5.0 on SQL Server. Everything works fine with odbc 1.4.2.

All our databases unfortunately use the varchar collation "Danish_Norwegian_CI_AS". With odbc 1.4.2, however, we were able to create a connection that let us correctly read from and write to the database. With odbc 1.5.0, the column names are no longer decoded correctly.

Is this a bug or can we change the connection settings so that it works correctly?

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  , encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)


DBI::dbWriteTable(
  conn
  , name = test_table
  , df
  , field.types = c('var_char_col' = 'varchar(5)')
  , overwrite = TRUE
)

db_data <- DBI::dbReadTable(conn, test_table) %>%
  tibble::as_tibble()

db_data
#> # A tibble: 4 × 2
#>   var_char_col bÃ.vs
#>   <chr>        <dbl>
#> 1 kanin            1
#> 2 ræven            1
#> 3 ålens            1
#> 4 ørred            1

Expected value:

db_data
#> # A tibble: 4 × 2
#>   var_char_col  bøvs
#>   <chr>        <dbl>
#> 1 kanin            1
#> 2 ræven            1
#> 3 ålens            1
#> 4 ørred            1

If we remove the encoding parameter, we get correct column names; however, the character vector no longer fits within the varchar(5) data type even though the values contain only 5 characters:

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  # , encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)


DBI::dbWriteTable(
  conn
  , name = test_table
  , df
  , field.types = c('var_char_col' = 'varchar(5)')
  , overwrite = TRUE
)
#> Error in eval(expr, envir, enclos): nanodbc/nanodbc.cpp:1783: 00000
#> [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation

db_data <- DBI::dbReadTable(conn, test_table) %>%
  tibble::as_tibble()

db_data
#> # A tibble: 0 × 2
#> # ℹ 2 variables: var_char_col <chr>, bøvs <dbl>
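A quick local check (no database needed) suggests why varchar(5) truncates here: the column limit counts bytes in the column's code page, and 'æ', 'å', and 'ø' each take two bytes in UTF-8 but only one in latin1. This is a sketch of the suspected mechanism, not a confirmed diagnosis:

```r
# 'ræven' is 5 characters, but 6 bytes when encoded as UTF-8,
# and 5 bytes again once converted to latin1.
x <- "r\u00e6ven"                              # "ræven", UTF-8 in R
nchar(x, type = "chars")                       # 5 characters
nchar(x, type = "bytes")                       # 6 bytes as UTF-8
lat <- iconv(x, from = "UTF-8", to = "latin1")
nchar(lat, type = "bytes")                     # 5 bytes as latin1
```

If the UTF-8 bytes are sent to the server unconverted, varchar(5) sees 6 bytes and raises the right-truncation error above.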
@soetang soetang changed the title Encoding/Collation problems with 1.5.0 Encoding/Collation problems with 1.5.0 SQL server Aug 23, 2024
simonpcouch (Collaborator) commented:

One encoding-specific set of changes that made it into 1.5.0 was #531, and it looks likely to be the culprit. @shrektan, any thoughts on which connection parameters folks should now set to see the correct behavior? Or is this unexpected?


detule commented Aug 24, 2024

Hi there.

While waiting for @shrektan to chime in: I am unable to replicate the failure in your second example on my setup. Can you think of how your setup and mine differ?

> df <- data.frame(
    var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
    bøvs = 1
  )
> df
  var_char_col bøvs
1        kanin    1
2        ræven    1
3        ålens    1
4        ørred    1
> DBI::dbWriteTable(conn, name = "#deleteme", df, field.types = c('var_char_col' = 'varchar(5)'), overwrite = TRUE)
> DBI::dbReadTable(conn, name = "#deleteme")
  var_char_col bøvs
1        kanin    1
2        ræven    1
3        ålens    1
4        ørred    1

 > conn@encoding
[1] ""
> conn@info
...
$dbms.name
[1] "Microsoft SQL Server"
$db.version
[1] "15.00.4249"
...
$drivername
[1] "libmsodbcsql-17.10.so.6.1"
...
$driver.version
[1] "17.10.0006"
> sessionInfo()$otherPkgs$odbc$Version
[1] "1.5.0.9000"
> sessionInfo()$R.version[c("platform", "arch", "version.string")]
$platform
[1] "x86_64-pc-linux-gnu"

$arch
[1] "x86_64"

$version.string
[1] "R version 4.2.1 (2022-06-23)"

Completely shooting from the hip: are you working against an older version of SQL Server? I think the ability to store UTF-8 encoded characters in VARCHAR columns is only supported starting with SQL Server 2019. If your version is older and you are not specifying an encoding in the connection arguments (as in your second example), I suspect you would have success declaring the column NVARCHAR. Only guessing at this point, however.

I need to think about how this changed between 1.4.2 and 1.5.0, and how/why your first example worked then but no longer works today.


detule commented Aug 24, 2024

Never mind, @soetang: ignore the note above. After creating a catalog with your collation, I am able to replicate your second example as well.


soetang commented Aug 24, 2024

Thanks for looking into it. I tested some more, and in theory I can get it to work by converting the data to and from latin1 myself and by not providing the encoding argument. This, however, is what I would expect to happen internally when I provide the encoding.

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  #, encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)
df$var_char_col <- stringi::stri_encode(df$var_char_col, to = 'LATIN1')

DBI::dbWriteTable(
  conn = conn
  , name = test_table
  , value = df
  , field.types = c('var_char_col' = 'varchar(5)')
  , overwrite = TRUE
)

db_data <- DBI::dbReadTable(conn, test_table) 

db_data
#>   var_char_col bøvs
#> 1        kanin    1
#> 2     r\xe6ven    1
#> 3     \xe5lens    1
#> 4     \xf8rred    1

db_data$var_char_col <- stringi::stri_encode(
  db_data$var_char_col
  , from = 'latin1'
  , to = 'UTF-8'
)
db_data
#>   var_char_col bøvs
#> 1        kanin    1
#> 2        ræven    1
#> 3        ålens    1
#> 4        ørred    1

We are using SQL server 2019:

conn@info
...
#> $dbms.name
#> [1] "Microsoft SQL Server"
#> 
#> $db.version
#> [1] "15.00.4375"
....
#> $drivername
#> [1] "libmsodbcsql-17.10.so.6.1"
#> 
...
#> $driver.version
#> [1] "17.10.0006"
#> 
....


detule commented Aug 25, 2024

Thanks for the investigation.

Here are some more notes from me:

  • I think Microsoft (SQL Server, OEM driver) is somewhat inconsistent, especially regarding the AutoTranslate parameter.
    • With it set to no, VARCHAR/CHAR columns are returned across the odbc membrane encoded in a code page appropriate to the collation in use, in your case Windows-1252 (or latin1). However, the column names come back as UTF-8.
    • With it set to yes, both data and column names come back as UTF-8.
      • This means that, I think, when reading data you can omit both the encoding and the AutoTranslate parameters from your connection specification and hopefully everything will work as expected. When writing data, you should keep the encoding parameter; I think AutoTranslate is inconsequential for writes.
  • Having said that, I think we should take another look at the part of Encode non-ASCII Column Names to UTF-8 #531 that made it so that when the encoding parameter is specified, it is used to convert both column names and column contents (previously, only column contents). I also tried to replicate some of the examples from The Encoding of the column names is not UTF-8 #430, but was unable to; I think I am missing some important pieces of @shrektan's environment.
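The mangled column name in the original post is consistent with a double encoding: if the driver already returns the name as UTF-8 and those bytes are then converted from latin1 to UTF-8 again, each multi-byte character is expanded a second time. A minimal local sketch of that suspected mechanism (an illustration, not the library's actual code path):

```r
# Take the UTF-8 bytes of "bøvs" and reinterpret them as latin1,
# as a latin1-to-UTF-8 conversion of an already-UTF-8 name would.
nm <- "b\u00f8vs"                                 # "bøvs", UTF-8 in R
mangled <- iconv(nm, from = "latin1", to = "UTF-8")
mangled                                           # "bÃ¸vs"
```

The resulting "bÃ¸vs" matches the "bÃ.vs" header shown in the first tibble above (tibble prints the cedilla as a dot).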


soetang commented Aug 25, 2024

So I tried writing data while only including the encoding parameter (without AutoTranslate). This did not produce a correct result:

[screenshot omitted]

At the same time, I reran my second example with varchar(6). While from R both writing and retrieving look correct, the actual data in the database is incorrect:

[screenshot omitted]

R result:

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  # , encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)

DBI::dbWriteTable(
  conn
  , name = test_table
  , df
  , field.types = c('var_char_col' = 'varchar(6)')
  , overwrite = TRUE
)

db_data <- DBI::dbReadTable(conn, test_table)

db_data
#>   var_char_col bøvs
#> 1        kanin    1
#> 2        ræven    1
#> 3        ålens    1
#> 4        ørred    1


soetang commented Aug 25, 2024

However, I can confirm that reading the data works fine without the encoding and AutoTranslate arguments.


detule commented Sep 6, 2024

Thanks @soetang

Do you have the ability to test development branches? If so, can you check whether #845 restores the prior behavior (everything working with the encoding and AutoTranslate arguments as in your original post)?


soetang commented Sep 6, 2024

I am off for the next few days, but I have asked my team if they can test it. Thanks a lot.


soetang commented Sep 10, 2024

@detule I can confirm that it works with your branch.
