Encoding/Collation problems with 1.5.0 SQL server #834

Closed
soetang opened this issue Aug 23, 2024 · 10 comments · Fixed by #845

soetang commented Aug 23, 2024

We have been experiencing encoding problems with odbc 1.5.0 on SQL Server. Everything works fine with odbc 1.4.2.

All our databases unfortunately use the varchar collation "Danish_Norwegian_CI_AS". With odbc 1.4.2, however, we were able to create a connection that let us correctly read from and write to the database. With odbc 1.5.0, the column names are no longer decoded correctly.

Is this a bug or can we change the connection settings so that it works correctly?

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  , encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)


DBI::dbWriteTable(
  conn
  , name = test_table
  , df
  , field.types = c('var_char_col' = 'varchar(5)')
  , overwrite = TRUE
)

db_data <- DBI::dbReadTable(conn, test_table) %>%
  tibble::as_tibble()

db_data
#> # A tibble: 4 × 2
#>   var_char_col bÃ.vs
#>   <chr>        <dbl>
#> 1 kanin            1
#> 2 ræven            1
#> 3 ålens            1
#> 4 ørred            1

Expected value:

db_data
#> # A tibble: 4 × 2
#>   var_char_col  bøvs
#>   <chr>        <dbl>
#> 1 kanin            1
#> 2 ræven            1
#> 3 ålens            1
#> 4 ørred            1

If we remove the encoding parameter, we get correct column names; however, the character vector no longer fits within the varchar(5) data type even though the values contain only 5 characters:

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  # , encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)


DBI::dbWriteTable(
  conn
  , name = test_table
  , df
  , field.types = c('var_char_col' = 'varchar(5)')
  , overwrite = TRUE
)
#> Error in eval(expr, envir, enclos): nanodbc/nanodbc.cpp:1783: 00000
#> [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation

db_data <- DBI::dbReadTable(conn, test_table) %>%
  tibble::as_tibble()

db_data
#> # A tibble: 0 × 2
#> # ℹ 2 variables: var_char_col <chr>, bøvs <dbl>
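A quick local check (no database needed) suggests why varchar(5) truncates here: the column limit counts bytes in the column's code page, and 'æ', 'å', and 'ø' each take two bytes in UTF-8 but only one in latin1. This is a sketch of the suspected mechanism, not a confirmed diagnosis:

```r
# 'ræven' is 5 characters, but 6 bytes when encoded as UTF-8,
# and 5 bytes again once converted to latin1.
x <- "r\u00e6ven"                              # "ræven", UTF-8 in R
nchar(x, type = "chars")                       # 5 characters
nchar(x, type = "bytes")                       # 6 bytes as UTF-8
lat <- iconv(x, from = "UTF-8", to = "latin1")
nchar(lat, type = "bytes")                     # 5 bytes as latin1
```

If the UTF-8 bytes are sent to the server unconverted, varchar(5) sees 6 bytes and raises the right-truncation error above.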
@soetang soetang changed the title Encoding/Collation problems with 1.5.0 Encoding/Collation problems with 1.5.0 SQL server Aug 23, 2024
simonpcouch (Collaborator) commented:

One encoding-specific set of changes that made it into 1.5.0 was #531, and it looks likely to be the culprit. @shrektan, any thoughts on which connection parameters folks should now set to see the correct behavior? Or is this unexpected?


detule commented Aug 24, 2024

Hi there.

While waiting for @shrektan to chime in: I am unable to replicate the failure in your second example on my setup. Can you think of how your setup and mine differ?

> df <- data.frame(
    var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
    bøvs = 1
  )
> df
  var_char_col bøvs
1        kanin    1
2        ræven    1
3        ålens    1
4        ørred    1
> DBI::dbWriteTable(conn, name = "#deleteme", df, field.types = c('var_char_col' = 'varchar(5)'), overwrite = TRUE)
> DBI::dbReadTable(conn, name = "#deleteme")
  var_char_col bøvs
1        kanin    1
2        ræven    1
3        ålens    1
4        ørred    1

 > conn@encoding
[1] ""
> conn@info
...
$dbms.name
[1] "Microsoft SQL Server"
$db.version
[1] "15.00.4249"
...
$drivername
[1] "libmsodbcsql-17.10.so.6.1"
...
$driver.version
[1] "17.10.0006"
> sessionInfo()$otherPkgs$odbc$Version
[1] "1.5.0.9000"
> sessionInfo()$R.version[c("platform", "arch", "version.string")]
$platform
[1] "x86_64-pc-linux-gnu"

$arch
[1] "x86_64"

$version.string
[1] "R version 4.2.1 (2022-06-23)"

Completely shooting from the hip: are you working against an older version of SQL Server? I think the ability to store UTF-8 encoded characters in VARCHAR columns is only supported starting with SQL Server 2019. If your version is older and you are not specifying an encoding in the connection arguments (as in your second example), I suspect you would have success declaring the column NVARCHAR. Only guessing at this point, however.

I need to think about how this changed between 1.4.2 and 1.5.0, and how/why your first example worked then but no longer works today.


detule commented Aug 24, 2024

Never mind, @soetang: ignore the note above. After creating a catalog with your collation, I am able to replicate your second example as well.


soetang commented Aug 24, 2024

Thanks for looking into it. I tested some more, and in theory I can get it to work by converting the data to and from latin1 myself and by not providing the encoding argument. This, however, is what I would expect to happen internally when I provide the encoding.

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  #, encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)
df$var_char_col <- stringi::stri_encode(df$var_char_col, to = 'LATIN1')

DBI::dbWriteTable(
  conn = conn
  , name = test_table
  , value = df
  , field.types = c('var_char_col' = 'varchar(5)')
  , overwrite = TRUE
)

db_data <- DBI::dbReadTable(conn, test_table) 

db_data
#>   var_char_col bøvs
#> 1        kanin    1
#> 2     r\xe6ven    1
#> 3     \xe5lens    1
#> 4     \xf8rred    1

db_data$var_char_col <- stringi::stri_encode(
  db_data$var_char_col
  , from = 'latin1'
  , to = 'UTF-8'
)
db_data
#>   var_char_col bøvs
#> 1        kanin    1
#> 2        ræven    1
#> 3        ålens    1
#> 4        ørred    1

We are using SQL server 2019:

conn@info
...
#> $dbms.name
#> [1] "Microsoft SQL Server"
#> 
#> $db.version
#> [1] "15.00.4375"
....
#> $drivername
#> [1] "libmsodbcsql-17.10.so.6.1"
#> 
...
#> $driver.version
#> [1] "17.10.0006"
#> 
....


detule commented Aug 25, 2024

Thanks for the investigation.

Here are some more notes from me:

  • I think Microsoft (SQL Server, OEM driver) is somewhat inconsistent, especially regarding the AutoTranslate parameter.
    • With it set to no, VARCHAR/CHAR columns are returned across the odbc membrane encoded in a code page appropriate to the collation in use, in your case Windows-1252 (or latin1). However, the column names come back as UTF-8.
    • With it set to yes, both data and column names come back as UTF-8.
      • This means that, I think, when reading data you can omit both the encoding and the AutoTranslate parameters from your connection specification and hopefully everything will work as expected. When writing data, you should keep the encoding parameter; I think AutoTranslate is inconsequential for writes.
  • Having said that, I think we should take another look at the part of Encode non-ASCII Column Names to UTF-8 #531 that made it so that when the encoding parameter is specified, it is used to convert both column names and column contents (previously, only column contents). I also tried to replicate some of the examples from The Encoding of the column names is not UTF-8 #430, but was unable to; I think I am missing some important pieces of @shrektan's environment.
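The mangled column name in the original post is consistent with a double encoding: if the driver already returns the name as UTF-8 and those bytes are then converted from latin1 to UTF-8 again, each multi-byte character is expanded a second time. A minimal local sketch of that suspected mechanism (an illustration, not the library's actual code path):

```r
# Take the UTF-8 bytes of "bøvs" and reinterpret them as latin1,
# as a latin1-to-UTF-8 conversion of an already-UTF-8 name would.
nm <- "b\u00f8vs"                                 # "bøvs", UTF-8 in R
mangled <- iconv(nm, from = "latin1", to = "UTF-8")
mangled                                           # "bÃ¸vs"
```

The resulting "bÃ¸vs" matches the "bÃ.vs" header shown in the first tibble above (tibble prints the cedilla as a dot).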


soetang commented Aug 25, 2024

So I tried writing data while only including the encoding parameter (without AutoTranslate). This did not produce a correct result:

[screenshot omitted]

At the same time, I reran my second example with varchar(6). While from R both writing and retrieving look correct, the actual data in the database is incorrect:

[screenshot omitted]

R result:

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  # , encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)

DBI::dbWriteTable(
  conn
  , name = test_table
  , df
  , field.types = c('var_char_col' = 'varchar(6)')
  , overwrite = TRUE
)

db_data <- DBI::dbReadTable(conn, test_table)

db_data
#>   var_char_col bøvs
#> 1        kanin    1
#> 2        ræven    1
#> 3        ålens    1
#> 4        ørred    1


soetang commented Aug 25, 2024

However, I can confirm that reading the data works fine without the encoding and AutoTranslate arguments.


detule commented Sep 6, 2024

Thanks @soetang

Do you have the ability to test development branches? If so, can you check whether #845 restores the prior behavior (everything working with the encoding and AutoTranslate arguments as in your original post)?


soetang commented Sep 6, 2024

I am off for the next few days, but I have asked my team if they can test it. Thanks a lot.


soetang commented Sep 10, 2024

@detule I can confirm that it works with your branch.
