Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option for specifying case for identifiers #663

Merged

Conversation

chrjorgensen
Copy link
Contributor

Add an option identifierCaseto specify what case the identifiers should be converted to - like the keyword case. Possible options are preserver, upper and lower.

Copy link
Collaborator

@nene nene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good start, but needs some more work and thought.

docs/identifierCase.md Outdated Show resolved Hide resolved
@@ -506,4 +506,19 @@ export default class ExpressionFormatter {
return node.text.toLowerCase();
}
}

private showIdentifier(node: IdentifierNode): string {
if (/['"\\`]/.test(node.text[0]) || node.text.startsWith(`U&`)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a problematic way to tell the quoted and unquoted identifiers apart. When new types of quoted identifiers are added, one would need to also remember to update this code, which is way too easy to forget. For example, this code will already fail with Transact-SQL [bracket-quoted] identifiers.

The lexer already has two tokens: IDENTIFIER and QUOTED_IDENTIFIER. Currently the parser simply throws this info away, we could instead store into about this inside the IdentifierNode object.

Copy link
Contributor Author

@chrjorgensen chrjorgensen Nov 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - I had some trouble identifying quoted identifiers.
Your mention of IDENTIFIER and QUOTED_IDENTIFIER from the lexer was the key!
I followed your suggestion and changed the code to store the token type in the node for identifiers - and now only nodes with token type IDENTIFIER will be converted.

@@ -286,7 +286,7 @@ export default class ExpressionFormatter {
}

private formatIdentifier(node: IdentifierNode) {
this.layout.add(node.text, WS.SPACE);
this.layout.add(this.showIdentifier(node), WS.SPACE);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to plain identifiers we also have several identifier-like things:

  • variables
  • parameters

these usually behave just like identifiers and the only thing distinguishing them from normal identifiers is some prefix like @myvar or :myparam.

As the parser currently treats variables as identifiers, these too end up upper/lowercased. But parameters are kepts separate by the parser, so the case of these doesn't change. Should parameters also change together with identifiers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My change is only for unquoted identifiers - I think variables and parameters are outside the scope and should be handled in another PR...


import { FormatFn } from '../../src/sqlFormatter.js';

export default function supportsIdentifierCase(format: FormatFn) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are likely more tests needed here to cover:

  • that all identifiers in expression like foo.bar.baz get uppercased,
  • variables
  • parameters
  • array names (treated differently by lexer)
  • various types of identifier quoting styles

As most of these things depend on the dialect, it might be better to just add additional tests to places like supportsIdentifiers().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test for multi-part identifiers in commit 558efa3

`);
});

it('does not uppercase identifiers inside strings', () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wouldn't really call it an identifier inside a string - it's just a string.

I personally wouldn't add a separate test for this, I'd just include some strings inside the general "converts identifiers to lowercase/uppercase" test cases. But that's really a personal preference. Doesn't matter much either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String test is now included in identifier case tests and removed as separate test.

@@ -0,0 +1,54 @@
# identifierCase

Converts identifiers to upper or lowercase.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a good place to specify exactly what classifies as an identifier and what doesn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a note in commit afe5b48 - is it okay?

Copy link
Collaborator

@nene nene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better now.

Should definitely fix the handling of array[index].

Dealing with the casing of variables and parameters is I think out of scope for this PR. While I believe that this identifierCase option should also cover these, a further work is needed in the lexer and parser to make it possible to distinguish between quoted and unquoted variants of both.

I'll also have to think how should this feature interact with another feature I'd like to have: specifying the casing of functions. Function names are also identifiers. But I'd like to be able to say functionCase: "upper", identifierCase: "lower".

@@ -202,7 +208,7 @@ atomic_expression ->
array_subscript -> %ARRAY_IDENTIFIER _ square_brackets {%
([arrayToken, _, brackets]) => ({
type: NodeType.array_subscript,
array: addComments({ type: NodeType.identifier, text: arrayToken.text}, { trailing: _ }),
array: addComments({ type: NodeType.identifier, tokenType: TokenType.ARRAY_IDENTIFIER, text: arrayToken.text}, { trailing: _ }),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, currently ARRAY_IDENTIFIER tokens are treated differently from normal identifiers. This results in the following SQL:

select foo, foo[1] from tbl

being formatted as:

select
  FOO,
  foo[1]
from
  TBL

That is, when the foo column is used normally, it gets uppercased, but when used together with array-accessor operator it is not uppercased.

This difference between normal identifiers and array-identifiers is really just an internal quirk of SQL Formatter implementation. We should treat both as identifiers and change the case of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array identifiers are now being converted together with ordinary identifiers - see commit 0bb7d21


Note: An identifier is a name of a SQL object.
There are two types of SQL identifiers: ordinary identifiers and quoted identifiers.
Only ordinary identifiers are subject to be converted.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better, but according to this description I would expect the case of variables and parameters to also be converted, because they too are names of SQL objects.

In general I see all these as different kinds of identifiers:

  • schema, table and column names
  • variable names
  • function names
  • parameter names

Variables are a particularly tricky case. Some dialects like MySQL have variables with a prefix like @foo, and therefore we can easily distinguish them. Other dialects like PostgreSQL also have variables, but there is no special prefix, an idientifier foo could refer to a variable, or it could refer to a table or column.

Copy link
Contributor Author

@chrjorgensen chrjorgensen Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think I mixed up variables and parameters... by variables, do you mean SQL variables defined by CREATE VARIABLE (in DB2)? If yes, then I agree that they should also be considered an identifier and converted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, variables like that.

Copy link
Collaborator

@nene nene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@nene nene merged commit 47a30d0 into sql-formatter-org:master Nov 13, 2023
2 checks passed
@chrjorgensen chrjorgensen deleted the feature/add-identifier-case branch November 13, 2023 16:22
@karlhorky
Copy link
Contributor

karlhorky commented Nov 26, 2023

Should this case transformation be applied after the generic keywordCase transform?

If it were applied after keywordCase, then it would also resolve this issue:

Consider the following input query:

CREATE TABLE
  actors (
    ID INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    NAME VARCHAR(80) NOT NULL
  );

Currently, by using the configuration identifierCase: 'lower', it will result in the following (actors.id is lowercase, but actors.NAME is still uppercase, because NAME is also a keyword):

CREATE TABLE
  actors (
    id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    NAME VARCHAR(80) NOT NULL
  );

If the identifier transform was applied after the keywordCase transform, then any identifiers which are also keywords would also be lowercased (actors.name would be lowercase):

CREATE TABLE
  actors (
    id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    name VARCHAR(80) NOT NULL
  );

Alternatives

Alternatives from #156 (comment) would be:

  1. accepting an object for the option keywordCase, for granular configuration:
const options = {
  language: 'postgresql',
  keywordCase: {
    upper: true, // true: default case
    lower: ['name', 'varchar', 'integer'],
    preserve: ['in'],
  },
}
  1. an option keywordCaseIgnore (array), which would allow for more flexible customization (would not allow for same configurability, I prefer option 1):
const options = {
  language: 'postgresql',
  keywordCase: 'upper',
  keywordCaseIgnore: [
    // Avoid changing case of `name` fields in tables
    'name',
    // Avoid changing case of data types
    'varchar',
    'integer',
    // ...
  ],
}

@nene
Copy link
Collaborator

nene commented Nov 27, 2023

@karlhorky there is no before or after here. The formatter does not do multiple passes of applying the format. It just detects some words as keywords and some as identifiers. And then these get converted to upper/lower case if configured so. It will never be the case that a word is considered both identifier and keyword by the formatter.

@karlhorky
Copy link
Contributor

Ok understood, thanks for the clarification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants