
Throw an error when a unicode character is parsed in an identifier #2129

Closed
wants to merge 1 commit into from

Conversation

picnoir
Member

@picnoir picnoir commented May 1, 2018

Fixes #1374.

I am a bit unsure about my approach here, but I couldn't come up with anything better... I am open to any suggestion :)

Before

nix-repl> let a é 1; in a + 1  
       ...

nix-repl>  let a = 1; in aé + 1  
1

After

nix-repl> let a é 1; in a + 1      
error: Cannot use the unicode character 'é' in a nix expression.

nix-repl> let a = 1; in aé + 1  
error: Cannot use the unicode character 'é' in a nix expression.

@edolstra
Member

edolstra commented May 1, 2018

Hm, I don't understand why the lexer is currently accepting those characters at all. The only token type that matches is ANY (passing the character on to the parser), but nothing in parser.y accepts them either. I would expect syntax error, unexpected $undefined, as in

nix-repl> a & b
error: syntax error, unexpected $undefined, expecting $end, at (string):1:3

@picnoir
Member Author

picnoir commented May 2, 2018

I don't get it either.

Here's the bison trace of the master branch:

nix-repl>  let a = 1; in aéa + 1                
Starting parse
Entering state 0
Reading a token: Next token is token LET (1.2-4: )
Shifting token LET (1.2-4: )
Entering state 11
Reading a token: Next token is token ID (1.6: )
Reducing stack 0 by rule 65 (line 442):
-> $$ = nterm binds (1.5: )
Entering state 34
Next token is token ID (1.6: )
Shifting token ID (1.6: )
Entering state 79
Reducing stack 0 by rule 73 (line 484):
   $1 = token ID (1.6: )
-> $$ = nterm attr (1.6: )
Entering state 86
Reducing stack 0 by rule 71 (line 471):
   $1 = nterm attr (1.6: )
-> $$ = nterm attrpath (1.6: )
Entering state 85
Reading a token: Next token is token '=' (1.8: )
Shifting token '=' (1.8: )
Entering state 132
Reading a token: Next token is token INT (1.10: )
Shifting token INT (1.10: )
Entering state 2
Reducing stack 0 by rule 39 (line 371):
   $1 = token INT (1.10: )
-> $$ = nterm expr_simple (1.10: )
Entering state 27
Reading a token: Next token is token ';' (1.11: )
Reducing stack 0 by rule 37 (line 361):
   $1 = nterm expr_simple (1.10: )
-> $$ = nterm expr_select (1.10: )
Entering state 26
Reducing stack 0 by rule 33 (line 349):
   $1 = nterm expr_select (1.10: )
-> $$ = nterm expr_app (1.10: )
Entering state 25
Next token is token ';' (1.11: )
Reducing stack 0 by rule 31 (line 343):
   $1 = nterm expr_app (1.10: )
-> $$ = nterm expr_op (1.10: )
Entering state 24
Next token is token ';' (1.11: )
Reducing stack 0 by rule 12 (line 320):
   $1 = nterm expr_op (1.10: )
-> $$ = nterm expr_if (1.10: )
Entering state 23
Reducing stack 0 by rule 10 (line 315):
   $1 = nterm expr_if (1.10: )
-> $$ = nterm expr_function (1.10: )
Entering state 22
Reducing stack 0 by rule 2 (line 294):
   $1 = nterm expr_function (1.10: )
-> $$ = nterm expr (1.10: )
Entering state 153
Next token is token ';' (1.11: )
Shifting token ';' (1.11: )
Entering state 163
Reducing stack 0 by rule 62 (line 423):
   $1 = nterm binds (1.5: )
   $2 = nterm attrpath (1.6: )
   $3 = token '=' (1.8: )
   $4 = nterm expr (1.10: )
   $5 = token ';' (1.11: )
-> $$ = nterm binds (1.5-11: )
Entering state 34
Reading a token: Next token is token IN (1.13-14: )
Shifting token IN (1.13-14: )
Entering state 80
Reading a token: Next token is token ID (1.16: )
Shifting token ID (1.16: )
Entering state 1
Reading a token: Now at end of input.
Reducing stack 0 by rule 38 (line 365):
   $1 = token ID (1.16: )
-> $$ = nterm expr_simple (1.16: )
Entering state 27
Now at end of input.
Reducing stack 0 by rule 37 (line 361):
   $1 = nterm expr_simple (1.16: )
-> $$ = nterm expr_select (1.16: )
Entering state 26
Reducing stack 0 by rule 33 (line 349):
   $1 = nterm expr_select (1.16: )
-> $$ = nterm expr_app (1.16: )
Entering state 25
Now at end of input.
Reducing stack 0 by rule 31 (line 343):
   $1 = nterm expr_app (1.16: )
-> $$ = nterm expr_op (1.16: )
Entering state 24
Now at end of input.
Reducing stack 0 by rule 12 (line 320):
   $1 = nterm expr_op (1.16: )
-> $$ = nterm expr_if (1.16: )
Entering state 23
Reducing stack 0 by rule 10 (line 315):
   $1 = nterm expr_if (1.16: )
-> $$ = nterm expr_function (1.16: )
Entering state 126
Reducing stack 0 by rule 9 (line 309):
   $1 = token LET (1.2-4: )
   $2 = nterm binds (1.5-11: )
   $3 = token IN (1.13-14: )
   $4 = nterm expr_function (1.16: )
-> $$ = nterm expr_function (1.2-16: )
Entering state 22
Reducing stack 0 by rule 2 (line 294):
   $1 = nterm expr_function (1.2-16: )
-> $$ = nterm expr (1.2-16: )
Entering state 21
Reducing stack 0 by rule 1 (line 292):
   $1 = nterm expr (1.2-16: )
-> $$ = nterm start (1.2-16: )
Entering state 20
Now at end of input.
Shifting token $end (1.17: )
Entering state 53
Cleanup: popping token $end (1.17: )
Cleanup: popping nterm start (1.2-16: )
1

I really don't get that part:

[...]
Reading a token: Next token is token ID (1.16: )
Shifting token ID (1.16: )
Entering state 1
Reading a token: Now at end of input.
[...]

I don't understand why the lexer is consuming the rest of the input as soon as it reaches a unicode char. Isn't the ANY match supposed to consume the chars one by one?

As I said in the PR, I did not find anything better than explicitly throwing an error while lexing a unicode identifier.

@dezgeg
Contributor

dezgeg commented May 10, 2018

A random guess: missing %option 8bit from the lexer?

@edolstra
Member

That didn't help unfortunately. Apparently 8-bit is the default unless you're using certain table compression options.

@edolstra edolstra closed this in 1ad1923 May 11, 2018
@picnoir picnoir deleted the unicodeIdentifiers branch May 11, 2018 10:10