
Throw an error when a unicode character is parsed in an identifier #2129

Closed
wants to merge 1 commit into from

Conversation

picnoir
Member

@picnoir picnoir commented May 1, 2018

Fixes #1374.

I am a bit unsure about my approach here, but I couldn't come up with anything better... I am open to any suggestion :)

Before

nix-repl> let a é 1; in a + 1  
       ...

nix-repl>  let a = 1; in aé + 1  
1

After

nix-repl> let a é 1; in a + 1      
error: Cannot use the unicode character 'é' in a nix expression.

nix-repl> let a = 1; in aé + 1  
error: Cannot use the unicode character 'é' in a nix expression.

@edolstra
Member

edolstra commented May 1, 2018

Hm, I don't understand why the lexer is currently accepting those characters at all. The only token type that matches is ANY (passing the character on to the parser), but nothing in parser.y accepts them either. I would expect syntax error, unexpected $undefined, as in

nix-repl> a & b
error: syntax error, unexpected $undefined, expecting $end, at (string):1:3

@picnoir
Member Author

picnoir commented May 2, 2018

I don't get it either.

Here's the bison trace of the master branch:

nix-repl>  let a = 1; in aéa + 1                
Starting parse
Entering state 0
Reading a token: Next token is token LET (1.2-4: )
Shifting token LET (1.2-4: )
Entering state 11
Reading a token: Next token is token ID (1.6: )
Reducing stack 0 by rule 65 (line 442):
-> $$ = nterm binds (1.5: )
Entering state 34
Next token is token ID (1.6: )
Shifting token ID (1.6: )
Entering state 79
Reducing stack 0 by rule 73 (line 484):
   $1 = token ID (1.6: )
-> $$ = nterm attr (1.6: )
Entering state 86
Reducing stack 0 by rule 71 (line 471):
   $1 = nterm attr (1.6: )
-> $$ = nterm attrpath (1.6: )
Entering state 85
Reading a token: Next token is token '=' (1.8: )
Shifting token '=' (1.8: )
Entering state 132
Reading a token: Next token is token INT (1.10: )
Shifting token INT (1.10: )
Entering state 2
Reducing stack 0 by rule 39 (line 371):
   $1 = token INT (1.10: )
-> $$ = nterm expr_simple (1.10: )
Entering state 27
Reading a token: Next token is token ';' (1.11: )
Reducing stack 0 by rule 37 (line 361):
   $1 = nterm expr_simple (1.10: )
-> $$ = nterm expr_select (1.10: )
Entering state 26
Reducing stack 0 by rule 33 (line 349):
   $1 = nterm expr_select (1.10: )
-> $$ = nterm expr_app (1.10: )
Entering state 25
Next token is token ';' (1.11: )
Reducing stack 0 by rule 31 (line 343):
   $1 = nterm expr_app (1.10: )
-> $$ = nterm expr_op (1.10: )
Entering state 24
Next token is token ';' (1.11: )
Reducing stack 0 by rule 12 (line 320):
   $1 = nterm expr_op (1.10: )
-> $$ = nterm expr_if (1.10: )
Entering state 23
Reducing stack 0 by rule 10 (line 315):
   $1 = nterm expr_if (1.10: )
-> $$ = nterm expr_function (1.10: )
Entering state 22
Reducing stack 0 by rule 2 (line 294):
   $1 = nterm expr_function (1.10: )
-> $$ = nterm expr (1.10: )
Entering state 153
Next token is token ';' (1.11: )
Shifting token ';' (1.11: )
Entering state 163
Reducing stack 0 by rule 62 (line 423):
   $1 = nterm binds (1.5: )
   $2 = nterm attrpath (1.6: )
   $3 = token '=' (1.8: )
   $4 = nterm expr (1.10: )
   $5 = token ';' (1.11: )
-> $$ = nterm binds (1.5-11: )
Entering state 34
Reading a token: Next token is token IN (1.13-14: )
Shifting token IN (1.13-14: )
Entering state 80
Reading a token: Next token is token ID (1.16: )
Shifting token ID (1.16: )
Entering state 1
Reading a token: Now at end of input.
Reducing stack 0 by rule 38 (line 365):
   $1 = token ID (1.16: )
-> $$ = nterm expr_simple (1.16: )
Entering state 27
Now at end of input.
Reducing stack 0 by rule 37 (line 361):
   $1 = nterm expr_simple (1.16: )
-> $$ = nterm expr_select (1.16: )
Entering state 26
Reducing stack 0 by rule 33 (line 349):
   $1 = nterm expr_select (1.16: )
-> $$ = nterm expr_app (1.16: )
Entering state 25
Now at end of input.
Reducing stack 0 by rule 31 (line 343):
   $1 = nterm expr_app (1.16: )
-> $$ = nterm expr_op (1.16: )
Entering state 24
Now at end of input.
Reducing stack 0 by rule 12 (line 320):
   $1 = nterm expr_op (1.16: )
-> $$ = nterm expr_if (1.16: )
Entering state 23
Reducing stack 0 by rule 10 (line 315):
   $1 = nterm expr_if (1.16: )
-> $$ = nterm expr_function (1.16: )
Entering state 126
Reducing stack 0 by rule 9 (line 309):
   $1 = token LET (1.2-4: )
   $2 = nterm binds (1.5-11: )
   $3 = token IN (1.13-14: )
   $4 = nterm expr_function (1.16: )
-> $$ = nterm expr_function (1.2-16: )
Entering state 22
Reducing stack 0 by rule 2 (line 294):
   $1 = nterm expr_function (1.2-16: )
-> $$ = nterm expr (1.2-16: )
Entering state 21
Reducing stack 0 by rule 1 (line 292):
   $1 = nterm expr (1.2-16: )
-> $$ = nterm start (1.2-16: )
Entering state 20
Now at end of input.
Shifting token $end (1.17: )
Entering state 53
Cleanup: popping token $end (1.17: )
Cleanup: popping nterm start (1.2-16: )
1

I really don't get that part:

[...]
Reading a token: Next token is token ID (1.16: )
Shifting token ID (1.16: )
Entering state 1
Reading a token: Now at end of input.
[...]

I don't understand why the lexer is consuming the rest of the input as soon as it reaches a unicode char. Isn't the ANY match supposed to consume the chars one by one?

As I said in the PR, I did not find anything better than explicitly throwing an error while lexing a unicode identifier.

@dezgeg
Contributor

dezgeg commented May 10, 2018

A random guess: missing %option 8bit from the lexer?

@edolstra
Member

That didn't help unfortunately. Apparently 8-bit is the default unless you're using certain table compression options.

@edolstra edolstra closed this in 1ad1923 May 11, 2018
@picnoir picnoir deleted the unicodeIdentifiers branch May 11, 2018 10:10