[Informational] PowerShell's default encoding is UTF-8 without a BOM for text files #59

Open
Kriegel opened this issue Aug 16, 2020 · 3 comments

Comments

Kriegel commented Aug 16, 2020

Hi Dan!

In response to your great work on file encoding, I'd like to point you to the following information about Microsoft's decision on text file encoding (in case you have not already seen it):

Default encoding is UTF-8 without a BOM except for New-ModuleManifest

Please allow me some words about it and your module.
I am also a BIG fan of the BOM, because files without a BOM force us to guess the encoding ... so I am sad about Microsoft's decision too ...
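
To illustrate the guessing problem, here is a minimal sketch in Python (used only for illustration, it is not part of the module). The function name `sniff_encoding` and the fallback to `cp1252` are assumptions: with a BOM the encoding is certain, without one the result is only a guess.

```python
import codecs

# Byte-order marks, checked longest-first so that the UTF-32 LE mark
# (which begins with the UTF-16 LE mark) is not misread as UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes) -> tuple:
    """Return (encoding, certain); certain is True only when a BOM was found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, True
    # No BOM: everything from here on is a guess.
    try:
        data.decode("utf-8")
        return "utf-8", False
    except UnicodeDecodeError:
        # Hypothetical fallback; a real tool would need a better heuristic.
        return "cp1252", False
```

Note that pure ASCII content and many single-byte encodings cannot be told apart this way, which is exactly why a missing BOM leaves the reader guessing.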

I have the impression that your module tries to serve two purposes: encoding and formatting.
I think it is better to separate these completely. That makes it easier to maintain, and contributors have an easier time contributing to one of the two topics.

Anyway...
I really, really appreciate the work you have done here.

Have a good time, stay well
Peter

DTW-DanWard (Owner) commented Aug 16, 2020

Thank you! I'm glad to see folks are still using this Beautifier tool. I've had no time to update it the past few years. Three or four years ago I was looking into rewriting it entirely with the AST and had some crazy awesome ideas about functionality to add: obfuscation, auto-wrapping of functions in regions with the text Function: [function name], users could write their own rules, etc. Unfortunately I ran into a show stopper and had to abandon those ideas.

Yes, originally I had separated the encoding and formatting functionality into 2 separate modules, but that created more friction for users to get it running, so I ended up combining them. Of course, this was before the code was up in PowerShell Gallery, so maybe now I could separate them.

The automatic adding of the UTF8 BOM was something I had to do. I can't remember the exact API (probably the tokenize API in the .NET framework, not a particular cmdlet) but it would fail with an exception if there was non-ASCII code in a file without the UTF8 BOM. It wasn't smart enough to look ahead and determine the proper encoding. Of course this was with Windows PowerShell / earlier .NET code so it's possible & likely they fixed that with PS 6 / .NET Core. However if I remove that auto-adding of UTF8, it breaks for Windows PowerShell users. I guess I could check to see which version of PowerShell they are using....
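
The version-dependent behaviour described above could be sketched roughly as follows, in Python for illustration only. The names `ensure_utf8_bom` and `legacy_host` are hypothetical; `legacy_host` stands in for "the script will be read by Windows PowerShell". This is not how the module is implemented.

```python
import codecs

_ALL_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def ensure_utf8_bom(data: bytes, legacy_host: bool) -> bytes:
    """Prepend a UTF-8 BOM only when a legacy reader would otherwise need it."""
    if not legacy_host:
        return data          # assumption: newer hosts cope without a BOM
    if data.startswith(_ALL_BOMS):
        return data          # already marked, leave untouched
    if data.isascii():
        return data          # pure ASCII needs no BOM
    data.decode("utf-8")     # raises UnicodeDecodeError if not valid UTF-8
    return codecs.BOM_UTF8 + data
```

With this shape, files for newer hosts are left byte-for-byte unchanged, and only non-ASCII, BOM-less files aimed at the legacy host get modified.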

Thank you again for the kind words!

Kriegel (Author) commented Aug 17, 2020

Hi Dan,

If I understand correctly, your module relies on the "old" .NET tokenizer!?

This .NET tokenizer knows only 20 different token types.

Since PowerShell 3, there is a new and more powerful .NET parser that breaks PowerShell code up into a much more detailed range of tokens.

[System.Management.Automation.Language.Parser]

The new parser knows 150 different token kinds, each of which can be decorated with 26 token flags.
This provides a very detailed picture, especially when it comes to nested tokens.
Additionally, it returns the Abstract Syntax Tree (AST).

In my experience, the AST is not the holy grail; I prefer tokens and use the AST only as a helper.

users could write their own rules

Yes! PSScriptAnalyzer and its cmdlet Invoke-Formatter offer this in a very complicated way... (and the rules are mostly in C# :-( )

obfuscation

That is the opposite of pretty-printing / beautifying PowerShell code lol
Just convert it to a Base64 string and you're done lol
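
As a rough illustration of the Base64 idea, the sketch below (Python, illustration only) encodes a script the way PowerShell's `-EncodedCommand` parameter expects it, that is, Base64 over the UTF-16LE bytes. The helper names are made up for this example, and this is trivially reversible, so it hides nothing from anyone who decodes it.

```python
import base64


def to_encoded_command(script: str) -> str:
    """Base64-encode a script as UTF-16LE, the form -EncodedCommand accepts."""
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")


def from_encoded_command(encoded: str) -> str:
    """Reverse the encoding; shows how little protection this offers."""
    return base64.b64decode(encoded).decode("utf-16-le")
```

For example, `to_encoded_command("dir")` gives `ZABpAHIA`, and `from_encoded_command` turns it straight back into `dir`.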

Do you plan further development of this module?
If so, I think it will end in a complete rewrite.

Currently I am developing my own PowerShell beautifier here on GitHub; it is at a very early stage.
BeautyOfPower

In my tests I only use the new parser, which reads the PowerShell source files without problems.
So far I have tested only with Windows PowerShell 5.1 and PowerShell 7.

It even reads your "bad" test scripts such as UTF8_NoBOM.ps1 (and the others without a BOM).
So currently I do not care about encoding.
(And if I did, I think this is a problem the end user must solve, not my module, apart from warning the user and displaying encoding errors.)

I do not like to present another man's (hard) work as mine!
If you do not plan to develop your PowerShell Beautifier further, please give me explicit permission to copy over (adopt) code AND
to copy over large parts of your documentation from your module here into my module "BeautyOfPower".

(I also like your documentation style and how deeply it dives into the internals.)

DTW-DanWard (Owner) commented

Hi - sorry for the very late reply, this got lost in my Inbox. Yes, I've abandoned this for now and have no intention of picking it back up. Yes, I built this on the older tokenizer and the code does need a rewrite but I don't have time for that. Feel free to borrow any code or ideas that you'd like.
