Item13447: CharsetConverter enhancements.
gac410 committed Jun 7, 2015
1 parent 30426f7 commit 5a5735b
Showing 2 changed files with 67 additions and 18 deletions.
25 changes: 19 additions & 6 deletions data/System/CharsetConverterContrib.txt
@@ -1,4 +1,4 @@
%META:TOPICINFO{author="ProjectContributor" version="1" date="1402505433"}%
%META:TOPICINFO{author="ProjectContributor" version="1" date="1433640057"}%
<!--
* Set SHORTDESCRIPTION = %$SHORTDESCRIPTION%
-->
@@ -9,7 +9,7 @@
%TOC%

This module is used to convert the character set encoding used in
*RcsWrap and RcsLite* stores.
*RcsWrap and !RcsLite* stores.

The character set encoding determines the range of characters that can be
used for naming wiki topics and attachments, and in content
@@ -36,7 +36,7 @@ associated with the entire database, and not with individual topics.
It was even possible to paste content in a different encoding into the
text editor and have it stored in that encoding, resulting in what looked
like garbled topics.

Ideally all Foswikis should use UTF-8, even those that are still using
older Foswikis, but we have a legacy of existing sites that don't. So we
need some way to convert an RCS-based wiki from any existing character
@@ -52,7 +52,7 @@ histories.
Even if you don't have an immediate need for non-western character sets
this is worth doing, as Foswiki 1.2.0 and later work exclusively with
UTF-8 content.

Note that this module converts all the histories of all your topics,
as well as the latest version of the topic. It also maps all web,
topic and attachment names. It does not, however, touch the _content_ of
@@ -61,7 +61,7 @@ attachments.
---++ Installation
This extension is tested with Foswiki 1.1.0 and later. If your Foswiki
installation is older than that, then upgrade your Foswiki first.
Note that the extension *is not required* and *will not work* on Foswiki
Note that the extension *is not required* on Foswiki
1.2.0 or later. If your requirement is part of an upgrade to Foswiki 1.2.0,
then either:
1 convert the 1.1.x Foswiki to UTF-8 using this extension first, or
@@ -70,6 +70,8 @@ then either:
%$INSTALL_INSTRUCTIONS%

---++ Usage
<div class="foswikiHelp">%X% *The conversion process updates data in-place, and cannot be reversed. Be sure to take a backup before running this tool.* </div>

The convertor is used from the command-line on your wiki server (if you do
not have access to the command line then we are sorry, but there is currently
no way for you to use the conversion).
@@ -91,6 +93,9 @@ Options:
| =-q= | quiet - work silently (unless there's an error) |
| =-a= | abort - on error (default is to report and continue) |
| =-r= | repair - detect the encoding of each string and repair inconsistencies. |
| | __Expert options__ |
| =-web=webname= | Restrict conversion to a single web and its subwebs. |
| =-encoding=charset= | Override the source encoding. |

Only use =-r= if your site may contain content which cannot be decoded
using the {Site}{CharSet} (if this is the case, -i will abort with an
@@ -102,10 +107,18 @@ can follow. These are of two types:
* =topic-path=actual-encoding=
The first allows you to override the encoding of *all* strings detected as
=detected-encoding=, while the second allows you to select an individual topic
and override the encoding of the content of just that topic. If you need to
and override the encoding of the content of just that topic. If you need to
override the encoding of a web or topic name, use =:N= after the topic-path
e.g. =Sandbox/NorthKorea:N=EUC-KR=

Although this extension is intended for use on Foswiki 1.1, there may be cases
where an individual web requires conversion on a Foswiki 1.2 system, such as
a single web migrated at a later date from an older system.
*Use extreme caution when converting individual webs.* Foswiki does
*not* support mixed encodings. For example, to convert the Oops web from
=iso-8859-1= on a system already converted to =utf-8=:
=perl convert_charset.pl -web=Oops -encoding=iso-8859-1 -i=

Once you have run the script without -i, all:
* web names
* topic names
60 changes: 48 additions & 12 deletions lib/Foswiki/Contrib/CharsetConverterContrib.pm
@@ -29,6 +29,9 @@ our $session;
my $convertCount = 0;
my $renameCount = 0;

my $storeEncoding;
my $storeVersion;

sub report {
return if $options->{-q};
print STDERR join( ' ', @_ ) . "\n";
@@ -59,7 +62,7 @@ sub detect_encoding {
sub _convert_string {
my ( $old, $where, $extra ) = @_;
return 0 unless defined $old;
my $e = $Foswiki::cfg{Site}{CharSet};
my $e = $storeEncoding;

if ( $options->{-r} ) {
my $i = $where . ( $extra eq 'name of' ? ':N' : '' );
@@ -70,7 +73,7 @@ sub _convert_string {
else {
require Encode::Detect::Detector;
my $de = Encode::Detect::Detector::detect($old);
if ( $de && $de !~ /^$Foswiki::cfg{Site}{CharSet}$/i ) {
if ( $de && $de !~ /^$storeEncoding$/i ) {

# Support overrides
if ( $options->{$de} ) {
@@ -90,7 +93,7 @@ sub _convert_string {
# Special case: if the site encoding is iso-8859-* or utf-8 and the string
# contains only 7-bit characters, then don't bother transcoding it
# (irrespective of any (probably incorrect) detected encoding)
if ( $Foswiki::cfg{Site}{CharSet} =~ /^(utf-8|iso-8859-1)$/
if ( $storeEncoding =~ /^(utf-8|iso-8859-1)$/
&& $old !~ /[^\x00-~]/ )
{
return 0;
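
Note on this hunk: the special case skips transcoding for pure 7-bit strings, because such data is byte-for-byte identical in iso-8859-1 and utf-8. The comment mentions iso-8859-*, while the pattern as shown matches only iso-8859-1. The following is a minimal standalone sketch of the same test, not code from the commit; `needs_transcoding` is a hypothetical helper name, and the round trip through the core `Encode` module is only there to show what happens to a string that fails the test.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of the commit): returns true when a byte
# string stored in $encoding would have to be transcoded to reach UTF-8.
sub needs_transcoding {
    my ( $bytes, $encoding ) = @_;

    # Same test as the hunk: a string holding only bytes \x00..~ is
    # unchanged between iso-8859-1 and utf-8, so it can be left alone.
    return 0
      if $encoding =~ /^(utf-8|iso-8859-1)$/ && $bytes !~ /[^\x00-~]/;
    return 1;
}

my $ascii  = "WebHome";
my $latin1 = "Caf\xE9";    # "Cafe" with e-acute, encoded as iso-8859-1

for my $s ( $ascii, $latin1 ) {
    next unless needs_transcoding( $s, 'iso-8859-1' );

    # Decode the legacy bytes to characters, then encode them as UTF-8.
    my $utf8 = Encode::encode( 'UTF-8', Encode::decode( 'iso-8859-1', $s ) );
    printf "%d bytes -> %d bytes\n", length($s), length($utf8);
}
```

Only the second string is transcoded; the single high byte becomes a two-byte UTF-8 sequence, so the script prints `4 bytes -> 5 bytes`.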
@@ -141,7 +144,30 @@ sub convert_database {

$options = \%args;

report "Database is currently using $Foswiki::cfg{Site}{CharSet}";
if ( $Foswiki::VERSION < 1.1.999 ) {
report "Detected Foswiki Version 1.1 or older";
$storeVersion = 1;
}
else {
report "Detected Foswiki Version >= 1.2";
$storeVersion = 2;
}

if ( $options->{-encoding} ) {
$storeEncoding = $options->{-encoding};
report "Store encoding ignored, using encoding $storeEncoding";
}
elsif ( $storeVersion == 2 ) {
$storeEncoding = $Foswiki::cfg{Store}{Encoding} || 'utf-8';
report "Foswiki 1.2 Database, using encoding $storeEncoding";
}
else {
$storeEncoding = $Foswiki::cfg{Site}{CharSet};
report "Foswiki 1.1 Database, using encoding $storeEncoding";
}

my $web = $options->{-web} || '';
report "Processing restricted to $web web" if $web;

# Must do this before we construct the session object, otherwise the store
# cache gets populated with Wrap handlers
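
Note on this hunk: the source encoding is now chosen in a fixed order of precedence. An explicit `-encoding` option wins, then `{Store}{Encoding}` (defaulting to utf-8) on a 1.2 store, and otherwise `{Site}{CharSet}`. As a sketch only, the same precedence can be written as a pure function that takes its inputs as arguments. `choose_store_encoding` and the sample `%cfg` hash are hypothetical names used for illustration; they are not part of the commit.

```perl
use strict;
use warnings;

# Hypothetical helper: mirrors the precedence in the hunk, but reads from
# its arguments instead of %Foswiki::cfg and the global $options hash.
sub choose_store_encoding {
    my ( $override, $store_version, $cfg ) = @_;

    # 1. An explicit -encoding option always wins.
    return $override if $override;

    # 2. A Foswiki 1.2 store uses {Store}{Encoding}, defaulting to utf-8.
    return $cfg->{Store}{Encoding} || 'utf-8' if $store_version == 2;

    # 3. A Foswiki 1.1 store uses the site character set.
    return $cfg->{Site}{CharSet};
}

# Sample configuration, invented for this example.
my %cfg = ( Site => { CharSet => 'iso-8859-15' }, Store => {} );

print choose_store_encoding( undef,        1, \%cfg ), "\n";    # iso-8859-15
print choose_store_encoding( undef,        2, \%cfg ), "\n";    # utf-8
print choose_store_encoding( 'iso-8859-1', 2, \%cfg ), "\n";    # iso-8859-1
```

Keeping the decision in one function like this would make the precedence easy to unit-test without a live Foswiki session, which may be worth considering if more store versions need to be handled later.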
@@ -151,17 +177,18 @@ sub convert_database {
# First we rename all webs and files as necessary by
# calling the recursive collection rename on the root web
foreach my $tree ( $Foswiki::cfg{DataDir}, $Foswiki::cfg{PubDir} ) {
_rename_collection( $tree, '' );
_rename_collection( $tree, $web );
}

# All file and directory names should now be utf8

# Now we convert the content of topics
_convert_topics_contents('');
_convert_topics_contents($web);

# And that's it!
print STDERR
"CONVERSION FINISHED: Moved: $renameCount, Converted $convertCount\n";
report "CONVERSION FINISHED: "
. ( ( $options->{-i} ) ? '(simulated) ' : '' )
. "Moved: $renameCount, Converted $convertCount\n";

$session->finish();
}
@@ -184,7 +211,7 @@ sub _rename_collection {
foreach my $e ( readdir($dir) ) {
next if $e =~ /^\./;

#print STDERR "Collected $e $Foswiki::cfg{Site}{CharSet}\n";
#print STDERR "Collected $e $storeEncoding\n";
my $ne = $e;
if ( _convert_string( $ne, "$web/$ne", "name of" ) ) {
if ( $ne ne $e ) {
@@ -213,9 +240,18 @@ sub _convert_topic {
my $converted = 0;

# Convert .txt,v
my $handler =
Foswiki::Store::VC::RcsLiteHandler->new( $session->{store}, $web,
$topic );
my $handler;

if ( $storeVersion == 2 ) {
$handler =
Foswiki::Store::Rcs::RcsLiteHandler->new( $session->{store}, $web,
$topic );
}
else {
$handler =
Foswiki::Store::VC::RcsLiteHandler->new( $session->{store}, $web,
$topic );
}
my $uh = Encode::decode_utf8("$web.$topic");

# Force reading of the topic history, all the way down to revision 1