tyru/php-rtf-parser

dev-master 2018-07-25 09:02 UTC

README

Run the frontend script extract-text-from-rtf.php to extract only the text from an RTF file.

php extract-text-from-rtf.php -f sample.rtf [-i <input encoding>] [-o <output encoding>]

Here are the default input/output encodings:

OS       input  output
Windows  guess  CP932
Others   guess  (detect from $LANG)

The input encoding is the encoding of the RTF file. It is normally the current code page of the Windows system on which the file was created; for example, CP932 on Japanese Windows. If the input encoding is guess, the script looks for the \ansicpg control word. \ansicpg declares the default character set used in the document unless it is \ansi (the default). If \ansicpgN is found (N is the parameter), the encoding is returned as the string "cp<N>"; for example, \ansicpg932 yields "cp932". Library users can get the encoding with the RtfParser\Document#getEncoding() method.
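The "\ansicpgN gives cp<N>" rule can be illustrated with a few lines of standalone PHP. This is only a sketch: the function name guessRtfEncoding() is hypothetical, it uses a plain regular expression rather than the library's scanner and parser, and returning null when no \ansicpg is present is an assumption made here, not documented behaviour of the library.

```php
<?php
// Hypothetical helper, NOT part of php-rtf-parser: a regex-based
// illustration of the "\ansicpgN -> cp<N>" rule.
function guessRtfEncoding(string $rtf): ?string
{
    // \ansicpg is followed directly by a decimal code page number.
    if (preg_match('/\\\\ansicpg(\d+)/', $rtf, $m) === 1) {
        return 'cp' . $m[1];
    }
    // Assumption of this sketch: signal "not found" with null.
    return null;
}

$sample = '{\rtf1\ansi\ansicpg932\deff0 Hello}';
echo guessRtfEncoding($sample), "\n"; // prints cp932
```

In real code, prefer RtfParser\Document#getEncoding(), which works on the parsed document rather than on the raw text.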

The output encoding is the encoding of standard output; for example, CP932 (the cmd.exe encoding) on Japanese Windows. You can also choose another encoding, such as UTF-8 with -o UTF-8. By default, on non-Windows platforms, the output encoding is detected from the LANG environment variable; if detection fails, 'UTF-8' is used.
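One plausible way to derive an encoding from LANG is to take the codeset part of a value such as "ja_JP.UTF-8". The sketch below assumes that approach; the function name encodingFromLang() is hypothetical and the script's actual detection logic may differ.

```php
<?php
// Hypothetical helper, NOT taken from the script: read the codeset from a
// LANG-style value ("ja_JP.UTF-8", "ja_JP.eucJP@modifier") and fall back
// to UTF-8 when there is none.
function encodingFromLang(?string $lang): string
{
    // The codeset sits after "." and before an optional "@modifier".
    if ($lang !== null && preg_match('/\.([^.@]+)(?:@.*)?$/', $lang, $m) === 1) {
        return $m[1];
    }
    return 'UTF-8';
}

// getenv() returns false when LANG is not set.
$lang = getenv('LANG');
echo encodingFromLang($lang === false ? null : $lang), "\n";
```

The sketch does not check that the extracted name is one mbstring recognises; a stricter version could compare it against mb_list_encodings() before using it.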

These arguments are passed to the mb_convert_encoding() function if the two encodings differ.
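The conversion step can be sketched as follows. The helper name convertIfNeeded() is hypothetical, and comparing the encoding names case-insensitively is an assumption of this sketch; only mb_convert_encoding() itself (with its to-encoding before from-encoding argument order) is standard PHP.

```php
<?php
// Hypothetical helper: convert only when the two encodings differ.
function convertIfNeeded(string $text, string $from, string $to): string
{
    // Assumption: names are compared case-insensitively ("utf-8" == "UTF-8").
    if (strcasecmp($from, $to) === 0) {
        return $text;
    }
    // Note the argument order: text, target encoding, source encoding.
    return mb_convert_encoding($text, $to, $from);
}

// "caf\xE9" is "café" in ISO-8859-1; the call re-encodes it as UTF-8.
$latin1 = "caf\xE9";
echo convertIfNeeded($latin1, 'ISO-8859-1', 'UTF-8'), "\n";
```

Encoding names such as CP932 would be passed in the same way, provided the installed mbstring extension supports them.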

RtfParser

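// $rtf holds the raw contents of the RTF file
$rtf = file_get_contents('sample.rtf');
$scanner = new RtfParser\Scanner($rtf);
$parser = new RtfParser\Parser($scanner);
$doc = $parser->parse();
// Concatenate the text of every top-level node
$text = '';
foreach ($doc->childNodes() as $node) {
  $text .= $node->text();
}
echo $text;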

$parser->parse() returns an RtfParser\Document instance. $doc->childNodes() returns an array of RtfParser\Node\Node. Currently, the RtfParser\Node\Node interface supports only the text() and name() methods.