From 50d45c6deb0afd2e4222d2e33a45487a9d1fa676 Mon Sep 17 00:00:00 2001 From: Ralph Amissah Date: Sun, 23 Sep 2007 05:16:21 +0100 Subject: primarily todo with sisu documentation, changelog reproduced below: * start documenting sisu using sisu * sisu markup source files in data/doc/sisu/sisu_markup_samples/sisu_manual/ /usr/share/doc/sisu/sisu_markup_samples/sisu_manual/ * default output [sisu -3] in data/doc/manuals_generated/sisu_manual/ /usr/share/doc/manuals_generated/sisu_manual/ (adds substantially to the size of sisu package!) * help related edits * manpage, work on ability to generate manpages, improved * param, exclude footnote mark count when occurs within code block * plaintext changes made * shared_txt, line wrap visited * file:// link option introduced (in addition to existing https?:// and ftp://) a bit arbitrarily, diff here, [double check changes in sysenv and hub] * minor adjustments * html url match refinement * css added tiny_center * plaintext * endnotes fix * footnote adjustment to make more easily distinguishable from substantive text * flag -a only [flags -A -e -E dropped] controlled by modifiers --unix/msdos --footnote/endnote * defaults, homepage * renamed homepage (instead of index) implications for modifying skins, which need likewise to have any homepage entry renamed * added link to sisu_manual in homepage * css the css for the default homepage is renamed homepage.css (instead of index.css) [consider removing this and relying on html.css] * ruby version < ruby1.9 * place stop on installation and working with for now [ruby String.strip broken in ruby 1.9.0 (2007-09-10 patchlevel 0) [i486-linux], 2007-09-18:38/2] * debian/control restrict use to ruby > 1.8.4 and ruby < 1.9 * debian * debian/control restrict use to ruby > 1.8.4 and ruby < 1.9 * sisu-doc new sub-package for sisu documentation debian/control and sisu-doc.install --- .../sisu_manual/sisu_introduction/sax.xml | 599 +++++++++++++++++++++ 1 file changed, 599 insertions(+) create 
mode 100644 data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml (limited to 'data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml') diff --git a/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml b/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml new file mode 100644 index 00000000..2b0d3432 --- /dev/null +++ b/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml @@ -0,0 +1,599 @@ + + + + + + + Title: + + SiSU - Commands [0.58] + +
+ Creator: + + Ralph Amissah + +
+ Rights: + + Copyright (C) Ralph Amissah 2007, part of SiSU documentation, License GPL 3 + +
+ Type: + + information + +
+ Subject: + + ebook, epublishing, electronic book, electronic publishing, electronic document, electronic citation, data structure, citation systems, search + +
+ Date created: + + 2002-08-28 + +
+ Date issued: + + 2002-08-28 + +
+ Date available: + + 2002-08-28 + +
+ Date modified: + + 2007-09-16 + +
+ Date: + + 2007-09-16 + +
+ + + + 1 + + SiSU - Commands [0.58],
Ralph Amissah +
+
+ + 2 + + What is SiSU? + + + + 3 + + Description + + + + 4 + + 1. Introduction - What is SiSU? + + + + 5 + + SiSU is a system for document markup, publishing (in multiple +open standard formats) and search + + + + 6 + + SiSU1 is a2 framework for document +structuring, publishing and search, comprising (a) a lightweight +document structure and presentation markup syntax and (b) an +accompanying engine for generating standard document format outputs +from documents prepared in sisu markup syntax, which is able to produce +multiple standard outputs that (can) share a common numbering system +for the citation of text within a document. + + + 1 + + "SiSU information Structuring Universe" or "Structured +information, Serialized Units".
also chosen for the meaning of +the Finnish term "sisu". +
+
+ + 2 + + Unix command line oriented + + +
+ + 7 + + SiSU is developed under an open source, software libre license +(GPL3). It has been developed in the context of coping with large +document sets with evolving markup related technologies, for which you +want multiple output formats, a common mechanism for +cross-output-format citation, and search. + + + + 8 + + SiSU both defines a markup syntax and provides an engine that +produces open standards format outputs from documents prepared with +SiSU markup. From a single lightly prepared document sisu custom +builds several standard output formats which share a common (text +object) numbering system for citation of content within a document +(that also has implications for search). The sisu engine works with an +abstraction of the document's structure and content from which it is +possible to generate different forms of representation of the document. +Significantly, SiSU markup is more sparse than html, and the outputs, +which include html, LaTeX, landscape and portrait pdfs and Open Document +Format (ODF), can all be added to and updated. SiSU is +also able to populate SQL type databases at an object level, which +means that searches can be made with that degree of granularity. +Results of objects (primarily paragraphs and headings) can be viewed +directly in the database, or just the object numbers shown - your +search criteria are met in these documents and at these locations within +each document. + + + + 9 + + Source document preparation and output generation is a two-step +process: (i) document source is prepared, that is, marked up in sisu +markup syntax and (ii) the desired output subsequently generated by +running the sisu engine against document source. Output representations +if updated (in the sisu engine) can be generated by re-running the +engine against the prepared source. 
Using SiSU markup applied to +a document, SiSU custom builds various standard open output +formats including plain text, HTML, XHTML, XML, OpenDocument, LaTeX or +PDF files, and populates an SQL database with objects3 +(equating generally to paragraph-sized chunks) so searches may be +performed and matches returned with that degree of granularity (e.g. +your search criteria are met by these documents and at these locations +within each document). Document output formats share a common object +numbering system for locating content. This is particularly suitable +for "published" works (finalized texts as opposed to works that are +frequently changed or updated) for which it provides a fixed means of +reference of content. + + + 3 + + objects include: headings, paragraphs, verse, tables, images, but not +footnotes/endnotes which are numbered separately and tied to the object +from which they are referenced. + + + + + 10 + + In preparing a SiSU document you optionally provide semantic +information related to the document in a document header, and in +marking up the substantive text provide information on the structure of +the document, primarily indicating heading levels and footnotes. You +also provide information on basic text attributes where used. The rest +is automatic: sisu from this information custom builds4 the +different forms of output requested. + + + 4 + + i.e. the html, pdf, odf outputs are each built individually and +optimised for that form of presentation, rather than for example the +html being a saved version of the odf, or the pdf being a saved version +of the html. + + + + + 11 + + SiSU works with an abstraction of the document based on its +structure which is comprised of its frame5 and the +objects6 it contains, which enables SiSU to represent +the document in many different ways, and to take advantage of the +strengths of different ways of presenting documents. 
The objects are +numbered, and these numbers can be used to provide a common base for +citing material within a document across the different output format +types. This is significant as page numbers are not suited to the +digital age: in web publishing, changing a browser's default font or +using a different browser means that text appears on different pages; +and in publishing in different formats, html, landscape and portrait +pdf etc. again page numbers are of no use to cite text in a manner that +is relevant against the different output types. Dealing with documents +at an object level together with object numbering also has implications +for search. + + + 5 + + the different heading levels + + + + 6 + + units of text, primarily paragraphs and headings, also any tables, +poems, code-blocks + + + + + 12 + + One of the challenges of maintaining documents is to keep them in a +format that would allow users to use them without depending on a +proprietary software popular at the time. Consider the ease of dealing +with legacy proprietary formats today and what guarantee you have that +old proprietary formats will remain (or can be read without proprietary +software/equipment) in 15 years' time, or the way in which html +has evolved over its relatively short span of existence. SiSU +provides the flexibility of outputting documents in multiple +non-proprietary open formats including html, pdf7 and the ISO +standard ODF.8 Whilst SiSU relies on software, the +markup is uncomplicated and minimalistic which guarantees that future +engines can be written to run against it. It is also easily converted +to other formats, which means documents prepared in SiSU can be +migrated to other document formats. 
Further security is provided by the +fact that the software itself, SiSU, is available under GPL3, a +license that guarantees that the source code will always be open, and +free as in libre, which means that the code base can be used, updated +and further developed as required under the terms of its license. +Another challenge is to keep up with a moving target. SiSU +permits new forms of output to be added as they become important (Open +Document Format text was added in 2006), and existing output to be +updated (html has evolved and the related module has been updated +repeatedly over the years; presumably when the World Wide Web +Consortium (w3c) finalises html 5, which is currently under development, +the html module will again be updated, allowing all existing documents +to be regenerated as html 5). + + + 7 + + Specification submitted by Adobe to ISO to become a full open ISO +specification
<http://www.linux-watch.com/news/NS7542722606.html> +
+
+ + 8 + + ISO/IEC 26300:2006 + + +
+ + 13 + + The document formats are written to the file-system and available for +indexing by independent indexing tools, whether off the web like Google +and Yahoo or on the site like Lucene and Hyperestraier. + + + + 14 + + SiSU also provides other features such as concordance files and +document content certificates, and working against an abstraction +of document structure opens further possibilities for the research and +development of other document representations. The availability of +objects is useful, for example, for topic maps and the commercial law +thesaurus by Vikki Rogers and Al Kritzer, which together with the flexibility +of SiSU offers great possibilities. + + + + 15 + + SiSU is primarily for published works, which can take advantage +of the citation system to reliably reference its documents. SiSU +works well in a complementary manner with such collaborative +technologies as Wikis, which can take advantage of and be used to +discuss the substance of content prepared in SiSU. + + + + 16 + + <http://www.jus.uio.no/sisu> + + + + 17 + + 2. How does sisu work? + + + + 18 + + SiSU markup is fairly minimalistic; it consists of: a (largely +optional) document header, made up of information about the document +(such as when it was published, who authored it, and granting what +rights) and any processing instructions; and markup within the +substantive text of the document, which is related to document +structure and typeface. SiSU must be able to discern the +structure of a document, (text headings and their levels in relation to +each other), either from information provided in the document header or +from markup within the text (or from a combination of both). Processing +is done against an abstraction of the document comprising +information on the document's structure and its objects,[2] which the +program serializes (providing the object numbers) and which are +assigned hash sum values based on their content. 
This abstraction of +information about document structure, objects, (and hash sums), +provides considerable flexibility in representing documents different +ways and for different purposes (e.g. search, document layout, +publishing, content certification, concordance etc.), and makes it +possible to take advantage of some of the strengths of established ways +of representing documents, (or indeed to create new ones). + + + + 19 + + 3. Summary of features + + + + 20 + + sparse/minimal markup (clean utf-8 source texts). Documents are +prepared in a single UTF-8 file using a minimalistic mnemonic syntax. +Typical literature, documents like "War and Peace" require almost no +markup, and most of the headers are optional. + + + + 21 + + markup is easily readable/parsable by the human eye, (basic markup is +simpler and more sparse than the most basic HTML), [this may also be +converted to XML representations of the same input/source document]. + + + + 22 + + markup defines document structure (this may be done once in a header +pattern-match description, or for heading levels individually); basic +text attributes (bold, italics, underscore, strike-through etc.) as +required; and semantic information related to the document (header +information, extended beyond the Dublin core and easily further +extended as required); the headers may also contain processing +instructions. SiSU markup is primarily an abstraction of +document structure and document metadata to permit taking advantage of +the basic strengths of existing alternative practical standard ways of +representing documents [be that browser viewing, paper publication, sql +search etc.] 
(html, xml, odf, latex, pdf, sql) + + + + 23 + + for output produces reasonably elegant output of established industry +and institutionally accepted open standard formats.[3] takes advantage +of the different strengths of various standard formats for representing +documents, amongst the output formats currently supported are: + + + + 24 + + html - both as a single scrollable text and a segmented document + + + + 25 + + xhtml + + + + 26 + + XML - both in sax and dom style xml structures for further +development as required + + + + 27 + + ODF - open document format, the iso standard for document storage + + + + 28 + + LaTeX - used to generate pdf + + + + 29 + + pdf (via LaTeX) + + + + 30 + + sql - population of an sql database, (at the same object level +that is used to cite text within a document) + + + + 31 + + Also produces: concordance files; document content certificates (md5 or +sha256 digests of headings, paragraphs, images etc.) and html manifests +(and sitemaps of content). (b) takes advantage of the strengths +implicit in these very different output types, (e.g. PDFs produced +using typesetting of LaTeX, databases populated with documents at an +individual object/paragraph level, making possible granular search (and +related possibilities)) + + + + 32 + + ensuring content can be cited in a meaningful way regardless of +selected output format. Online publishing (and publishing in multiple +document formats) lacks a useful way of citing text internally within +documents (important to academics generally and to lawyers) as page +numbers are meaningless across browsers and formats. sisu seeks to +provide a common way of pinpointing the text within a document, (which can +be utilized for citation and by search engines). 
The outputs share a +common numbering system that is meaningful (to man and machine) across +all digital outputs whether paper, screen, or database oriented, (pdf, +HTML, xml, sqlite, postgresql), this numbering system can be used to +reference content. + + + + 33 + + Granular search within documents. SQL databases are populated at an +object level (roughly headings, paragraphs, verse, tables) and become +searchable with that degree of granularity, the output information +provides the object/paragraph numbers which are relevant across all +generated outputs; it is also possible to look at just the matching +paragraphs of the documents in the database; [output indexing also works +well with search indexing tools like hyperestraier]. + + + + 34 + + long term maintainability of document collections in a world of +changing formats, having a very sparsely marked-up source document +base. there is a considerable degree of future-proofing, output +representations are "upgradeable", and new document formats may be +added. e.g. addition of odf (open document text) module in 2006 and +html5 output sometime in future, without modification of +existing prepared texts + + + + 35 + + SQL search aside, documents are generated as required and static once +generated. 
+ + + + 36 + + documents produced are static files, and may be batch processed, this +needs to be done only once but may be repeated for various reasons as +desired (updated content, addition of new output formats, updated +technology document presentations/representations) + + + + 37 + + document source (plaintext utf-8) if shared on the net may be used as +input and processed locally to produce the different document outputs + + + + 38 + + document source may be bundled together (automatically) with associated +documents (multiple language versions or master document with +inclusions) and images and sent as a zip file called a sisupod, if +shared on the net these too may be processed locally to produce the +desired document outputs + + + + 39 + + generated document outputs may automatically be posted to remote sites. + + + + 40 + + for basic document generation, the only software dependency is +Ruby, and a few standard Unix tools (this covers plaintext, +HTML, XML, ODF, LaTeX). To use a database you of course need that, and +to convert the LaTeX generated to pdf, a latex processor like tetex or +texlive. + + + + 41 + + as a developers tool it is flexible and extensible + + + + 42 + + Syntax highlighting for SiSU markup is available for a number of +text editors. + + + + 43 + + SiSU is less about document layout than about finding a way with +little markup to be able to construct an abstract representation of a +document that makes it possible to produce multiple representations of +it which may be rather different from each other and used for different +purposes, whether layout and publishing, or search of content + + + + 44 + + i.e. 
to be able to take advantage from this minimal preparation +starting point of some of the strengths of rather different established +ways of representing documents for different purposes, whether for +search (relational database, or indexed flat files generated for that +purpose whether of complete documents, or say of files made up of +objects), online viewing (e.g. html, xml, pdf), or paper publication +(e.g. pdf)... + + + + 45 + + the solution arrived at is by extracting structural information about +the document (about headings within the document) and by tracking +objects (which are serialized and also given hash values) in the manner +described. It makes possible representations that are quite different +from those offered at present. For example objects could be saved +individually and identified by their hashes, with an index of how the +objects relate to each other to form a document. + + + + 0 + + Endnotes + + + +