Ferenda

Latest version: v0.3.0

0.3.0
===========================
This release adds support for processing things in parallel, both by
using multiple processes on a single machine, and also by running
"build clients" on any number of machines, which run jobs managed by a
central queue.

Parsing of PDF files has been improved by the :py:class:`.PDFReader`
and :py:class:`.PDFAnalyzer` (new) classes. See :ref:`pdfreader`.

In addition, a lot of the included repositories have been
overhauled. The general repos :py:class:`.MediaWiki` and
:py:class:`~ferenda.sources.general.Keyword` should be usable for most
projects by creating a subclass and configuring it.


Backwards-incompatible changes:
-------------------------------
* DocumentRepository and all derived classes now take an optional
first config argument. If present, this should be a LayeredConfig
object that contains the repo configuration. If not provided, a
blank LayeredConfig object is created. All other optional keyword
arguments are then added to the config object. If you have
overridden __init__ for your docrepo, you'll need to make sure to
handle this first argument.
* The Newscriteria class has been removed, and
DocumentRepository.news_criteria with it. The Facet framework is now
used to define news feeds (as well as TOC pages, the ReST API and
fulltext indexing)
* The PDFReader constructor now takes, as first argument, a list of
pdfreader.Page objects. Normally, a client won't have these but must
instead provide a filename of a PDF file through the filename
argument (which used to be the first argument, but must now be
specified as a named argument).
* The getfont() method of pdfreader.Textbox objects used to return a
straight dict of strings, but has now been replaced with a font
property that is a LayeredConfig object with proper typing. Code
like "int(textbox.getfont()['size'])" should now be written as
"textbox.font.size".

New features:
-------------
* The default serialization of Element objects to XHTML now inserts
appropriate dcterms:isPartOf statements when one element with a URI
is contained within another element with another URI. Custom element
classes can change this by changing the partrelation property of the
included document.
* Serialization of Element documents to XHTML now omits namespaces
defined in self.namespaces, but which never actually occur in the
data.
* CitationParser.parse_string and .parse_recursive now have an optional
predicate argument that determines the RDF predicate between the
referring and the referred resources (by default, this is
dcterms:references)
* manager (and by extension ./ferenda-build.py) has new commands that
allow processing jobs in parallel (see Advanced > Parallel
processing)
* The ferenda.sources.general.wiki can now transform mediawiki markup
to Element objects.
* The ferenda.sources.general.keyword can be used to build keyword
hubs from all concepts that your documents point to through a
dcterms:subject property (as well as things in a wiki docrepo, and
configurable other sources).
* The ferenda.sources.legal.se docrepos have been updated generally
and are now close to being able to replicate the function set of
https://lagen.nu/ (which was the main motivation for this codebase
all along).
* ferenda.testutil.assertEqualXML now has a tidy_xhtml argument which
runs the XML documents to be compared through HTML tidy (in XML
mode) in order to produce easier-to-read diffs.
* Transformer now outputs the equivalent xsltproc command if the
environment variable FERENDA_TRANSFORMDEBUG is set.

* The relate() action now uses dependency management to avoid costly
re-indexing if no changes have been made to a document.
* TOC and newsfeed generation now uses dependency management to avoid
re-generating if no changes in the underlying data have occurred.
* Documentation in general has been improved (readers, testing).
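
The dependency management mentioned above can be pictured as a timestamp
comparison between an output file and the files it was built from. The
following is a minimal sketch of that idea only, not Ferenda's actual
implementation; the function name ``is_outdated`` and the file names are
hypothetical.

```python
import os
import tempfile


def is_outdated(outfile, dependencies):
    """Return True if outfile is missing or older than any dependency.

    Hypothetical helper illustrating timestamp-based dependency checks.
    """
    if not os.path.exists(outfile):
        return True
    out_mtime = os.path.getmtime(outfile)
    return any(os.path.getmtime(dep) > out_mtime for dep in dependencies)


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "doc.xhtml")    # hypothetical parsed document
    index = os.path.join(tmp, "doc.indexed")   # hypothetical marker for indexing
    open(source, "w").close()
    missing = is_outdated(index, [source])     # no output exists yet

    open(index, "w").close()
    os.utime(source, (1000, 1000))             # source older than output
    os.utime(index, (2000, 2000))
    fresh = is_outdated(index, [source])

    os.utime(source, (3000, 3000))             # source changed after indexing
    stale = is_outdated(index, [source])
```

Only the ``stale`` and ``missing`` cases would trigger work; an unchanged
document is skipped.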

Infrastructural changes:
------------------------
* Ferenda now uses the CI service Appveyor to automatically run the
entire test suite under Windows on every commit.
* LayeredConfig is now a separate package and not included with
Ferenda. It has been generalized and can take any number of
configuration sources (in the form of object instances) as
initialization arguments. Classes that provide configuration sources
from code defaults, INI files, command line arguments, environment
variables and more are included. It also has two new class methods,
.set and .get.
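
The layering idea can be illustrated with the standard library alone. This
sketch uses ``collections.ChainMap`` as a stand-in and does not use the
LayeredConfig API; the keys, the values and the precedence order (command
line over INI file over code defaults) are assumptions made for the
example.

```python
from collections import ChainMap

# Three hypothetical configuration sources, as plain dicts.
defaults = {"loglevel": "INFO", "datadir": "data"}   # code defaults
inifile = {"datadir": "/var/ferenda"}                # values read from an INI file
commandline = {"loglevel": "DEBUG"}                  # values given on the command line

# ChainMap searches its maps from left to right, so the leftmost source wins.
config = ChainMap(commandline, inifile, defaults)

loglevel = config["loglevel"]   # found in commandline
datadir = config["datadir"]     # found in inifile, which overrides the default
```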

0.2

Backwards-incompatible changes:
-------------------------------
* CompositeRepository.parse now raises ParseError if no subrepository
is able to parse the given basefile.

New features:
-------------
* ferenda.CompositeRepository.parse no longer requires that all
subrepos have storage_policy == "dir".
* Setting ferenda.DocumentStore.config now updates the associated DocumentStore
object with the config.datadir parameter
* New method ferenda.DocumentRepository.construct_sparql_query()
allows for more complex overrides than just setting the
sparql_annotations class attribute.
* New method DocumentRepository.download_is_different() is used to
control whether a newly downloaded resource is semantically
different from a previously downloaded resource (to avoid having
each ASP.Net VIEWSTATE change result in an archived document).
* New method DocumentRepository.parseneeded(): returns True iff
parsing of the document is needed (logic moved from
ferenda.decorators.parseifneeded)
* New class variable ferenda.DocumentRepository.required_predicates:
Controls which predicates are expected to be in the output data
from .parse()
* The method ferenda.DocumentRepository.download_if_needed() now sets both
the If-None-Match and If-Modified-Since HTTP headers.
* The method ferenda.DocumentRepository.render_xhtml() now creates RDFa 1.1
* New 'compress' parameter (can be either empty or "bz2") controls whether
intermediate files are compressed to save space.
* The method ferenda.DocumentStore.path() now takes an extra storage_policy parameter.
* The class ferenda.DocumentStore now stores multiple basefiles in a
single directory even when storage_policy == "dir" for all methods
that cannot handle attachments (like distilled_path,
documententry_path etc)
* New methods ferenda.DocumentStore.open_intermediate(), .serialized_path() and
open_serialized()
* The decorator ferenda.decorators.render (by default called when
calling DocumentRepository.parse()) now serializes the entire
document to JSON, which later can be loaded to recreate the entire
document object tree. Controlled by config parameter serializejson.
* The decorator ferenda.decorators.render now validates that required triples (as
determined by .required_predicates) are present in the output.
* New decorator ferenda.decorators.newstate, used in
ferenda.FSMParser
* The docrepo ferenda.Devel now has a new csvinventory action
* The functions ferenda.Elements.serialize() and .deserialize() now take a format parameter,
which can be either "xml" (default) or "json". The "json" format
allows for full roundtripping of all documents.
* New exception ferenda.errors.NoDownloadedFileError.
* The class ferenda.PDFReader now handles any word processing format
that OpenOffice/LibreOffice can handle, by first using soffice to
convert it to a PDF. It also handles PDFs that consist entirely of
scanned pages without text information, by first running the images
through the tesseract OCR engine. Finally, a new keep_xml parameter
allows for either removing the intermediate XML files or compressing
them using bz2 to save space.
* New method ferenda.PDFReader.is_empty()
* New method ferenda.PDFReader.textboxes() iterates through all
textboxes on all pages. The user can provide a glue function to
automatically concatenate textboxes that should be considered part
of the same paragraph (or other meaningful unit of text).
* New debug method ferenda.PDFReader.drawboxes() can use the same glue
function, and creates a new pdf with all the resulting textboxes
marked up. (Requires PyPDF2 and reportlab, which makes this
particular feature Python 2-only).
* ferenda.PDFReader.Textbox objects can now be added to each other to form
larger Textbox objects.
* ferenda.Transformer now optionally logs the equivalent xsltproc
command line when transforming using XSLT.
* New method ferenda.TripleStore.update(), performs SPARQL
UPDATE/DELETE/DROP/CLEAR queries.
* ferenda.util has new gYearMonth and gYear classes that subclass
datetime.date, but are useful when handling RDF literals that should
have the datatype xsd:gYearMonth (or xsd:gYear)
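
To illustrate the last item, here is a minimal sketch of a
``datetime.date`` subclass that carries only a year and a month. The class
name ``YearMonth`` and its string format are assumptions made for this
example; the real classes in ferenda.util may behave differently.

```python
import datetime


class YearMonth(datetime.date):
    """Hypothetical date subclass that represents a year and month only."""

    def __new__(cls, year, month):
        # datetime.date requires a day; pin it to the first of the month.
        return super().__new__(cls, year, month, 1)

    def __str__(self):
        # Lexical form used by xsd:gYearMonth literals, e.g. "2014-03".
        return "%04d-%02d" % (self.year, self.month)


published = YearMonth(2014, 3)
as_text = str(published)
is_date = isinstance(published, datetime.date)
```

Because the object is still a ``datetime.date``, it can be compared and
sorted together with ordinary dates, while its type tells a serializer
which RDF datatype to emit.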

0.2.0
===========================

This release adds a REST-based HTTP API and includes a lot of
infrastructure to support repo-defined querying and aggregation of
arbitrary document properties. This also led to a generalization of
the TocCriteria class and associated methods, which are now replaced
by the Facet class.

The REST API should be considered an alpha version and is definitely
not stable.

Backwards-incompatible changes:
-------------------------------
* The class TocCriteria and the DocumentRepository methods
toc_predicates, toc_criteria et al have been removed and replaced
with the Facet class and similar methods.
* ferenda.sources.legal.se.direktiv.DirPolopoly and
ferenda.sources.legal.se.propositioner.PropPolo have been renamed to
...DirRegeringen and ...PropRegeringen, respectively.

New features:
-------------
* A REST API enables clients to do faceted querying (i.e. documents whose
properties have specified values), full-text search or combinations.
* Several popular RDF ontologies are included and exposed using the REST
API. A docrepo can include custom RDF ontologies that are used in the
same way. All ontologies used by a docrepo are available as an RDFLib
graph from the .ontologies property.
* Docrepos can include extra/common data that describes things which
your documents refer to, like companies, publishing entities, print
series and abstract things like the topic/keyword of a document. This
information is provided in the form of a RDF graph, which is also
exposed using the REST API. All common data defined for a docrepo is
available as the .commondata property.
* New method DocumentRepository.lookup_resource looks up resource URIs
from the common data using foaf:name labels (or any other RDF
predicate that you might want to use)
* New class Facet and new methods DocumentRepository.facets,
.faceted_data, .facet_query and .facet_select to go with that
class. These replace the TocCriteria class and the methods
DocumentRepository.toc_select, .toc_query, .toc_criteria and
.toc_predicates.
* The WSGI app now provides content negotiation using file extensions as
well as the HTTP Accept header, i.e. requesting
"http://localhost:8000/res/base/123.ttl" gives the same result as
requesting the resource "http://localhost:8000/res/base/123" using the
"Accept: text/turtle" header.
* New exceptions ferenda.errors.SchemaConflictError and .SchemaMappingError.
* The FulltextIndex class now creates a schema in the underlying
fulltext engine based upon the used docrepos, and the facets that
those repos define. The FulltextIndex.update method now takes
arbitrary arguments that are stored as separate fields in the
fulltext index. Similarly, the FulltextIndex.query method now takes
arbitrary arguments that are used to limit the search to only those
documents whose properties match the arguments.
* ferenda.Devel has a new 'destroyindex' action which completely
removes the fulltext index, which might be needed whenever its
schema changes. If you add any new facets, you'll need to run
"./ferenda-build.py devel destroyindex" followed by
"./ferenda-build.py all relate --all --force"
* The docrepos ferenda.sources.tech.RFC and W3Standards have been
updated with their own ontologies and commondata. The result of
parse now creates better RDF, in particular things like
dcterms:creator and dcterms:subject now point to URIs (defined in
commondata) instead of plain string literals.
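
The content negotiation described in the list above (file extension or
Accept header) can be sketched as follows. This is an illustration of the
technique only and not the code used by the WSGI app: the function name
``negotiate``, the extension table and the ``text/html`` fallback are
assumptions, and quality values (``q=``) in the Accept header are ignored.

```python
import posixpath

# Hypothetical mapping from file extension to media type.
EXTENSIONS = {
    ".ttl": "text/turtle",
    ".rdf": "application/rdf+xml",
    ".json": "application/json",
}


def negotiate(path, accept=None, default="text/html"):
    """Return (resource_path, media_type) for a request.

    A known file extension wins; otherwise the first supported media type
    listed in the Accept header is used; otherwise the default applies.
    """
    root, ext = posixpath.splitext(path)
    if ext in EXTENSIONS:
        return root, EXTENSIONS[ext]
    if accept:
        for part in accept.split(","):
            media_type = part.split(";")[0].strip()
            if media_type in EXTENSIONS.values():
                return path, media_type
    return path, default


by_extension = negotiate("/res/base/123.ttl")
by_header = negotiate("/res/base/123", accept="text/turtle")
fallback = negotiate("/res/base/123")
```

Both ``by_extension`` and ``by_header`` resolve to the same resource and
media type, which mirrors the equivalence described for the ``.ttl``
example.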

Infrastructural changes:
------------------------
* cssmin is no longer bundled within ferenda. Instead it's marked as a
dependency so that pip/easy_install automatically downloads it from
PyPI.
* The prefix for DCMI Metadata Terms has been changed from "dct" to
"dcterms" in all code and documentation.
* testutil now has a Py23DocChecker that can be used with
doctest.DocTestSuite() to enable single-source doctests that work
with both Python 2 and 3.
* New method ferenda.util.json_default_date, usable as the default
argument of json.dump to serialize datetime objects into JSON
strings.
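
A function passed as the ``default`` argument of ``json.dump`` is called
for every object the encoder cannot serialize by itself. The sketch below
shows the general pattern with a hypothetical ``default_date`` function;
it is not the source of ferenda.util.json_default_date, and the ISO 8601
output format is an assumption.

```python
import datetime
import json


def default_date(obj):
    """Hypothetical json 'default' hook that encodes dates as ISO 8601 strings."""
    # datetime.datetime is a subclass of datetime.date, so both are covered.
    if isinstance(obj, datetime.date):
        return obj.isoformat()
    raise TypeError("%r is not JSON serializable" % (obj,))


encoded = json.dumps({"published": datetime.date(2014, 3, 1)},
                     default=default_date)
```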

0.1.7
===========================

This release mainly updates the Swedish legal sources, which now do
a decent job of downloading and parsing a variety of legal
information. During the course of that work, a number of changes
needed to be made to the core of ferenda. The release is still a part
of the 0.1 series because the REST API isn't done yet (once it's in,

0.1.6.1
=============================

This hotfix release corrected an error in setup.py that prevented
installs when using Python 3.

0.1.6
===========================

This release mainly contains bug fixes and development infrastructure
changes. 95% of the main code base is covered by the unit test
suite, and the examples featured in the documentation are now
automatically tested as well. Whenever discrepancies between the map
(documentation) and reality (code) have been found, reality has been
adjusted to be in accordance with the map.

The default HTML5 theme has also been updated, and should scale nicely
across screen widths ranging from mobile phones in portrait mode to
wide-screen desktops. The various bundled css and js files have been
upgraded to their most recent versions.

Backwards-incompatible changes:
-------------------------------
* The DocumentStore.open_generated method was removed as no one was
using it.

* The (non-documented) modules legalref and legaluri, which were
specific to Swedish legal references, have been moved into the
ferenda.sources.legal.se namespace.

* The (non-documented) feature where CSS files specified in the
configuration could be in SCSS format, and automatically
compiled/transformed, has been removed, since the library used
(pyScss) currently has problems on the Python 3 platform.

New features:
-------------
* The :meth:`ferenda.Devel.mkpatch` command now actually works.

* The `republishsource` configuration parameter is now available, and
controls whether your Atom feeds link to the original document file
as it was fetched from the source, or to the parsed version. See
:ref:`configuration`.

* The entire RDF dataset for a particular docrepo is now available
through the ReST API in various formats using the same content
negotiation mechanisms as the documents themselves. See :doc:`wsgi`.

* ferenda-setup now auto-configures ``indextype`` (and checks whether
ElasticSearch is available, before falling back to Whoosh) in
addition to ``storetype``.
