-------------------
Added
^^^^^

* `lukehsiao`_: Add supporting functions for incremental knowledge base
  construction. (`154 <https://github.com/HazyResearch/fonduer/pull/154>`_)
* `j-rausch`_: Add alpha spacy support for a Japanese tokenizer.
* `senwu`_: Add sparse logistic regression support.
* `senwu`_: Support Python 3.7.
* `lukehsiao`_: Allow the user to change featurization settings by providing
  ``.fonduer-config.yaml`` in their project.
* `lukehsiao`_: Add a new Mention object, and have Candidate objects be
  composed of Mention objects, rather than directly of Spans. This allows a
  single Mention to be reused in multiple relations.
* `lukehsiao`_: Improve connection-string validation for the Meta class.

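To give a feel for what connection-string validation involves, the sketch
below uses only Python's standard library. The function name and the specific
checks are hypothetical illustrations, not Fonduer's actual implementation:

```python
from urllib.parse import urlparse

def is_valid_postgres_conn_string(conn_string):
    """Loosely validate a connection string of the form
    postgresql://user:pass@host:port/dbname.

    This is a hypothetical sketch for illustration only; Fonduer's
    real validation logic may differ."""
    parsed = urlparse(conn_string)
    return (
        # Require a PostgreSQL scheme, a host, and a database name.
        parsed.scheme in ("postgres", "postgresql")
        and bool(parsed.hostname)
        and len(parsed.path) > 1  # the path component holds "/dbname"
    )

print(is_valid_postgres_conn_string("postgresql://user:pw@localhost:5432/mydb"))
print(is_valid_postgres_conn_string("sqlite:///local.db"))
```

Validating up front like this surfaces a malformed connection string
immediately, rather than as an opaque driver error at first query time.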
Changed
^^^^^^^

* `j-rausch`_: ``Document.text`` now returns the modified document text, based
  on the user-defined html-tag stripping in the parsing stage.
* `j-rausch`_: ``Ngrams`` now has an ``n_min`` argument to specify a minimum
  number of tokens per extracted n-gram.
* `lukehsiao`_: Rename ``BatchLabelAnnotator`` to ``Labeler`` and
  ``BatchFeatureAnnotator`` to ``Featurizer``. The classes now support multiple
  relations.
* `j-rausch`_: Make the spacy tokenizer the default tokenizer, as long as
  there is (alpha) support for the chosen language. The ``lingual`` argument
  now specifies whether additional spacy NLP processing shall be performed.
* `senwu`_: Reorganize the disc model structure.
  (`126 <https://github.com/HazyResearch/fonduer/pull/126>`_)
* `lukehsiao`_: Add ``session`` and ``parallelism`` as parameters to all UDF
  classes.
* `j-rausch`_: Sentence splitting in lingual mode is now performed by
  spacy's sentencizer instead of the dependency parser. This can lead to
  variations in sentence segmentation and tokenization.
* `j-rausch`_: Add a ``language`` argument to ``Parser`` for specifying the
  language used by ``spacy_parser``, e.g. ``language='en'``.
* `senwu`_: Change the weak supervision learning framework from numbskull to
  `MeTaL <https://github.com/HazyResearch/metal>`_.
  (`119 <https://github.com/HazyResearch/fonduer/pull/119>`_)
* `senwu`_: Change the learning framework from Tensorflow to PyTorch.
  (`115 <https://github.com/HazyResearch/fonduer/pull/115>`_)
* `lukehsiao`_: Blacklist ``<script>`` nodes by default when parsing HTML docs.
* `lukehsiao`_: Reorganize the ReadTheDocs structure to mirror the repository
  structure. Now, each pipeline phase's user-facing API is clearly shown.
* `lukehsiao`_: Rather than importing ambiguously from ``fonduer`` directly,
  disperse imports into their respective pipeline phases. This eliminates
  circular dependencies, and makes it more explicit and clearer to the
  user where each import is originating from.
* `lukehsiao`_: Provide debug logging of external subprocess calls.
* `lukehsiao`_: Use ``tqdm`` for progress bars (including multiprocessing).
* `lukehsiao`_: Set the default PostgreSQL client encoding to "utf8".
* `lukehsiao`_: Organize documentation for ``data_model_utils`` by modality.
  (`85 <https://github.com/HazyResearch/fonduer/pull/85>`_)
* `lukehsiao`_: Rename ``lf_helpers`` to ``data_model_utils``, since they can
  be applied more generally to throttlers or used for error analysis, and are
  not limited to just being used in labeling functions.
* `lukehsiao`_: Update the CHANGELOG to start following `KeepAChangelog
  <https://keepachangelog.com/en/1.0.0/>`_ conventions.

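To illustrate the effect of the ``n_min`` argument on ``Ngrams`` described
above, the following self-contained sketch extracts n-grams within a
``[n_min, n_max]`` window. It is a conceptual illustration, not Fonduer's
actual ``Ngrams`` implementation:

```python
def extract_ngrams(tokens, n_min=1, n_max=3):
    """Yield all token n-grams with n_min <= n <= n_max tokens.

    A minimal sketch of the [n_min, n_max] windowing idea; Fonduer's
    real Ngrams class operates on Sentence objects, not raw lists."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i : i + n])

tokens = ["the", "part", "number"]
# With n_min=2, unigrams are skipped:
print(list(extract_ngrams(tokens, n_min=2, n_max=3)))
# -> ['the part', 'part number', 'the part number']
```

Raising ``n_min`` above 1 is useful when single-token spans are known to be
noise for a given mention type.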
Removed
^^^^^^^

* `lukehsiao`_: Remove the ``XMLMultiDocPreprocessor``.
* `lukehsiao`_: Remove the ``reduce`` option for UDFs, which was unused.
* `lukehsiao`_: Remove the get parent/children/sentence generators from
  ``Context``. (`87 <https://github.com/HazyResearch/fonduer/pull/87>`_)
* `lukehsiao`_: Remove the dependency on ``pdftotree``, which is currently
  unused.

Fixed
^^^^^

* `j-rausch`_: Improve ``spacy_parser`` performance. We split the lingual
  parsing pipeline into two stages. First, we parse structure and gather all
  sentences for a document. Then, we merge and feed all sentences per document
  into the spacy NLP pipeline for more efficient processing.
* `senwu`_: Speed up ``_get_node`` using caching.
* `HiromuHota`_: Fix a bug with Ngram splitting and empty TemporarySpans.
  (`108 <https://github.com/HazyResearch/fonduer/pull/108>`_,
  `112 <https://github.com/HazyResearch/fonduer/pull/112>`_)
* `lukehsiao`_: Fix PDF path validation when using ``visual=True`` during
  parsing.
* `lukehsiao`_: Fix a Meta bug which would not switch databases when ``init()``
  was called with a new connection string.

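The two-stage parsing pattern described in the first Fixed item can be
sketched schematically as follows. The stub functions below stand in for
Fonduer's structural parser and spacy's NLP pipeline; they are illustrative
assumptions, not the actual implementation:

```python
def parse_structure(document):
    """Stage 1 stub: structural parsing that gathers the raw sentences
    of one document (Fonduer's real parser walks the HTML tree)."""
    return [s.strip() for s in document.split(".") if s.strip()]

def nlp_pipe(sentences):
    """Stage 2 stub: process all sentences of a document in one batched
    call, the way spacy's pipeline amortizes per-call overhead."""
    return [s.lower().split() for s in sentences]

doc = "Fonduer parses documents. Sentences are batched."
sentences = parse_structure(doc)  # stage 1: gather all sentences first
tokenized = nlp_pipe(sentences)   # stage 2: one batched NLP call per document
print(tokenized)
```

Batching all sentences of a document into a single pipeline call avoids
paying the pipeline's fixed startup cost once per sentence, which is where
the speedup comes from.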

.. note::
    With the addition of Mentions, the process of Candidate extraction has
    changed. In Fonduer v0.2.3, Candidate extraction was as follows:

    .. code:: python

        candidate_extractor = CandidateExtractor(PartAttr,
                                [part_ngrams, attr_ngrams],
                                [part_matcher, attr_matcher],
                                candidate_filter=candidate_filter)

        candidate_extractor.apply(docs, split=0, parallelism=PARALLEL)

    With this release, you will now first extract Mentions and then extract
    Candidates based on those Mentions:

    .. code:: python

        # Mention Extraction
        part_ngrams = MentionNgramsPart(parts_by_doc=None, n_max=3)
        temp_ngrams = MentionNgramsTemp(n_max=2)
        volt_ngrams = MentionNgramsVolt(n_max=1)

        Part = mention_subclass("Part")
        Temp = mention_subclass("Temp")
        Volt = mention_subclass("Volt")
        mention_extractor = MentionExtractor(
            session,
            [Part, Temp, Volt],
            [part_ngrams, temp_ngrams, volt_ngrams],
            [part_matcher, temp_matcher, volt_matcher],
        )
        mention_extractor.apply(docs, split=0, parallelism=PARALLEL)

        # Candidate Extraction
        PartTemp = candidate_subclass("PartTemp", [Part, Temp])
        PartVolt = candidate_subclass("PartVolt", [Part, Volt])

        candidate_extractor = CandidateExtractor(
            session,
            [PartTemp, PartVolt],
            throttlers=[temp_throttler, volt_throttler]
        )
        candidate_extractor.apply(docs, split=0, parallelism=PARALLEL)

    Furthermore, because Candidates are now composed of Mentions rather than
    directly of Spans, to get the Span object from a mention, use the ``.span``
    attribute of a Mention.


.. note::
    Fonduer has been reorganized to require more explicit import syntax. In
    Fonduer v0.2.3, nearly everything was imported directly from fonduer:

    .. code:: python

        from fonduer import (
            CandidateExtractor,
            DictionaryMatch,
            Document,
            FeatureAnnotator,
            GenerativeModel,
            HTMLDocPreprocessor,
            Intersect,
            LabelAnnotator,
            LambdaFunctionMatcher,
            MentionExtractor,
            Meta,
            Parser,
            RegexMatchSpan,
            Sentence,
            SparseLogisticRegression,
            Union,
            candidate_subclass,
            load_gold_labels,
            mention_subclass,
        )

    With this release, you will now import from each pipeline phase. This makes
    imports more explicit and allows you to more clearly see which pipeline
    phase each import is associated with:

    .. code:: python

        from fonduer import Meta
        from fonduer.candidates import CandidateExtractor, MentionExtractor
        from fonduer.candidates.matchers import (
            DictionaryMatch,
            Intersect,
            LambdaFunctionMatcher,
            RegexMatchSpan,
            Union,
        )
        from fonduer.candidates.models import candidate_subclass, mention_subclass
        from fonduer.features import Featurizer
        from metal.label_model import LabelModel  # GenerativeModel in v0.2.3
        from fonduer.learning import SparseLogisticRegression
        from fonduer.parser import Parser
        from fonduer.parser.models import Document, Sentence
        from fonduer.parser.preprocessors import HTMLDocPreprocessor
        from fonduer.supervision import Labeler, get_gold_labels