-------------------
Added
^^^^^

* `lukehsiao`_: Add supporting functions for incremental knowledge base
  construction. (`154 <https://github.com/HazyResearch/fonduer/pull/154>`_)
* `j-rausch`_: Add alpha spacy support for a Japanese tokenizer.
* `senwu`_: Add sparse logistic regression support.
* `senwu`_: Support Python 3.7.
* `lukehsiao`_: Allow the user to change featurization settings by providing
  ``.fonduer-config.yaml`` in their project.
* `lukehsiao`_: Add a new Mention object, and have Candidate objects be
  composed of Mention objects, rather than directly of Spans. This allows a
  single Mention to be reused in multiple relations.
* `lukehsiao`_: Improve connection-string validation for the Meta class.

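To give a feel for what connection-string validation involves, the sketch
below uses only Python's standard library. The function name and the specific
checks are hypothetical illustrations, not Fonduer's actual implementation:

```python
from urllib.parse import urlparse

def is_valid_postgres_conn_string(conn_string):
    """Loosely validate a connection string of the form
    postgresql://user:pass@host:port/dbname.

    This is a hypothetical sketch for illustration only; Fonduer's
    real validation logic may differ."""
    parsed = urlparse(conn_string)
    return (
        # Require a PostgreSQL scheme, a host, and a database name.
        parsed.scheme in ("postgres", "postgresql")
        and bool(parsed.hostname)
        and len(parsed.path) > 1  # the path component holds "/dbname"
    )

print(is_valid_postgres_conn_string("postgresql://user:pw@localhost:5432/mydb"))
print(is_valid_postgres_conn_string("sqlite:///local.db"))
```

Validating up front like this surfaces a malformed connection string
immediately, rather than as an opaque driver error at first query time.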
Changed
^^^^^^^

* `j-rausch`_: ``Document.text`` now returns the modified document text, based
  on the user-defined html-tag stripping in the parsing stage.
* `j-rausch`_: ``Ngrams`` now has an ``n_min`` argument to specify a minimum
  number of tokens per extracted n-gram.
* `lukehsiao`_: Rename ``BatchLabelAnnotator`` to ``Labeler`` and
  ``BatchFeatureAnnotator`` to ``Featurizer``. The classes now support multiple
  relations.
* `j-rausch`_: Make the spacy tokenizer the default tokenizer, as long as
  there is (alpha) support for the chosen language. The ``lingual`` argument
  now specifies whether additional spacy NLP processing shall be performed.
* `senwu`_: Reorganize the disc model structure.
  (`126 <https://github.com/HazyResearch/fonduer/pull/126>`_)
* `lukehsiao`_: Add ``session`` and ``parallelism`` as parameters to all UDF
  classes.
* `j-rausch`_: Sentence splitting in lingual mode is now performed by
  spacy's sentencizer instead of the dependency parser. This can lead to
  variations in sentence segmentation and tokenization.
* `j-rausch`_: Add a ``language`` argument to ``Parser`` for specifying the
  language used by ``spacy_parser``, e.g. ``language='en'``.
* `senwu`_: Change the weak supervision learning framework from numbskull to
  `MeTaL <https://github.com/HazyResearch/metal>`_.
  (`119 <https://github.com/HazyResearch/fonduer/pull/119>`_)
* `senwu`_: Change the learning framework from Tensorflow to PyTorch.
  (`115 <https://github.com/HazyResearch/fonduer/pull/115>`_)
* `lukehsiao`_: Blacklist ``<script>`` nodes by default when parsing HTML docs.
* `lukehsiao`_: Reorganize the ReadTheDocs structure to mirror the repository
  structure. Now, each pipeline phase's user-facing API is clearly shown.
* `lukehsiao`_: Rather than importing ambiguously from ``fonduer`` directly,
  disperse imports into their respective pipeline phases. This eliminates
  circular dependencies, and makes it more explicit and clearer to the
  user where each import is originating from.
* `lukehsiao`_: Provide debug logging of external subprocess calls.
* `lukehsiao`_: Use ``tqdm`` for progress bars (including multiprocessing).
* `lukehsiao`_: Set the default PostgreSQL client encoding to "utf8".
* `lukehsiao`_: Organize documentation for ``data_model_utils`` by modality.
  (`85 <https://github.com/HazyResearch/fonduer/pull/85>`_)
* `lukehsiao`_: Rename ``lf_helpers`` to ``data_model_utils``, since they can
  be applied more generally to throttlers or used for error analysis, and are
  not limited to just being used in labeling functions.
* `lukehsiao`_: Update the CHANGELOG to start following `KeepAChangelog
  <https://keepachangelog.com/en/1.0.0/>`_ conventions.

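To illustrate the effect of the ``n_min`` argument on ``Ngrams`` described
above, the following self-contained sketch extracts n-grams within a
``[n_min, n_max]`` window. It is a conceptual illustration, not Fonduer's
actual ``Ngrams`` implementation:

```python
def extract_ngrams(tokens, n_min=1, n_max=3):
    """Yield all token n-grams with n_min <= n <= n_max tokens.

    A minimal sketch of the [n_min, n_max] windowing idea; Fonduer's
    real Ngrams class operates on Sentence objects, not raw lists."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i : i + n])

tokens = ["the", "part", "number"]
# With n_min=2, unigrams are skipped:
print(list(extract_ngrams(tokens, n_min=2, n_max=3)))
# -> ['the part', 'part number', 'the part number']
```

Raising ``n_min`` above 1 is useful when single-token spans are known to be
noise for a given mention type.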
Removed
^^^^^^^

* `lukehsiao`_: Remove the ``XMLMultiDocPreprocessor``.
* `lukehsiao`_: Remove the ``reduce`` option for UDFs, which was unused.
* `lukehsiao`_: Remove the get parent/children/sentence generators from
  ``Context``. (`87 <https://github.com/HazyResearch/fonduer/pull/87>`_)
* `lukehsiao`_: Remove the dependency on ``pdftotree``, which is currently
  unused.

Fixed
^^^^^

* `j-rausch`_: Improve ``spacy_parser`` performance. We split the lingual
  parsing pipeline into two stages. First, we parse structure and gather all
  sentences for a document. Then, we merge and feed all sentences per document
  into the spacy NLP pipeline for more efficient processing.
* `senwu`_: Speed up ``_get_node`` using caching.
* `HiromuHota`_: Fix a bug with Ngram splitting and empty TemporarySpans.
  (`108 <https://github.com/HazyResearch/fonduer/pull/108>`_,
  `112 <https://github.com/HazyResearch/fonduer/pull/112>`_)
* `lukehsiao`_: Fix PDF path validation when using ``visual=True`` during
  parsing.
* `lukehsiao`_: Fix a Meta bug which would not switch databases when ``init()``
  was called with a new connection string.

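The two-stage parsing pattern described in the first Fixed item can be
sketched schematically as follows. The stub functions below stand in for
Fonduer's structural parser and spacy's NLP pipeline; they are illustrative
assumptions, not the actual implementation:

```python
def parse_structure(document):
    """Stage 1 stub: structural parsing that gathers the raw sentences
    of one document (Fonduer's real parser walks the HTML tree)."""
    return [s.strip() for s in document.split(".") if s.strip()]

def nlp_pipe(sentences):
    """Stage 2 stub: process all sentences of a document in one batched
    call, the way spacy's pipeline amortizes per-call overhead."""
    return [s.lower().split() for s in sentences]

doc = "Fonduer parses documents. Sentences are batched."
sentences = parse_structure(doc)  # stage 1: gather all sentences first
tokenized = nlp_pipe(sentences)   # stage 2: one batched NLP call per document
print(tokenized)
```

Batching all sentences of a document into a single pipeline call avoids
paying the pipeline's fixed startup cost once per sentence, which is where
the speedup comes from.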

.. note::
    With the addition of Mentions, the process of Candidate extraction has
    changed. In Fonduer v0.2.3, Candidate extraction was as follows:

    .. code:: python

        candidate_extractor = CandidateExtractor(PartAttr,
                                [part_ngrams, attr_ngrams],
                                [part_matcher, attr_matcher],
                                candidate_filter=candidate_filter)

        candidate_extractor.apply(docs, split=0, parallelism=PARALLEL)

    With this release, you will now first extract Mentions and then extract
    Candidates based on those Mentions:

    .. code:: python

        # Mention Extraction
        part_ngrams = MentionNgramsPart(parts_by_doc=None, n_max=3)
        temp_ngrams = MentionNgramsTemp(n_max=2)
        volt_ngrams = MentionNgramsVolt(n_max=1)

        Part = mention_subclass("Part")
        Temp = mention_subclass("Temp")
        Volt = mention_subclass("Volt")
        mention_extractor = MentionExtractor(
            session,
            [Part, Temp, Volt],
            [part_ngrams, temp_ngrams, volt_ngrams],
            [part_matcher, temp_matcher, volt_matcher],
        )
        mention_extractor.apply(docs, split=0, parallelism=PARALLEL)

        # Candidate Extraction
        PartTemp = candidate_subclass("PartTemp", [Part, Temp])
        PartVolt = candidate_subclass("PartVolt", [Part, Volt])

        candidate_extractor = CandidateExtractor(
            session,
            [PartTemp, PartVolt],
            throttlers=[temp_throttler, volt_throttler]
        )
        candidate_extractor.apply(docs, split=0, parallelism=PARALLEL)

    Furthermore, because Candidates are now composed of Mentions rather than
    directly of Spans, to get the Span object from a mention, use the ``.span``
    attribute of a Mention.


.. note::
    Fonduer has been reorganized to require more explicit import syntax. In
    Fonduer v0.2.3, nearly everything was imported directly from fonduer:

    .. code:: python

        from fonduer import (
            CandidateExtractor,
            DictionaryMatch,
            Document,
            FeatureAnnotator,
            GenerativeModel,
            HTMLDocPreprocessor,
            Intersect,
            LabelAnnotator,
            LambdaFunctionMatcher,
            MentionExtractor,
            Meta,
            Parser,
            RegexMatchSpan,
            Sentence,
            SparseLogisticRegression,
            Union,
            candidate_subclass,
            load_gold_labels,
            mention_subclass,
        )

    With this release, you will now import from each pipeline phase. This makes
    imports more explicit and allows you to more clearly see which pipeline
    phase each import is associated with:

    .. code:: python

        from fonduer import Meta
        from fonduer.candidates import CandidateExtractor, MentionExtractor
        from fonduer.candidates.matchers import (
            DictionaryMatch,
            Intersect,
            LambdaFunctionMatcher,
            RegexMatchSpan,
            Union,
        )
        from fonduer.candidates.models import candidate_subclass, mention_subclass
        from fonduer.features import Featurizer
        from metal.label_model import LabelModel  # GenerativeModel in v0.2.3
        from fonduer.learning import SparseLogisticRegression
        from fonduer.parser import Parser
        from fonduer.parser.models import Document, Sentence
        from fonduer.parser.preprocessors import HTMLDocPreprocessor
        from fonduer.supervision import Labeler, get_gold_labels