Fonduer

Latest version: v0.8.3


0.6.2
-------------------

Fixed
^^^^^
* `lukehsiao`_: Fix Meta initialization bug which would configure logging
  upon import rather than allowing the user to configure logging themselves.
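
As an illustration of why this matters, here is a minimal sketch using only the standard library. The logger name ``mylib`` and the helper ``configure_app_logging`` are hypothetical stand-ins, not Fonduer's actual setup. The library side only attaches a ``NullHandler``, and the application decides the handler, format, and level:

```python
import io
import logging

# Hypothetical stand-in for a library logger; not Fonduer's actual logger setup.
lib_logger = logging.getLogger("mylib")
lib_logger.addHandler(logging.NullHandler())  # the library stays silent by default


def configure_app_logging(stream, level=logging.INFO):
    """Application-side configuration: the user picks handler, format, and level."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(levelname)s] %(name)s - %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)
    return handler


# The application configures logging, then the library's records flow through it.
buffer = io.StringIO()
handler = configure_app_logging(buffer)
lib_logger.info("parsing started")
logging.getLogger().removeHandler(handler)  # clean up the handler added above
```

Because nothing is configured at import time, importing the library never overrides handlers the application has already installed.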

0.6.1
-------------------

Added
^^^^^
* `senwu`_: Update the spaCy version to v2.1.x.
* `lukehsiao`_: Provide ``fonduer.init_logging()`` as a way to configure
  logging to a temp directory by default.

.. note::

    Although you can still configure ``logging`` manually, with this change
    we also provide a function for initializing logging. For example, you
    can call:

    .. code:: python

        import logging
        import fonduer

        # Optionally configure logging
        fonduer.init_logging(
            log_dir="log_folder",
            format="[%(asctime)s][%(levelname)s] %(name)s:%(lineno)s - %(message)s",
            level=logging.INFO
        )

        session = fonduer.Meta.init(conn_string).Session()

    which will create logs within the ``log_folder`` directory. If logging is
    not explicitly initialized, we will provide a default configuration which
    will store logs in a temporary directory.

Changed
^^^^^^^
* `senwu`_: Update the whole logging strategy.

.. note::
    For the whole logging strategy:

    With this change, the running log is stored in ``fonduer.log`` in the
    ``{fonduer.Meta.log_path}/{datetime}`` folder. Users can specify the
    location using ``fonduer.init_logging()``. This folder also contains the
    learning logs.

    For the learning logging strategy:

    Previously, model checkpoints were stored in the user-provided folder
    given by ``save_dir``, and each checkpoint was named
    ``{model_name}.mdl.ckpt.{global_step}``.

    With this change, the model is saved in a subfolder of the same folder
    (``fonduer.Meta.log_path``) as the log file. Each learning run creates a
    subfolder named ``{datetime}_{model_name}`` containing all model
    checkpoints and the TensorBoard log file. To use TensorBoard to check the
    learning curve, run ``tensorboard --logdir LOG_FOLDER``.
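
The folder layout described in this note can be sketched as follows. This is an illustration only: the helper names are hypothetical, the timestamp format is an assumption (the changelog does not specify it), and placing the learning subfolder next to ``fonduer.log`` is one reading of the note rather than a confirmed detail of the implementation.

```python
from datetime import datetime
from pathlib import PurePosixPath

STAMP = "%Y-%m-%d_%H-%M-%S"  # assumed timestamp format, for illustration only


def run_dir(log_root, started):
    """Folder for one run: {log_root}/{datetime}."""
    return PurePosixPath(log_root) / started.strftime(STAMP)


def running_log(log_root, started):
    """Running log file inside the run folder."""
    return run_dir(log_root, started) / "fonduer.log"


def learning_dir(log_root, started, trained_at, model_name):
    """Per-learning-run subfolder named {datetime}_{model_name}."""
    return run_dir(log_root, started) / f"{trained_at.strftime(STAMP)}_{model_name}"


started = datetime(2019, 3, 1, 9, 0, 0)
trained_at = datetime(2019, 3, 1, 9, 30, 0)
log_file = running_log("log_folder", started)
ckpt_dir = learning_dir("log_folder", started, trained_at, "LSTM")
```

Under these assumptions, pointing ``tensorboard --logdir`` at the run folder would pick up every learning subfolder created during that run.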

Fixed
^^^^^
* `senwu`_: Change the exception condition to make sure the parser runs end
  to end.
* `lukehsiao`_: Fix parser error when text was located in the ``tail`` of an
  LXML table node.
* `HiromuHota`_: Store lemmas and pos_tags in case they are returned from a
  tokenizer.
* `HiromuHota`_: Use unidic instead of ipadic for Japanese.
  (`231 <https://github.com/HazyResearch/fonduer/issues/231>`_)
* `senwu`_: Use mecab-python3 version 0.7 for Japanese tokenization since
  spaCy only supports version 0.7.
* `HiromuHota`_: Use black 18.9b0 or higher to be consistent with isort.
  (`225 <https://github.com/HazyResearch/fonduer/issues/225>`_)
* `HiromuHota`_: Workaround no longer required for Japanese as of spaCy v2.1.0.
  (`224 <https://github.com/HazyResearch/fonduer/pull/224>`_)
* `senwu`_: Update the metal version.
* `senwu`_: Expose the ``b`` and ``pos_label`` in training.
* `senwu`_: Fix the issue that pdfinfo causes a parsing error when it contains
  more than one ``Page``.

0.6.0
-------------------

Changed
^^^^^^^
* `lukehsiao`_: Improved performance of ``data_model_utils`` through caching
  and simplifying the underlying queries.
  (`212 <https://github.com/HazyResearch/fonduer/pull/212>`_,
  `215 <https://github.com/HazyResearch/fonduer/pull/215>`_)
* `senwu`_: Upgrade to PyTorch v1.0.0.
  (`209 <https://github.com/HazyResearch/fonduer/pull/209>`_)

Removed
^^^^^^^
* `lukehsiao`_: Removed the redundant ``get_gold_labels`` function.

.. note::

    Rather than calling ``get_gold_labels`` directly, call it from the Labeler:

    .. code:: python

        from fonduer.supervision import Labeler
        labeler = Labeler(session, [relations])
        L_gold_train = labeler.get_gold_labels(train_cands, annotator='gold')

    Rather than:

    .. code:: python

        from fonduer.supervision import Labeler, get_gold_labels
        labeler = Labeler(session, [relations])
        L_gold_train = get_gold_labels(session, train_cands, annotator_name='gold')

Fixed
^^^^^
* `senwu`_: Improve type checking in featurization.
* `lukehsiao`_: Fixed ``sentence.sentence_num`` bug in
  ``get_neighbor_sentence_ngrams``.
* `lukehsiao`_: Add session synchronization to sqlalchemy delete queries.
  (`214 <https://github.com/HazyResearch/fonduer/pull/214>`_)
* `lukehsiao`_: Update PyYAML dependency to patch CVE-2017-18342.
  (`205 <https://github.com/HazyResearch/fonduer/pull/205>`_)
* `KenSugimoto`_: Fix max/min in ``visualizer.get_box``

0.5.0
-------------------

Added
^^^^^
* `senwu`_: Support CSV, TSV, and Text input data formats.
  For CSV format, ``CSVDocPreprocessor`` treats each line in the input file as
  a document. By default, it assumes that each column is one section and that
  the content of each column is one paragraph. However, if a column is
  complex, an advanced parser may be used by specifying the ``parser_rule``
  parameter as a dict whose keys are column indices and whose values are the
  parsers for those columns.

.. note::

    In Fonduer v0.5.0, you can use ``CSVDocPreprocessor``:

    .. code:: python

        from functools import partial

        from fonduer.parser import Parser
        from fonduer.parser.preprocessors import CSVDocPreprocessor
        from fonduer.utils.utils_parser import column_constructor

        max_docs = 10

        # Define a specific parser for the third column (index 2), which takes
        # ``text``, ``name=None``, ``type="text"``, and ``delim=None`` as input
        # and generates ``(content type, content name, content)`` for
        # ``build_node`` in ``fonduer.utils.utils_parser``.
        parser_rule = {
            2: partial(column_constructor, type="figure"),
        }

        doc_preprocessor = CSVDocPreprocessor(
            PATH_TO_DOCS, max_docs=max_docs, header=True, parser_rule=parser_rule
        )

        corpus_parser = Parser(session, structural=True, lingual=True, visual=False)
        corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

        all_docs = corpus_parser.get_documents()

For TSV format, ``TSVDocPreprocessor`` treats each line in the input file as a
document, which should follow the (doc_name <tab> doc_text) format.

For Text format, ``TextDocPreprocessor`` assumes one document per file.
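
To make the TSV layout concrete, here is a minimal standard-library sketch of reading ``doc_name <tab> doc_text`` lines. The function ``iter_tsv_documents`` is a hypothetical helper written for illustration; it is not how ``TSVDocPreprocessor`` is implemented, and the ``max_docs`` parameter merely mirrors the argument of the same name used above.

```python
import csv
import io


def iter_tsv_documents(handle, max_docs=None):
    """Yield (doc_name, doc_text) pairs, one document per input line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for count, row in enumerate(reader):
        if max_docs is not None and count >= max_docs:
            break
        if len(row) != 2:
            raise ValueError(
                f"line {count + 1}: expected 2 tab-separated fields, got {len(row)}"
            )
        yield row[0], row[1]


# Two documents, each on its own line, in (doc_name <tab> doc_text) form.
sample = io.StringIO("doc1\tFirst document text.\ndoc2\tSecond document text.\n")
docs = list(iter_tsv_documents(sample, max_docs=10))
```

Rejecting lines that do not have exactly two fields is a design choice of this sketch; a real preprocessor might skip or log such lines instead.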

Changed
^^^^^^^
* `senwu`_: Reorganize ``learning`` module to use pytorch dataloader, include
  ``MultiModalDataset`` to better handle multimodal information, and simplify
  the code
* `senwu`_: Remove ``batch_size`` input argument from ``_calc_logits``,
  ``marginals``, ``predict``, and ``score`` in ``Classifier``
* `senwu`_: Rename ``predictions`` to ``predict`` in ``Classifier`` and update
  the input arguments to have ``pos_label`` (assign positive label for binary
  class prediction) and ``return_probs`` (if True, return predicted
  probabilities as well)
* `senwu`_: Update ``score`` function in ``Classifier`` to include:
  (1) For binary: precision, recall, F-beta score, accuracy, ROC-AUC score;
  (2) For categorical: accuracy;
* `senwu`_: Remove ``LabelBalancer``
* `senwu`_: Remove original ``Classifier`` class, rename ``NoiseAwareModel`` to
  ``Classifier`` and use the same setting for both binary and multi-class
  classifiers
* `senwu`_: Unify the loss (``SoftCrossEntropyLoss``) for all settings
* `senwu`_: Rename ``layers`` in learning module to ``modules``
* `senwu`_: Update code to use Python 3.6+'s f-strings
* `HiromuHota`_: Reattach doc with the current session at
  ``MentionExtractorUDF#apply`` to avoid doing so at each MentionSpace.
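
For readers unfamiliar with the binary metrics listed for ``score`` above, the following is a minimal pure-Python sketch of precision, recall, F-beta, and accuracy. It is not Fonduer's implementation: the function ``binary_scores`` and its ``beta`` parameter are hypothetical names chosen for illustration, and ROC-AUC is omitted for brevity.

```python
def binary_scores(gold, pred, pos_label=1, beta=1.0):
    """Compute precision, recall, F-beta, and accuracy for binary labels."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == pos_label and p == pos_label)
    fp = sum(1 for g, p in pairs if g != pos_label and p == pos_label)
    fn = sum(1 for g, p in pairs if g == pos_label and p != pos_label)
    correct = sum(1 for g, p in pairs if g == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    # F-beta weights recall beta times as much as precision.
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
        "accuracy": accuracy,
    }


gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
scores = binary_scores(gold, pred)
```

With ``beta=1.0`` the F-beta score reduces to the usual F1 score; larger values of ``beta`` favor recall.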

Fixed
^^^^^
* `HiromuHota`_: Modify docstring of functions that return get_sparse_matrix
* `lukehsiao`_: Fix the behavior of ``get_last_documents`` to return Documents
  that are correctly linked to the database and can be navigated by the user.
  (`201 <https://github.com/HazyResearch/fonduer/pull/201>`_)
* `lukehsiao`_: Fix the behavior of MentionExtractor ``clear`` and
  ``clear_all`` to also delete the Candidates that correspond to the Mentions.

0.4.1
-------------------

Added
^^^^^
* `senwu`_: Added alpha spacy support for Chinese tokenizer.

Changed
^^^^^^^
* `lukehsiao`_: Add soft version pinning to avoid failures due to dependency
  API changes.
* `j-rausch`_: Change ``get_row_ngrams`` and ``get_col_ngrams`` to return
  ``None`` if the passed ``Mention`` argument is not inside a table.
  (`194 <https://github.com/HazyResearch/fonduer/pull/194>`_)

Fixed
^^^^^
* `senwu`_: Fix nondeterministic results from ``get_candidates`` and
  ``get_mentions`` caused by parallel candidate/mention generation.

0.4.0
-------------------

Added
^^^^^
* `senwu`_: Rename ``span`` attribute to ``context`` in mention_subclass to
  better support multimodal mentions.
  (`184 <https://github.com/HazyResearch/fonduer/pull/184>`_)

.. note::
    The way to retrieve the corresponding data model object from a mention
    changed. In Fonduer v0.3.6, we use ``.span``:

    .. code:: python

        # sent_mention is a SentenceMention
        sentence = sent_mention.span.sentence

    With this release, we use ``.context``:

    .. code:: python

        # sent_mention is a SentenceMention
        sentence = sent_mention.context.sentence
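
Code that has to run against both versions could bridge the rename with a small accessor. The sketch below is hypothetical: ``mention_context`` is not part of Fonduer, and the two stand-in classes exist only to make the example self-contained.

```python
def mention_context(mention):
    """Return the data model object behind a mention on either API version."""
    if hasattr(mention, "context"):
        return mention.context  # attribute name as of this release
    return mention.span  # attribute name in v0.3.6


class _OldMention:
    """Stand-in for a v0.3.6-style mention exposing ``.span``."""

    def __init__(self, span):
        self.span = span


class _NewMention:
    """Stand-in for a mention exposing ``.context``."""

    def __init__(self, context):
        self.context = context


old_result = mention_context(_OldMention("span-object"))
new_result = mention_context(_NewMention("context-object"))
```

Checking for ``context`` first means the accessor prefers the newer attribute whenever both are present.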

* `senwu`_: Add support to extract multimodal candidates and add
  ``DoNothingMatcher`` matcher.
  (`184 <https://github.com/HazyResearch/fonduer/pull/184>`_)

.. note::
    Mention extraction now supports all data types in the data model. In
    Fonduer v0.3.6, Mention extraction only supports ``MentionNgrams`` and
    ``MentionFigures``:

    .. code:: python

        from fonduer.candidates import (
            MentionFigures,
            MentionNgrams,
        )

    With this release, it supports all data types:

    .. code:: python

        from fonduer.candidates import (
            MentionCaptions,
            MentionCells,
            MentionDocuments,
            MentionFigures,
            MentionNgrams,
            MentionParagraphs,
            MentionSections,
            MentionSentences,
            MentionTables,
        )

* `senwu`_: Add support to parse multiple sections in parser, fix webpage
  context, and add name column for each context in data model.
  (`182 <https://github.com/HazyResearch/fonduer/pull/182>`_)

Fixed
^^^^^
* `senwu`_: Remove unnecessary backref in mention generation.
* `j-rausch`_: Improve error handling for invalid row spans.
  (`183 <https://github.com/HazyResearch/fonduer/pull/183>`_)

