-------------------
Added
^^^^^
* `senwu`_: Support CSV, TSV, and Text input data formats.
  For CSV format, ``CSVDocPreprocessor`` treats each line in the input file as
  a document. By default, it assumes that each column is one section and that
  the content of each column is one paragraph. If a column is more complex, an
  advanced parser can be used by passing the ``parser_rule`` parameter as a
  dict whose keys are column indices and whose values are the parsers to apply.

.. note::

    In Fonduer v0.5.0, you can use ``CSVDocPreprocessor``:

    .. code:: python

        from functools import partial

        from fonduer.parser import Parser
        from fonduer.parser.preprocessors import CSVDocPreprocessor
        from fonduer.utils.utils_parser import column_constructor

        max_docs = 10

        # Define a specific parser for the third column (index 2). It takes
        # ``text``, ``name=None``, ``type="text"``, and ``delim=None`` as input
        # and generates ``(content type, content name, content)`` for
        # ``build_node`` in ``fonduer.utils.utils_parser``.
        parser_rule = {
            2: partial(column_constructor, type="figure"),
        }

        # PATH_TO_DOCS, session, and PARALLEL must be defined by the caller.
        doc_preprocessor = CSVDocPreprocessor(
            PATH_TO_DOCS, max_docs=max_docs, header=True, parser_rule=parser_rule
        )

        corpus_parser = Parser(session, structural=True, lingual=True, visual=False)
        corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

        all_docs = corpus_parser.get_documents()

For TSV format, ``TSVDocPreprocessor`` treats each line in the input file as a
document, and each line should follow the (doc_name <tab> doc_text) format.

For Text format, ``TextDocPreprocessor`` assumes one document per file.

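
As a minimal sketch of the TSV layout described above (not Fonduer code), the
following standard-library snippet writes and re-reads a file with one
(doc_name <tab> doc_text) pair per line. The sample documents and the helper
names ``write_tsv`` and ``read_tsv`` are hypothetical and exist only for
illustration:

.. code:: python

    import tempfile
    from pathlib import Path

    # Hypothetical sample data: (doc_name, doc_text) pairs.
    docs = [
        ("doc_001", "The first document fits on a single line."),
        ("doc_002", "The second document does too."),
    ]


    def write_tsv(path, rows):
        """Write one document per line as doc_name <tab> doc_text."""
        with open(path, "w", encoding="utf-8") as f:
            for name, text in rows:
                # Tabs or newlines inside a field would break the layout.
                if any(c in field for field in (name, text) for c in "\t\n"):
                    raise ValueError(f"field contains a tab or newline: {name!r}")
                f.write(f"{name}\t{text}\n")


    def read_tsv(path):
        """Yield (doc_name, doc_text) tuples, one per input line."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                name, text = line.rstrip("\n").split("\t", 1)
                yield name, text


    with tempfile.TemporaryDirectory() as tmp_dir:
        tsv_path = Path(tmp_dir) / "docs.tsv"
        write_tsv(tsv_path, docs)
        parsed = list(read_tsv(tsv_path))

A file laid out this way matches the one-document-per-line format that
``TSVDocPreprocessor`` is described as expecting; how tabs inside the document
text are handled is not stated here, so the sketch simply rejects them.
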
Changed
^^^^^^^
* `senwu`_: Reorganize ``learning`` module to use the PyTorch dataloader,
  include ``MultiModalDataset`` to better handle multimodal information, and
  simplify the code
* `senwu`_: Remove ``batch_size`` input argument from ``_calc_logits``,
  ``marginals``, ``predict``, and ``score`` in ``Classifier``
* `senwu`_: Rename ``predictions`` to ``predict`` in ``Classifier`` and update
  the input arguments to include ``pos_label`` (assigns the positive label for
  binary class prediction) and ``return_probs`` (if True, return the predicted
  probabilities as well)
* `senwu`_: Update ``score`` function in ``Classifier`` to include:
  (1) for binary: precision, recall, F-beta score, accuracy, ROC-AUC score;
  (2) for categorical: accuracy
* `senwu`_: Remove ``LabelBalancer``
* `senwu`_: Remove original ``Classifier`` class, rename ``NoiseAwareModel`` to
  ``Classifier``, and use the same setting for both binary and multi-class
  classifiers
* `senwu`_: Unify the loss (``SoftCrossEntropyLoss``) for all settings
* `senwu`_: Rename ``layers`` in learning module to ``modules``
* `senwu`_: Update code to use Python 3.6+'s f-strings
* `HiromuHota`_: Reattach doc with the current session at
  ``MentionExtractorUDF.apply`` to avoid doing so at each ``MentionSpace``.

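
The binary metrics listed for ``score`` above can be illustrated with a small,
self-contained sketch. This is not Fonduer's implementation: the function name
``binary_scores``, its signature, and the sample labels are assumptions made
for illustration, and ROC-AUC is left out because it needs predicted
probabilities rather than hard labels.

.. code:: python

    def binary_scores(gold, pred, pos_label=1, beta=1.0):
        """Return precision, recall, F-beta, and accuracy for hard labels.

        Hypothetical helper; ``pos_label`` marks which label counts as positive.
        """
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == pos_label and p == pos_label)
        fp = sum(1 for g, p in pairs if g != pos_label and p == pos_label)
        fn = sum(1 for g, p in pairs if g == pos_label and p != pos_label)
        correct = sum(1 for g, p in pairs if g == p)

        # Guard against empty denominators instead of raising.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        b2 = beta * beta
        denom = b2 * precision + recall
        f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
        accuracy = correct / len(pairs) if pairs else 0.0

        return {
            "precision": precision,
            "recall": recall,
            "f_beta": f_beta,
            "accuracy": accuracy,
        }


    # Made-up labels: 1 is treated as the positive class, 2 as the negative.
    gold = [1, 1, 2, 2, 1]
    pred = [1, 2, 2, 1, 1]
    scores = binary_scores(gold, pred, pos_label=1, beta=1.0)

With these sample labels there are two true positives, one false positive, and
one false negative, so precision, recall, and F1 all come out to 2/3 and
accuracy to 3/5.
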
Fixed
^^^^^
* `HiromuHota`_: Modify docstring of functions that return ``get_sparse_matrix``
* `lukehsiao`_: Fix the behavior of ``get_last_documents`` to return Documents
  that are correctly linked to the database and can be navigated by the user.
  (`201 <https://github.com/HazyResearch/fonduer/pull/201>`_)
* `lukehsiao`_: Fix the behavior of ``MentionExtractor`` ``clear`` and
  ``clear_all`` to also delete the Candidates that correspond to the Mentions.