Software libraries and tools

Machine learning and statistical toolkits

Software in this section implements machine learning and inference strategies that can be applied to different problems.

  • gensim in Python. Topic modelling and clustering, including TF-IDF, LSA, LDA, and random projections. Handles input larger than RAM. Licence: LGPL.
  • LingPipe in Java, language models, classification, topic modelling and clustering, tutorials for NER, sentiment analysis, part-of-speech tagging, sentence boundaries, and others. Licence: commercial, with free research use if you make your data freely available.
  • SVM-Light in C (usually compiled and called from the command line). Support Vector Machines. Licence: commercial, with free research use.
  • Weka in Java with GUI. Data pre-processing, classification, regression, clustering, association rules, and visualization. Licence: GPL.
  • N-gram Statistics Package (NSP) in Perl. Standard tests of association for n-grams, including Fisher's exact test, the log-likelihood ratio, and Pearson's chi-squared test. Licence: GPL.
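
As an illustration of the kind of association test listed above, here is a minimal sketch in Python of Pearson's chi-squared and the log-likelihood ratio for a single bigram, computed from a 2x2 contingency table. It does not use NSP or reproduce its interface; the function name and argument names are hypothetical, and it assumes all row and column totals are non-zero. Fisher's exact test is left out to keep the sketch short.

```python
import math


def bigram_association(n11, n1p, np1, npp):
    """Return (chi2, g2) for one bigram (w1, w2).

    Hypothetical helper for illustration only (not NSP's API).
    n11: number of times w1 is followed by w2
    n1p: number of bigrams with w1 in first position
    np1: number of bigrams with w2 in second position
    npp: total number of bigrams in the corpus
    Assumes every row and column total is greater than zero.
    """
    # Observed cell counts, in row-major order.
    observed = [
        n11,                    # w1 followed by w2
        n1p - n11,              # w1 followed by something else
        np1 - n11,              # something else followed by w2
        npp - n1p - np1 + n11,  # neither w1 first nor w2 second
    ]
    row_totals = [n1p, npp - n1p]
    col_totals = [np1, npp - np1]

    # Expected counts under independence, same row-major order.
    expected = [
        row_totals[i] * col_totals[j] / npp
        for i in range(2)
        for j in range(2)
    ]

    # Pearson's chi-squared statistic.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Log-likelihood ratio (G^2); empty cells contribute nothing.
    g2 = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )
    return chi2, g2
```

For example, `bigram_association(30, 40, 40, 100)` scores a bigram that occurs 30 times where independence would predict 16. Both statistics are zero when the observed counts match the expected counts exactly.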

Domain-specific software

Software in this section is designed for particular classes of problem, e.g., machine translation.

Machine translation

Tagging

  • Brill tagger: part-of-speech tagging.
  • C&C tools (for CCG). Licence: commercial, free for non-commercial use.

Parsing

  • Minipar. Licence: commercial, free for non-commercial use.
  • Multilingual Statistical Parsing Engine by Dan Bikel (also known as the "Bikel parser"). Offers many parsing modes, including an emulation of the Collins parser. Licence: free for research purposes.
  • OpenCCG
  • C&C tools (for CCG). Licence: commercial, free for non-commercial use.

Semantics

  • Boxer: input is CCG derivations, output is Discourse Representation Structures. Licence: commercial, free for non-commercial use.

Natural Language Generation

  • XLE
  • MATE
  • OpenCCG
  • KPML
  • SimpleNLG

Morphology

  • XFST
  • SFST