Document Classification
In PHP


@ianbarber - ian@ibuildings.com
http://joind.in/talk/view/587
Document Classification


Defining The Task
Document Pre-processing
Term Selection
Algorithms
What is
Document Classification?
Uses



Ian Barber / @ianbarber / ian@ibuildings.com
 Filter          Organise           Metadata
Filtering -
Binary Classification
Organising -
Single Label Classification
Metadata -
Multiple Label Classification
Manual -
Rules Written By Domain Experts
Machine Learning -
Automatically Extract Rules
Classes




 Training Documents        Test Documents
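
A minimal sketch of carving a manually labelled collection into training and test sets; the 80/20 split and the array shape are assumptions, not from the slides:

// assumed shape: $labelled = array(array('class' => 'spam', 'words' => array(...)), ...)
shuffle($labelled);
$split    = (int) floor(count($labelled) * 0.8);
$training = array_slice($labelled, 0, $split);
$test     = array_slice($labelled, $split);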
Evaluation

                spam              ham

  spam     true positive    false positive

  ham      false negative   true negative
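
The counts in that matrix (columns are the manual label, rows the classifier's judgement) are tallied by running the test documents through the classifier; a minimal sketch, where classify() stands in for whichever classifier is being evaluated and $test holds array('class' => ..., 'words' => ...) items:

$tp = $fp = $tn = $fn = 0;
foreach($test as $item) {
    // compare the classifier's judgement to the manual label
    $judgement = classify($item['words'], $tree);
    if($judgement == 'spam') {
        if($item['class'] == 'spam') { $tp++; } else { $fp++; }
    } else {
        if($item['class'] == 'ham')  { $tn++; } else { $fn++; }
    }
}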
Measures

$accuracy    =
($tp + $tn) / ($tp + $tn + $fp + $fn);

$precision   = $tp / ($tp + $fp);

$recall      = $tp / ($tp + $fn);
$beta = 0.5;

$f =
  ((pow($beta, 2) + 1) * $precision * $recall)
    / ((pow($beta, 2) * $precision) + $recall);




                       Fβ Measure
Vector Space Model -
Bag Of Words
$doc   = strtolower(strip_tags($doc));

$regex = '/[^a-z0-9\s\']/';
$doc   = preg_replace($regex, '', $doc);

$words = preg_split('/\s+/', $doc);




Extract Tokens
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew



       i   really like eggs cabbage and donʼt stew

 A    1     1     1    1      0     0    0     0

 B    1     0     1    0      1     1    1     1
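
A sketch of building those presence/absence vectors in PHP, assuming $docs is an array of token arrays produced by the extraction step above:

// collect the vocabulary across all documents
$vocab = array();
foreach($docs as $words) {
    foreach($words as $word) { $vocab[$word] = true; }
}
$vocab = array_keys($vocab);

// one row per document: 1 if the term is present, 0 if not
$vectors = array();
foreach($docs as $id => $words) {
    foreach($vocab as $term) {
        $vectors[$id][$term] = in_array($term, $words) ? 1 : 0;
    }
}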
[Plot: documents A and B as points in term space, with the weight of "really" on the x-axis and "i" on the y-axis]
$tf    = $termCount / $wordCount;

$idf   = log($totalDocs / $docsWithTerm, 2);

$tfidf = $tf * $idf;




                         Term Weighting
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew



      i really like eggs cabbage and donʼt stew

 A    0 0.25    0 0.25      0      0     0     0

 B    0    0    0    0    0.125 0.125 0.25 0.125
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew



      i really like eggs cabbage and donʼt stew

 A    0   0.5   0   0.5     0      0     0     0

 B    0    0    0    0     0.2    0.2   0.4    0.2
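
The second table is the first one rescaled so each document's weights sum to 1; a sketch of that step (dividing by the Euclidean norm instead would give the unit-length normalisation mentioned in the notes):

foreach($vectors as $id => $weights) {
    $sum = array_sum($weights);
    if($sum > 0) {
        foreach($weights as $term => $tfidf) {
            $vectors[$id][$term] = $tfidf / $sum;
        }
    }
}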
Dimensionality Reduction
Stop Words
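
Stop word removal can be as simple as filtering the token list against a word list (the list below is illustrative only); terms with very low IDF scores can be dropped the same way:

$stopWords = array('a', 'an', 'and', 'the', 'of', 'to', 'in', 'i');
$words = array_values(array_diff($words, $stopWords));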
happening - happen
 happens - happen
happened - happen
http://tartarus.org/~martin/PorterStemmer




Stemming
            spam   ham
 term        $a     $b
not term     $c     $d




           Chi-Square
$a = $termSpam; $b = $termHam;
$c = $restSpam; $d = $restHam;

$total = $a + $b + $c + $d;
$diff = ($a * $d) - ($c * $b);

$chisquare = (
  $total * pow($diff, 2) /
  (($a + $c) * ($b + $d) *
   ($a + $b) * ($c + $d)));

      Chi-Square 1DF
p        chi²
0.1       2.71
0.05      3.84
0.01      6.63
0.005     7.88
0.001    10.83


        p-Value
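
For term selection the score is simply compared against a threshold from that table; a sketch keeping only terms whose chi-square value clears the 99.9% line, assuming $chi maps term => score:

$threshold = 10.83; // p < 0.001 at 1 degree of freedom
$selected  = array();
foreach($chi as $term => $score) {
    if($score >= $threshold) {
        $selected[] = $term;
    }
}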
Decision Tree - ID3

[Diagram: a small decision tree, with a root term question branching to a ✔ leaf on one side and a second question node on the other, which in turn branches to ✖ and ✔ leaves]
Entropy

$entropy =
   -( ($spam/$total)
       * log($spam/$total, 2))
   -( ($ham/$total)
       * log($ham/$total, 2));
[Plot: entropy (y-axis, 0 to 1) against spam/total (x-axis, 0 to 1); entropy is 0 at either extreme and peaks at 1 when spam/total = 0.5]
Information Gain

    $gain = $baseEntropy
      - (($withCount / $total) * $withEntropy)
      - (($woutCount / $total) * $woutEntropy);
Split     Counts   Entropy   Proportion    E*P

 Base     50/50    1         1             1

 With     20/5     0.722     0.25          0.1805

Without   30/45    0.97      0.75          0.7275



        Gain = 1 - 0.1805 - 0.7275 = 0.092
function build($tree, $score) {
  if(!$score[2])      { return 'spam'; }
  else if(!$score[1]) { return 'ham'; }

  list($trees, $scores, $term) =
                     getMaxGain($tree);

  return array($term => array(
    0 => build($trees[0], $scores[0]),
    1 => build($trees[1], $scores[1])
  ));
}
array('hello' =>
   array(
     0 => array('terry' =>
        array (
          0 => 'spam',
          1 => array('everybody' =>
            array(
              0 => 'ham',
              1 => 'spam'
            )
          )
        )
     ),
     1 => 'spam'
   )
);
Classification
function classify($doc, $tree) {
  if(is_string($tree)) {
    return $tree;
  }
  $term = key($tree);
  if(in_array($term, $doc)) {
    return classify($doc, $tree[$term][0]);
  } else {
    return classify($doc, $tree[$term][1]);
  }
}
Overfitting:
Pruning or Stop Conditions
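
A minimal sketch of a stop condition inside build(); this assumes getMaxGain() were extended to also return the gain of the best split, and uses a hypothetical majorityClass() helper. Pruning against a separate validation set is the more thorough alternative:

list($trees, $scores, $term, $gain) = getMaxGain($tree);

if($gain < 0.01) {
    // not worth splitting further: return the majority class at this node
    return majorityClass($tree); // hypothetical helper
}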
K Nearest Neighbour
[Plot: spam and ham documents as points in Term X / Term Y space, each class forming a spatial cluster]

[Plot: a new, unclassified document added to the same Term X / Term Y space]

[Plot: the k = 3 nearest neighbours of the new document, all falling within one class's cluster]
foreach($doca as $term => $tfidf) {
    $distance +=
      pow($tfidf - $docb[$term], 2);
}
$distance = sqrt($distance);




             Euclidean Distance
Cosine Similarity


foreach($doca as $term => $tfidf) {
  $similarity +=
    floatval($tfidf) *
    floatval($docb[$term]);
}
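
That dot product equals the cosine similarity only when both vectors are already unit length; for unnormalised vectors you would divide by the product of the norms, roughly like this:

$dot = $normA = $normB = 0;
foreach($doca as $term => $tfidf) {
    $other  = isset($docb[$term]) ? $docb[$term] : 0;
    $dot   += $tfidf * $other;
    $normA += $tfidf * $tfidf;
}
foreach($docb as $tfidf) {
    $normB += $tfidf * $tfidf;
}
// assumes neither document vector is all zeros
$similarity = $dot / (sqrt($normA) * sqrt($normB));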
foreach($scores as $s) {
    $classes[$s['class']]++;
}

foreach($scores as $s){
    $classes[$s['class']] += $s['sim'];
}

arsort($classes);
$class = key($classes);


                          Classifying
Zend_Search_Lucene
$index = Zend_Search_Lucene::create($db);
$doc = new Zend_Search_Lucene_Document();

$doc->addField(
  Zend_Search_Lucene_Field::Text(
    'class', $class));
$doc->addField(
  Zend_Search_Lucene_Field::UnStored(
    'contents', $content));
$index->addDocument($doc);
Zend_Search_Lucene::setResultSetLimit(25);

$results = $index->find($content);
foreach($results as $result) {
  $classes[$result->class] += 1;
}

arsort($classes);
$class = key($classes);


           Classifying with ZSL
Flax/Xapian Search Service
http://www.flax.co.uk
$flax = new FlaxSearchService('ip:8080');

$db = $flax->createDatabase('test');
$db->addField('class', array(
  'store'      => true,
  'exacttext'  => true));
$db->addField('contents', array(
  'store'      => false,
  'freetext' => array('language'=>'en')));
$db->commit();

$db->addDocument(array(
  'class'    => $class,
  'contents' => $document));
$db->commit();
$db->addDocument(
        array('contents' => $doc), 'foo');
$db->commit();

$results = $db->searchSimilar('foo',0,25);
$db->deleteDocument('foo');
$db->commit();

foreach($results['results'] as $r) {
  if($r['docid'] != 'foo') {
    $classes[$r['data']['class'][0]] += 1;
  }
}

arsort($classes);
$class = key($classes);
[Plot: the spam and ham clusters in Term X / Term Y space, each reduced to a single prototype (centroid) point]
Prototypes For Rocchio

$mul = 1 / $docsInClassCount;

foreach($classDocs as $tid => $tfidf) {
    $prototype[$tid] += $mul * $tfidf;
}
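
Classifying against the prototypes is then a nearest-prototype comparison; a sketch using the same dot-product similarity as before, assuming $prototypes maps class => prototype vector:

$scores = array();
foreach($prototypes as $class => $prototype) {
    $scores[$class] = 0;
    foreach($doc as $term => $tfidf) {
        if(isset($prototype[$term])) {
            $scores[$class] += $tfidf * $prototype[$term];
        }
    }
}

arsort($scores);
$class = key($scores);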
Naive Bayes -
Probability Based Classifier
Bayes Theorem
  Pr(Class|Doc) = Pr(Doc|Class) * Pr(Class) / Pr(Doc)



  Pr(Class|Doc) ∝ Pr(Doc|Class) * Pr(Class)
Likelihood Of Term Occurring
Given Class

  word      spam freq   pr(word|spam)   ham freq   pr(word|ham)

 register     1757          0.11          246          0.02

  sent        487           0.03         4600          0.36
Estimating Likelihood
$this->db->query("
   INSERT INTO class_terms
       (class, term, likelihood)
   SELECT d.class, d.term,
       count(*) / " . $classCount . "
   FROM documents AS d
   JOIN document_terms AS dt USING (did)
   WHERE d.class = '" . $class . "'"
);
Classifying A Document
foreach($classes as $class) {
  $prob[$class] = 0.5; // assume prior

    foreach($document as $term) {
      $prob[$class] *=
            $likely[$term][$class];
    }
}

arsort($prob);
$class = key($prob);
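
Multiplying many small likelihoods underflows floating point quickly, and a single unseen term zeroes the whole product; a common variation (an assumption here, not from the slides) is to sum smoothed log probabilities instead:

foreach($classes as $class) {
    $logProb[$class] = log(0.5); // prior

    foreach($document as $term) {
        // fall back to a small probability for unseen terms
        // (a simple stand-in for proper Laplace smoothing)
        $likelihood = isset($likely[$term][$class])
            ? $likely[$term][$class]
            : 1 / ($classCount + 1);
        $logProb[$class] += log($likelihood);
    }
}

arsort($logProb);
$class = key($logProb);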
Document Classification


Defining The Problem
Document Processing
Term Selection
Algorithm
Image Credits
Title          http://www.flickr.com/photos/themacinator/3499579760/
What is...     http://www.flickr.com/photos/austinevan/1225274637/
Filter         http://www.flickr.com/photos/benimoto/2913950616/
Organise       http://www.flickr.com/photos/ellasdad/425813314/
Metadata       http://www.flickr.com/photos/banky177/2282734063/
Manual         http://www.flickr.com/photos/foundphotoslj/1134150364/
Automatic      http://www.flickr.com/photos/29278394@N00/59538978/
Vector Space   http://www.flickr.com/photos/ethanhein/2260878305/sizes/o/
Reduction      http://www.flickr.com/photos/wili/157220657/sizes/l/
Stemming       http://www.flickr.com/photos/clearlyambiguous/20847530/sizes/l/
Stop words     http://www.flickr.com/photos/afroswede/22237769/
Chi-Squared    http://www.flickr.com/photos/kdkd/2837565850/sizes/o/
ID3            http://www.flickr.com/photos/tonythemisfit/2414239471
Overfitting     http://www.flickr.com/photos/akirkley/3222128726/sizes/l/
Bayes          http://www.flickr.com/photos/darwinbell/440080655/sizes/l/
Conclusion     http://www.flickr.com/photos/mukluk/241256203
Credits        http://www.flickr.com/photos/librarianavengers/413762956/
Questions?



       @ianbarber - ian@ibuildings.com
            http://joind.in/talk/view/587


Editor's Notes

  1. Hello
  2. This is just a quick overview of what we’ll be talking about today
  3. Lots of python/java no PHP classifiers Is it too hardcore? No, algorithms are easy. Widely applicable. So what is it? Assign documents labels from predefined set. Labels can be anything - topic words, non-topic words, metadata whatever Documents in this case is text, web pages, emails, books But it can be really anything as long as you can extract features from it
  4. Classification is really organising of information - do it every day Lots of uses, these are main ones according to me. Might do all three with uploading photos to flickr or facebook Filter, get rid of bad ones. Organise, upload to album or set Tag photos with people in them etc.
  5. Filtering is Class OR Not Class - generally you then hide or remove one lot Binary classification - can break down most things to series of In flickr example, what is good? - photographer, composition, light etc. - some people, friends look good - some people, friends look bad
  6. Organising is putting document in one place - one label chosen from a set of many possible Single label only (often EXACTLY 1, 0 not allowed) Folders, albums, libraries, handwriting recognition
  7. Tagging, can have multiple, often 0 - many labels Often for tagging topics in content E.g. a us-china embargo WTO talk might be filed under, US, China, Trade
  8. 80’s people would come up with rules Then computers would apply rules IF this word AND this WORD then this category Took a lot of time Needed knowledge engineer to get knowledge out of expert into rules Didn’t scale, needed more experts for new categories Subjective - experts disagree Usually result was 60%-90% accurate
  9. Machine Learning people said - ‘look at data’ - Supervised Learning Work out rules based on manually classified examples Scales better, is cheaper, and about as accurate! Only need people to make examples, don’t have to be able to explain their process Look at the picture, it’s easy to see by looking at the groupings what the ‘rule’ for classifying m&ms is
  10. So what do you need? 1. the classes to classify to 2. A set of manually classified documents to train the classifier on 3. A set of manually classified docs to test on In some cases may have a third set of docs for validation
  11. So how do we test? Run the test docs through, and compare manual to automatic judgements Here we’ve got a binary classification, for a spam checker Top is the manual judgement, vertical is classifier judgement Boxes will just be counts of judgements Some classifiers give a graded result, some give a yes/no result. For graded, we might take the top N judgements, or have a threshold they must achieve Either way, in the end we get down to a judgement
  12. With that we can calculate some numbers Accuracy is just correct percentage - not always useful, as we sometimes bias, e.g. FN over FP with spam Precision measures how tight our grouping is - how much can we trust a positive result being really positive Recall measures what percentage of the available positives we capture You can have one without the other, if you reject all but the ones you’re most sure about, you get good precision if you mark everything positive you have a great recall
  13. Because of the balance between of recall and precision, researchers often quote breakeven point This is just where recall and precision are equal F is a more advanced measure, measuring the overlap between the two sets F-Beta just allows weighing precision more than recall, or vice versa. If beta = 0.5, recall is half as important as precision, such as with spam checker If beta = 1, then both are equally important There is also an E measure which is just it’s complement, 1 - F measure
  14. Before we do classifying, we need to choose a way to represent text for some classifiers - indexing All this work is classic Information Retrieval Bag of Words is so called because we discard the structure, and just note down appearances of words Throw away the ordering, any structure at all from web pages etc.
  15. First we have to get the words We can use a variety of methods for extracting tokens About the simplest would probably be something like this We dump all punctuation, everything but basic letters, and split on whitespace. For email, Pear::Mail_mimeDecode is good for extracting the message body We then represent each document as an array, where keys are all terms from all docs And values are whether that particular term present in this particular document This is the document vector
  16. Here is the collection of these two phrases as a vector. 1 if the word is in the document, 0 if not
  17. You can plot the documents on a graph Here the green circle is A the red triangle B I’ve bounced up 0 on the graph just to keep it away from the value labels So our previous document would actually be a point in 8 dimensional space As we have 8 terms Simple enough, but what we really want to do is capture a bit more information - a position on each axis So instead of storing just presence, we store ‘weight’, the value of the term
  18. TFIDF is a classic and very common weight - there are a lot of variations though TF is just percentage of document composed of term IDF is number of docs divided by number with term Gives less common terms a higher weight So best is uncommon term that appears a lot If we look at term weighting our previous by this
  19. The idf means that the ‘i’ and ‘like’ actually disappear here, as they are in all docs Normally that wouldn’t quite happen! But it shows they have no value to the document Don’t gets weighted higher. We’d then usually normalise this, to unit length, to account a bit for doc length differences
  20. There are unnecessary terms here though, I and Like Most algorithms look at all terms, so the increase number of term dimensions can be a problem
  21. The number of dimensions is the whole vocabulary, every words that’s been seen in any document DR or term space reduction is all about removing terms that don’t contribute much This can often be by a factor of 10 or 100!
  22. May have heard of stop words Common in search engines of old Words like ‘of’ ‘the’ ‘an’ - little to no semantic value to us Can use a list of words, or infer it from low idf scores Which would also pick up ‘collection’ stop words that are not necessarily english stop words E.g. if you were classifying documents about pokemon, the words pokemon would probably appear very frequently, and be of little value
  23. Try to come up with ‘root’ word Maps lots of different variations onto one term, reducing dimensions Result is usually not english, it’s just repeatable
  24. Kai-Square - greek not chinese Statistical technique - this is an example of one, but there are many, odds ratio, information gain etc. Keep only terms which are indicative of one class over another We counts up the four values - like truth table from before How many spam docs contain term etc. Looks for importance of term by class by seeing the difference between expected and actual scores Expected values for a cell are rowproduct + colproduct / total Then we look at the square of the difference, divided by the expected value And add all them up
  25. We plug the numbers into this formula, which is a one step way of doing the same thing Comes out with a number which isn’t particularly interesting absolutely But is interesting relatively we can calculate a probability of the events being unrelated using the area from this distribution Number is 1DF because there is one variable and one dependent
  26. Can work out the probability number from a chi-square distribution But for DR, can just use a threshold and remove words with less than that threshold P is the chance that variables are independent - so for > 10.83 we are 99.9% certain the variables are dependent, one changes with the other OK, so we’ve got a good set of data, now we need a classifier
  27. Series of term present/not present questions branches in tree Eventually ending in leaf classification nodes - this is a ‘yes or no’ result, there’s is no grading of similarity Easy to classify, and building algorithm pretty easy Recursive If all collection class, then leaf class Else, choose the best term to split on,and recurse on each branch But how does it determine best?
  28. Calculate entropy - section could be repeated for multiple classes Basically represents how many bits needed to encode the result of a random sequence given this split Easier to see on graph
  29. If 0 or 1, the sequence is all the same class, so no bits If 0.5 it’s 50/50 so you need 1 bit to encode each If less than that, you can use shorter codes for more common spam or ham And longer for less common, so average bit per item is lower
  30. Combine by looking for maximum information gain Entropy of current set minus the weighted entropies of the two new sets
  31. Final col is just entropy times proportion For example, in this example the split looks pretty good The with class is very biased one way But because it’s smaller the information gain isn’t massive
  32. Easy to implement recursive builder Gives us a tree in array format, which we could save by serialising Just need to traverse to classify
  33. A completely made up example of an output tree.
  34. Millions of ways to do this, of course Simple function to return leaf node Assumes document is as array of words
  35. Problem: if you go right to the end, the tree will probably be too specific to the training data Stop condition - min info gain - or pruning Use a separate ‘validation’ set to test effectiveness of tree at different depths Choose most effective DTs generate human interpretable rules - very handy BUT expensive to train, don’t handle loads of dimensions well, and often require rebuilding
  36. KNN is much cheaper at training time - as there is no training Uses the fact that we can regard these as vectors in a N-dimensional space
  37. Lets consider only 2 terms, here we have documents displayed with their weights in terms X and Y Documents of class triangle and class circle They seem to have a spatial cluster
  38. We can work out the class of the new one by looking at it’s nearest neighbours The K is how many we look for
  39. In this case K is three, and the nearest three, as you can see, are all green circles. Choosing K is kind of hard, you might try a few different values but it’s usually in the 10-30 doc range Only real challenge is comparing documents Here we can see we are looking at just the X and Y distance, this is the euclidean distance
  40. Very easy. Simply looking at the difference between one and the other Can actually do the whole thing in the database ! But, has some problems, so more common...
  41. Alternative measurement, goes to 1 for identical, 0 for orthogonal, -1 for opposite Easy to do with normalised vectors - just take dot product Covers some cases euclidean is less good at
  42. We’ve got two options when classifying - can count most common as in first loop But this system gives us a grading of matching, the distance Or we weight on how similar they are - on the assumption the best matches are most indicative here we’re just adding the similarity, the closer the match the higher the value Could get much more fancy with weighting schemes of course In multiple we might take any class that gets over a certain weight in fact. But, still a bit of a pain to do in PHP - compare to every training document Lots of ways to optimise, because search engines do a very similar job, similarity wise Why not use one?
  43. Search engines are usually not designed to take whole documents as queries So, some fudging needed, like looking at only subject lines Not necessarily great results, but very easy to implement Good for twitter, or shorter applications perhaps
  44. Just implementing K using the result limit Will also want to replace ? and * characters Or could add terms through the API Still, a bit of a sketchy classifier
  45. Flax is based on the open source Xapian engine, kind of like their Solr Has a similarity search that makes KNN ridiculously easy and very effective The version with the PHP client is in SVN trunk at the moment, but is stable
  46. This code creates a database, adds two fields to it, and indexes a document
  47. Very similar to lucene loop Except we add then remove a document to use similarity feature Gets good accuracy and is pretty fast. However, if we want to use this kind of technique and don’t have a flax handy, there is another related technique
  48. Instead of taking each value and comparing it We take the *average* of all the documents in each class And compare against that This works surprisingly well!
  49. Here we compute the centroid of all the class By summing the weights, and multiplying by 1/the count. You might do this in the database, pretty straightforward op. Called a Rocchio classifier because it’s based on a relevance feedback technique by Rocchio
  50. Quick and easy probability based classifier Very commonly used in spam checking Naive assumption is that words are independent - which is clearly not true Means that we don’t need an example for each combination of attributes, which is very helpful for docs! Bayes is good at very high dimensionality because of this
  51. Take this slow!! Read the pipe as ‘given’, pr as probability of All classes are using the same doc, and since we only care about most likely, we can drop that bit Prob of class is easy, can either work it out as a likelihood or just assume 0.5 (for binary) So we just have work out the probability of the document given the class, which we can treat as the product of the likelihood of it’s terms occurring given class
  52. We can look at the data itself to calculate the term likelihoods Simply looking at the conditional probability, the number of times that the term occurs along with the given class divided by the total appearances of that class
  53. We can calculate it in a SQL query if storing the data. Assuming we’ve stored the total count in the class count, and the class in class
  54. The independence assumptions lets us treat that as the product of the probabilities of each individual term given class. Here we calc it by looping over the terms in a doc, and times it by the prior probability - probably 0.5. This is multi-bernoulli bayes, there is also a version multinomial bayes which calculates likelihood based on relative term frequencies. For that we’d raise the likelihood to the power term freq (count), and likelihood is the sum of the counts of that word in each doc in class (+1) divided by the sum of counts of all words in class (+ num terms)
  55. To sum up, what we have here handles a wide variety of problems The first step is recognising that something is a classification problem - context spelling - author identification - intrusion detection - determining genes in DNA sections Then you just need to extract features from the docs And apply a learner. Hope that everyone has this in their mental toolbox for different kinds of challenges
  56. Thanks to the people who put their photos on flickr under Creative Commons And also thanks to Lorenzo Alberton who gave me advice on this talk
  57. Any questions?