Document Classification
In PHP


@ianbarber - ian@ibuildings.com
http://joind.in/talk/view/587
Document Classification


Defining The Task
Document Pre-processing
Term Selection
Algorithms
What is
Document Classification?
Uses



Ian Barber / @ianbarber / ian@ibuildings.com
 Filter          Organise           Metadata
Filtering -
Binary Classification
Organising -
Single Label Classification
Metadata -
Multiple Label Classification
Manual -
Rules Written By Domain Experts
Machine Learning -
Automatically Extract Rules
Classes




 Training Documents        Test Documents
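
A minimal sketch of carving a manually labelled collection into training and test sets; the 80/20 split and the array shape are assumptions, not from the slides:

// assumed shape: $labelled = array(array('class' => 'spam', 'words' => array(...)), ...)
shuffle($labelled);
$split    = (int) floor(count($labelled) * 0.8);
$training = array_slice($labelled, 0, $split);
$test     = array_slice($labelled, $split);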
Evaluation

                spam              ham

  spam     true positive    false positive

  ham      false negative   true negative
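
The counts in that matrix (columns are the manual label, rows the classifier's judgement) are tallied by running the test documents through the classifier; a minimal sketch, where classify() stands in for whichever classifier is being evaluated and $test holds array('class' => ..., 'words' => ...) items:

$tp = $fp = $tn = $fn = 0;
foreach($test as $item) {
    // compare the classifier's judgement to the manual label
    $judgement = classify($item['words'], $tree);
    if($judgement == 'spam') {
        if($item['class'] == 'spam') { $tp++; } else { $fp++; }
    } else {
        if($item['class'] == 'ham')  { $tn++; } else { $fn++; }
    }
}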
Measures

$accuracy    =
($tp + $tn) / ($tp + $tn + $fp + $fn);

$precision   = $tp / ($tp + $fp);

$recall      = $tp / ($tp + $fn);
$beta = 0.5;

$f =
  ((pow($beta, 2) + 1) * $precision * $recall)
    / ((pow($beta, 2) * $precision) + $recall);




                       Fβ Measure
Vector Space Model -
Bag Of Words
$doc   = strtolower(strip_tags($doc));

$regex = '/[^a-z0-9\s\']/';
$doc   = preg_replace($regex, '', $doc);

$words = preg_split('/\s+/', $doc);




Extract Tokens
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew



       i   really like eggs cabbage and donʼt stew

 A    1     1     1    1      0     0    0     0

 B    1     0     1    0      1     1    1     1
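
A sketch of building those presence/absence vectors in PHP, assuming $docs is an array of token arrays produced by the extraction step above:

// collect the vocabulary across all documents
$vocab = array();
foreach($docs as $words) {
    foreach($words as $word) { $vocab[$word] = true; }
}
$vocab = array_keys($vocab);

// one row per document: 1 if the term is present, 0 if not
$vectors = array();
foreach($docs as $id => $words) {
    foreach($vocab as $term) {
        $vectors[$id][$term] = in_array($term, $words) ? 1 : 0;
    }
}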
[Plot: documents A and B as points in term space, with the weight of "really" on the x-axis and "i" on the y-axis]
$tf    = $termCount / $wordCount;

$idf   = log($totalDocs / $docsWithTerm, 2);

$tfidf = $tf * $idf;




                         Term Weighting
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew



      i really like eggs cabbage and donʼt stew

 A    0 0.25    0 0.25      0      0     0     0

 B    0    0    0    0    0.125 0.125 0.25 0.125
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew



      i really like eggs cabbage and donʼt stew

 A    0   0.5   0   0.5     0      0     0     0

 B    0    0    0    0     0.2    0.2   0.4    0.2
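
The second table is the first one rescaled so each document's weights sum to 1; a sketch of that step (dividing by the Euclidean norm instead would give the unit-length normalisation mentioned in the notes):

foreach($vectors as $id => $weights) {
    $sum = array_sum($weights);
    if($sum > 0) {
        foreach($weights as $term => $tfidf) {
            $vectors[$id][$term] = $tfidf / $sum;
        }
    }
}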
Dimensionality Reduction
Stop Words
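
Stop word removal can be as simple as filtering the token list against a word list (the list below is illustrative only); terms with very low IDF scores can be dropped the same way:

$stopWords = array('a', 'an', 'and', 'the', 'of', 'to', 'in', 'i');
$words = array_values(array_diff($words, $stopWords));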
happening - happen
 happens - happen
happened - happen
http://tartarus.org/~martin/PorterStemmer




Stemming
            spam   ham
 term        $a     $b
not term     $c     $d




           Chi-Square
$a = $termSpam; $b = $termHam;
$c = $restSpam; $d = $restHam;

$total = $a + $b + $c + $d;
$diff = ($a * $d) - ($c * $b);

$chisquare = (
  $total * pow($diff, 2) /
  (($a + $c) * ($b + $d) *
   ($a + $b) * ($c + $d)));

      Chi-Square 1DF
p        chi²
0.1       2.71
0.05      3.84
0.01      6.63
0.005     7.88
0.001    10.83


        p-Value
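
For term selection the score is simply compared against a threshold from that table; a sketch keeping only terms whose chi-square value clears the 99.9% line, assuming $chi maps term => score:

$threshold = 10.83; // p < 0.001 at 1 degree of freedom
$selected  = array();
foreach($chi as $term => $score) {
    if($score >= $threshold) {
        $selected[] = $term;
    }
}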
Decision Tree - ID3

[Diagram: a small decision tree, with a root term question branching to a ✔ leaf on one side and a second question node on the other, which in turn branches to ✖ and ✔ leaves]
Entropy

$entropy =
   -( ($spam/$total)
       * log($spam/$total, 2))
   -( ($ham/$total)
       * log($ham/$total, 2));
[Plot: entropy (y-axis, 0 to 1) against spam/total (x-axis, 0 to 1); entropy is 0 at either extreme and peaks at 1 when spam/total = 0.5]
Information Gain

    $gain = $baseEntropy
      - (($withCount / $total) * $withEntropy)
      - (($woutCount / $total) * $woutEntropy);
Split     Counts   Entropy   Proportion    E*P

 Base     50/50    1         1             1

 With     20/5     0.722     0.25          0.1805

Without   30/45    0.97      0.75          0.7275



        Gain = 1 - 0.1805 - 0.7275 = 0.092
function build($tree, $score) {
  if(!$score[2])      { return 'spam'; }
  else if(!$score[1]) { return 'ham'; }

  list($trees, $scores, $term) =
                     getMaxGain($tree);

  return array($term => array(
    0 => build($trees[0], $scores[0]),
    1 => build($trees[1], $scores[1])
  ));
}
array('hello' =>
   array(
     0 => array('terry' =>
        array (
          0 => 'spam',
          1 => array('everybody' =>
            array(
              0 => 'ham',
              1 => 'spam'
            )
          )
        )
     ),
     1 => 'spam'
   )
);
Classification
function classify($doc, $tree) {
  if(is_string($tree)) {
    return $tree;
  }
  $term = key($tree);
  if(in_array($term, $doc)) {
    return classify($doc, $tree[$term][0]);
  } else {
    return classify($doc, $tree[$term][1]);
  }
}
Overfitting:
Pruning or Stop Conditions
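
A minimal sketch of a stop condition inside build(); this assumes getMaxGain() were extended to also return the gain of the best split, and uses a hypothetical majorityClass() helper. Pruning against a separate validation set is the more thorough alternative:

list($trees, $scores, $term, $gain) = getMaxGain($tree);

if($gain < 0.01) {
    // not worth splitting further: return the majority class at this node
    return majorityClass($tree); // hypothetical helper
}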
K Nearest Neighbour
[Plot: spam and ham documents as points in Term X / Term Y space, each class forming a spatial cluster]

[Plot: a new, unclassified document added to the same Term X / Term Y space]

[Plot: the k = 3 nearest neighbours of the new document, all falling within one class's cluster]
foreach($doca as $term => $tfidf) {
    $distance +=
      pow($tfidf - $docb[$term], 2);
}
$distance = sqrt($distance);




             Euclidean Distance
Cosine Similarity


foreach($doca as $term => $tfidf) {
  $similarity +=
    floatval($tfidf) *
    floatval($docb[$term]);
}
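
That dot product equals the cosine similarity only when both vectors are already unit length; for unnormalised vectors you would divide by the product of the norms, roughly like this:

$dot = $normA = $normB = 0;
foreach($doca as $term => $tfidf) {
    $other  = isset($docb[$term]) ? $docb[$term] : 0;
    $dot   += $tfidf * $other;
    $normA += $tfidf * $tfidf;
}
foreach($docb as $tfidf) {
    $normB += $tfidf * $tfidf;
}
// assumes neither document vector is all zeros
$similarity = $dot / (sqrt($normA) * sqrt($normB));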
foreach($scores as $s) {
    $classes[$s['class']]++;
}

foreach($scores as $s){
    $classes[$s['class']] += $s['sim'];
}

arsort($classes);
$class = key($classes);


                          Classifying
Zend_Search_Lucene
$index = Zend_Search_Lucene::create($db);
$doc = new Zend_Search_Lucene_Document();

$doc->addField(
  Zend_Search_Lucene_Field::Text(
    'class', $class));
$doc->addField(
  Zend_Search_Lucene_Field::UnStored(
    'contents', $content));
$index->addDocument($doc);
Zend_Search_Lucene::setResultSetLimit(25);

$results = $index->find($content);
foreach($results as $result) {
  $classes[$result->class] += 1;
}

arsort($classes);
$class = key($classes);


           Classifying with ZSL
Flax/Xapian Search Service
http://www.flax.co.uk
$flax = new FlaxSearchService('ip:8080');

$db = $flax->createDatabase('test');
$db->addField('class', array(
  'store'      => true,
  'exacttext'  => true));
$db->addField('contents', array(
  'store'      => false,
  'freetext' => array('language'=>'en')));
$db->commit();

$db->addDocument(array(
  'class'    => $class,
  'contents' => $document));
$db->commit();
$db->addDocument(
        array('contents' => $doc), 'foo');
$db->commit();

$results = $db->searchSimilar('foo',0,25);
$db->deleteDocument('foo');
$db->commit();

foreach($results['results'] as $r) {
  if($r['docid'] != 'foo') {
    $classes[$r['data']['class'][0]] += 1;
  }
}

arsort($classes);
$class = key($classes);
[Plot: the spam and ham clusters in Term X / Term Y space, each reduced to a single prototype (centroid) point]
Prototypes For Rocchio

$mul = 1 / $docsInClassCount;

foreach($classDocs as $tid => $tfidf) {
    $prototype[$tid] += $mul * $tfidf;
}
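
Classifying against the prototypes is then a nearest-prototype comparison; a sketch using the same dot-product similarity as before, assuming $prototypes maps class => prototype vector:

$scores = array();
foreach($prototypes as $class => $prototype) {
    $scores[$class] = 0;
    foreach($doc as $term => $tfidf) {
        if(isset($prototype[$term])) {
            $scores[$class] += $tfidf * $prototype[$term];
        }
    }
}

arsort($scores);
$class = key($scores);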
Naive Bayes -
Probability Based Classifier
Bayes Theorem
  Pr(Class|Doc) = Pr(Doc|Class) * Pr(Class) / Pr(Doc)



  Pr(Class|Doc) ∝ Pr(Doc|Class) * Pr(Class)
Likelihood Of Term Occurring
Given Class

  word      spam freq   pr(word|spam)   ham freq   pr(word|ham)

 register     1757          0.11          246          0.02

  sent        487           0.03         4600          0.36
Estimating Likelihood
$this->db->query("
   INSERT INTO class_terms
       (class, term, likelihood)
   SELECT d.class, d.term,
       count(*) / " . $classCount . "
   FROM documents AS d
   JOIN document_terms AS dt USING (did)
   WHERE d.class = '" . $class . "'"
);
Classifying A Document
foreach($classes as $class) {
  $prob[$class] = 0.5; // assume prior

    foreach($document as $term) {
      $prob[$class] *=
            $likely[$term][$class];
    }
}

arsort($prob);
$class = key($prob);
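
Multiplying many small likelihoods underflows floating point quickly, and a single unseen term zeroes the whole product; a common variation (an assumption here, not from the slides) is to sum smoothed log probabilities instead:

foreach($classes as $class) {
    $logProb[$class] = log(0.5); // prior

    foreach($document as $term) {
        // fall back to a small probability for unseen terms
        // (a simple stand-in for proper Laplace smoothing)
        $likelihood = isset($likely[$term][$class])
            ? $likely[$term][$class]
            : 1 / ($classCount + 1);
        $logProb[$class] += log($likelihood);
    }
}

arsort($logProb);
$class = key($logProb);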
Document Classification


Defining The Problem
Document Processing
Term Selection
Algorithm
Image Credits
Title          http://www.flickr.com/photos/themacinator/3499579760/
What is...     http://www.flickr.com/photos/austinevan/1225274637/
Filter         http://www.flickr.com/photos/benimoto/2913950616/
Organise       http://www.flickr.com/photos/ellasdad/425813314/
Metadata       http://www.flickr.com/photos/banky177/2282734063/
Manual         http://www.flickr.com/photos/foundphotoslj/1134150364/
Automatic      http://www.flickr.com/photos/29278394@N00/59538978/
Vector Space   http://www.flickr.com/photos/ethanhein/2260878305/sizes/o/
Reduction      http://www.flickr.com/photos/wili/157220657/sizes/l/
Stemming       http://www.flickr.com/photos/clearlyambiguous/20847530/sizes/l/
Stop words     http://www.flickr.com/photos/afroswede/22237769/
Chi-Squared    http://www.flickr.com/photos/kdkd/2837565850/sizes/o/
ID3            http://www.flickr.com/photos/tonythemisfit/2414239471
Overfitting     http://www.flickr.com/photos/akirkley/3222128726/sizes/l/
Bayes          http://www.flickr.com/photos/darwinbell/440080655/sizes/l/
Conclusion     http://www.flickr.com/photos/mukluk/241256203
Credits        http://www.flickr.com/photos/librarianavengers/413762956/
Questions?



       @ianbarber - ian@ibuildings.com
            http://joind.in/talk/view/587


Editor's Notes

  1. Hello
  2. This is just a quick overview of what we’ll be talking about today
  3. Lots of python/java no PHP classifiers Is it too hardcore? No, algorithms are easy. Widely applicable. So what is it? Assign documents labels from predefined set. Labels can be anything - topic words, non-topic words, metadata whatever Documents in this case is text, web pages, emails, books But it can be really anything as long as you can extract features from it
  4. Classification is really organising of information - do it every day Lots of uses, these are main ones according to me. Might do all three with uploading photos to flickr or facebook Filter, get rid of bad ones. Organise, upload to album or set Tag photos with people in them etc.
  5. Filtering is Class OR Not Class - generally you then hide or remove one lot Binary classification - can break down most things to series of In flickr example, what is good? - photographer, composition, light etc. - some people, friends look good - some people, friends look bad
  6. Organising is putting document in one place - one label chosen from a set of many possible Single label only (often EXACTLY 1, 0 not allowed) Folders, albums, libraries, handwriting recognition
  7. Tagging, can have multiple, often 0 - many labels Often for tagging topics in content E.g. a us-china embargo WTO talk might be filed under, US, China, Trade
  8. 80’s people would come up with rules Then computers would apply rules IF this word AND this WORD then this category Took a lot of time Needed knowledge engineer to get knowledge out of expert into rules Didn’t scale, needed more experts for new categories Subjective - experts disagree Usually result was 60%-90% accurate
  9. Machine Learning people said - ‘look at data’ - Supervised Learning Work out rules based on manually classified examples Scales better, is cheaper, and about as accurate! Only need people to make examples, don’t have to be able to explain their process Look at the picture, it’s easy to see by looking at the groupings what the ‘rule’ for classifying m&ms is
  10. So what do you need? 1. the classes to classify to 2. A set of manually classified documents to train the classifier on 3. A set of manually classified docs to test on In some cases may have a third set of docs for validation
  11. So how do we test? Run the test docs through, and compare manual to automatic judgements Here we’ve got a binary classification, for a spam checker Top is the manual judgement, vertical is classifier judgement Boxes will just be counts of judgements Some classifiers give a graded result, some give a yes/no result. For graded, we might take the top N judgements, or have a threshold they must achieve Either way, in the end we get down to a judgement
  12. With that we can calculate some numbers Accuracy is just correct percentage - not always useful, as we sometimes bias, e.g. FN over FP with spam Precision measures how tight our grouping is - how much can we trust a positive result being really positive Recall measures what percentage of the available positives we capture You can have one without the other, if you reject all but the ones you’re most sure about, you get good precision if you mark everything positive you have a great recall
  13. Because of the balance between of recall and precision, researchers often quote breakeven point This is just where recall and precision are equal F is a more advanced measure, measuring the overlap between the two sets F-Beta just allows weighing precision more than recall, or vice versa. If beta = 0.5, recall is half as important as precision, such as with spam checker If beta = 1, then both are equally important There is also an E measure which is just it’s complement, 1 - F measure
  14. Before we do classifying, we need to choose a way to represent text for some classifiers - indexing All this work is classic Information Retrieval Bag of Words is so called because we discard the structure, and just note down appearances of words Throw away the ordering, any structure at all from web pages etc.
  15. First we have to get the words We can use a variety of methods for extracting tokens About the simplest would probably be something like this We dump all punctuation, everything but basic letters, and split on whitespace. For email, Pear::Mail_mimeDecode is good for extracting the message body We then represent each document as an array, where keys are all terms from all docs And values are whether that particular term present in this particular document This is the document vector
  16. Here is the collection of these two phrases as a vector. 1 if the word is in the document, 0 if not
  17. You can plot the documents on a graph Here the green circle is A the red triangle B I’ve bounced up 0 on the graph just to keep it away from the value labels So our previous document would actually be a point in 8 dimensional space As we have 8 terms Simple enough, but what we really want to do is capture a bit more information - a position on each axis So instead of storing just presence, we store ‘weight’, the value of the term
  18. TFIDF is a classic and very common weight - there are a lot of variations though TF is just percentage of document composed of term IDF is number of docs divided by number with term Gives less common terms a higher weight So best is uncommon term that appears a lot If we look at term weighting our previous by this
  19. The idf means that the ‘i’ and ‘like’ actually disappear here, as they are in all docs Normally that wouldn’t quite happen! But it shows they have no value to the document Don’t gets weighted higher. We’d then usually normalise this, to unit length, to account a bit for doc length differences
  20. There are unnecessary terms here though, I and Like Most algorithms look at all terms, so the increase number of term dimensions can be a problem
  21. The number of dimensions is the whole vocabulary, every words that’s been seen in any document DR or term space reduction is all about removing terms that don’t contribute much This can often be by a factor of 10 or 100!
  22. May have heard of stop words Common in search engines of old Words like ‘of’ ‘the’ ‘an’ - little to no semantic value to us Can use a list of words, or infer it from low idf scores Which would also pick up ‘collection’ stop words that are not necessarily english stop words E.g. if you were classifying documents about pokemon, the words pokemon would probably appear very frequently, and be of little value
  23. Try to come up with ‘root’ word Maps lots of different variations onto one term, reducing dimensions Result is usually not english, it’s just repeatable
  24. Kai-Square - greek not chinese Statistical technique - this is an example of one, but there are many, odds ratio, information gain etc. Keep only terms which are indicative of one class over another We counts up the four values - like truth table from before How many spam docs contain term etc. Looks for importance of term by class by seeing the difference between expected and actual scores Expected values for a cell are rowproduct + colproduct / total Then we look at the square of the difference, divided by the expected value And add all them up
  25. We plug the numbers into this formula, which is a one step way of doing the same thing Comes out with a number which isn’t particularly interesting absolutely But is interesting relatively we can calculate a probability of the events being unrelated using the area from this distribution Number is 1DF because there is one variable and one dependent
  26. Can work out the probability number from a chi-square distribution But for DR, can just use a threshold and remove words with less than that threshold P is the chance that variables are independent - so for > 10.83 we are 99.9% certain the variables are dependent, one changes with the other OK, so we’ve got a good set of data, now we need a classifier
  27. Series of term present/not present questions branches in tree Eventually ending in leaf classification nodes - this is a ‘yes or no’ result, there’s is no grading of similarity Easy to classify, and building algorithm pretty easy Recursive If all collection class, then leaf class Else, choose the best term to split on,and recurse on each branch But how does it determine best?
  28. Calculate entropy - section could be repeated for multiple classes Basically represents how many bits needed to encode the result of a random sequence given this split Easier to see on graph
  29. If 0 or 1, the sequence is all the same class, so no bits If 0.5 it’s 50/50 so you need 1 bit to encode each If less than that, you can use shorter codes for more common spam or ham And longer for less common, so average bit per item is lower
  30. Combine by looking for maximum information gain Entropy of current set minus the weighted entropies of the two new sets
  31. Final col is just entropy times proportion For example, in this example the split looks pretty good The with class is very biased one way But because it’s smaller the information gain isn’t massive
  32. Easy to implement recursive builder Gives us a tree in array format, which we could save by serialising Just need to traverse to classify
  33. A completely made up example of an output tree.
  34. Millions of ways to do this, of course Simple function to return leaf node Assumes document is as array of words
  35. Problem: if you go right to the end, the tree will probably be too specific to the training data Stop condition - min info gain - or pruning Use a separate ‘validation’ set to test effectiveness of tree at different depths Choose most effective DTs generate human interpretable rules - very handy BUT expensive to train, don’t handle loads of dimensions well, and often require rebuilding
  36. KNN is much cheaper at training time - as there is no training Uses the fact that we can regard these as vectors in a N-dimensional space
  37. Lets consider only 2 terms, here we have documents displayed with their weights in terms X and Y Documents of class triangle and class circle They seem to have a spatial cluster
  38. We can work out the class of the new one by looking at it’s nearest neighbours The K is how many we look for
  39. In this case K is three, and the nearest three, as you can see, are all green circles. Choosing K is kind of hard, you might try a few different values but it’s usually in the 10-30 doc range Only real challenge is comparing documents Here we can see we are looking at just the X and Y distance, this is the euclidean distance
  40. Very easy. Simply looking at the difference between one and the other Can actually do the whole thing in the database ! But, has some problems, so more common...
  41. Alternative measurement, goes to 1 for identical, 0 for orthogonal, -1 for opposite Easy to do with normalised vectors - just take dot product Covers some cases euclidean is less good at
  42. We’ve got two options when classifying - can count most common as in first loop But this system gives us a grading of matching, the distance Or we weight on how similar they are - on the assumption the best matches are most indicative here we’re just adding the similarity, the closer the match the higher the value Could get much more fancy with weighting schemes of course In multiple we might take any class that gets over a certain weight in fact. But, still a bit of a pain to do in PHP - compare to every training document Lots of ways to optimise, because search engines do a very similar job, similarity wise Why not use one?
  43. Search engines are usually not designed to take whole documents as queries So, some fudging needed, like looking at only subject lines Not necessarily great results, but very easy to implement Good for twitter, or shorter applications perhaps
  44. Just implementing K using the result limit Will also want to replace ? and * characters Or could add terms through the API Still, a bit of a sketchy classifier
  45. Flax is based on the open source Xapian engine, kind of like their Solr Has a similarity search that makes KNN ridiculously easy and very effective The version with the PHP client is in SVN trunk at the moment, but is stable
  46. This code creates a database, adds two fields to it, and indexes a document
  47. Very similar to lucene loop Except we add then remove a document to use similarity feature Gets good accuracy and is pretty fast. However, if we want to use this kind of technique and don’t have a flax handy, there is another related technique
  48. Instead of taking each value and comparing it We take the *average* of all the documents in each class And compare against that This works surprisingly well!
  49. Here we compute the centroid of all the class By summing the weights, and multiplying by 1/the count. You might do this in the database, pretty straightforward op. Called a Rocchio classifier because it’s based on a relevance feedback technique by Rocchio
  50. Quick and easy probability based classifier Very commonly used in spam checking Naive assumption is that words are independent - which is clearly not true Means that we don’t need an example for each combination of attributes, which is very helpful for docs! Bayes is good at very high dimensionality because of this
  51. Take this slow!! Read the pipe as ‘given’, pr as probability of All classes are using the same doc, and since we only care about most likely, we can drop that bit Prob of class is easy, can either work it out as a likelihood or just assume 0.5 (for binary) So we just have work out the probability of the document given the class, which we can treat as the product of the likelihood of it’s terms occurring given class
  52. We can look at the data itself to calculate the term likelihoods Simply looking at the conditional probability, the number of times that the term occurs along with the given class divided by the total appearances of that class
  53. We can calculate it in a SQL query if storing the data. Assuming we’ve stored the total count in the class count, and the class in class
  54. The independence assumptions lets us treat that as the product of the probabilities of each individual term given class. Here we calc it by looping over the terms in a doc, and times it by the prior probability - probably 0.5. This is multi-bernoulli bayes, there is also a version multinomial bayes which calculates likelihood based on relative term frequencies. For that we’d raise the likelihood to the power term freq (count), and likelihood is the sum of the counts of that word in each doc in class (+1) divided by the sum of counts of all words in class (+ num terms)
  55. To sum up, what we have here handles a wide variety of problems The first step is recognising that something is a classification problem - context spelling - author identification - intrusion detection - determining genes in DNA sections Then you just need to extract features from the docs And apply a learner. Hope that everyone has this in their mental toolbox for different kinds of challenges
  56. Thanks to the people who put their photos on flickr under Creative Commons And also thanks to Lorenzo Alberton who gave me advice on this talk
  57. Any questions?