Marmara University

CSE 536 Text Processing

Prof. M. Akif Eyler

Textbook: Witten, Moffat, Bell, Managing Gigabytes 2nd Ed, Morgan Kaufmann, 1999
Group page: http://groups-beta.google.com/group/cse536

Programming work in this course will be done in Java. Prior knowledge of Java is not required, but previous experience with some programming language and a strong desire to improve your object-oriented programming skills are necessary. Weekly assignments and regular attendance are required. The course aims both to use existing tools in the Java class library and to develop new ones.

Weekly Outline

Feb 22 1. Overview of the course
Mar 1 2. Java classes related to text processing
Mar 8 3. Text Compression: Huffman coding
Mar 15 4. Text Compression: Adaptive methods
Mar 22 5. Dictionary models, ZIP and JAR files
Mar 29 6. Indexing: inverted files
Apr 5 7. Comparison of indexing methods
Apr 12 8. (midterm)
Apr 19 9. Querying: accessing the lexicon
Apr 26 10. Ranking and information retrieval
May 3 11. Cosine measure
May 10 12. Index construction
May 17 13. Web pages, HTML files
May 24 14. (Term Project Presentation)

Grading
Midterm 25%
Assignments 20%
Term Project 20%
Final 35%

Term Project (individual work)
Design and implement a simple but powerful search system.
Your design should include indexing and querying components.
Ranking is necessary. Compression is optional.
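
As a rough illustration of how the indexing, querying and ranking components can fit together, here is a minimal sketch. It is one possible design, not the required one. The class name TinySearch, its methods, and the scoring rule (a plain sum of term frequency times a logarithmic inverse document frequency, with no length normalization) are all assumptions made for this example, and it assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example class: an in-memory inverted file with ranked queries.
class TinySearch {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> index = new TreeMap<>();
    private final List<String> docs = new ArrayList<>();

    /** Indexing component: adds one document and returns its id. */
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : tokenize(text)) {
            index.computeIfAbsent(term, t -> new HashMap<>()).merge(id, 1, Integer::sum);
        }
        return id;
    }

    /** Lower-cases the text and splits it on anything that is not a letter a-z or a digit. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    /** Querying and ranking component: ids of matching documents, best score first. */
    List<Integer> search(String query) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<Integer, Integer> postings = index.get(term);
            if (postings == null) continue;            // term is not in the lexicon
            // Assumed weighting: rarer terms count more.
            double idf = Math.log(1.0 + (double) docs.size() / postings.size());
            for (Map.Entry<Integer, Integer> p : postings.entrySet()) {
                scores.merge(p.getKey(), p.getValue() * idf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        // Higher score first; ties are broken by the smaller document id.
        ranked.sort((a, b) -> {
            int c = Double.compare(scores.get(b), scores.get(a));
            return c != 0 ? c : Integer.compare(a, b);
        });
        return ranked;
    }
}
```

A document that contains a query term more often, or contains a rarer query term, is listed earlier. A real project would replace the scoring rule with whatever ranking method you choose and would keep the index on disk for a large corpus.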

Your corpus may be plain text, HTML, source code, etc.
It should contain a large number of documents.
If the corpus is a single file, its paragraphs may serve as documents.
If it is source code, the documents may be files or methods.
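
For the single-file case, the sketch below shows one way to treat paragraphs as documents. It assumes that paragraphs are separated by one or more blank lines, which may not hold for your corpus. The class name ParagraphSplitter is made up for this example, and the \R line-break pattern requires Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: turns the text of one file into a list of documents.
class ParagraphSplitter {
    /** Splits on blank lines (assumed paragraph separator) and drops empty pieces. */
    static List<String> split(String fileText) {
        List<String> paragraphs = new ArrayList<>();
        // A line break, optional whitespace (including further blank lines), a line break.
        for (String piece : fileText.split("\\R\\s*\\R")) {
            String p = piece.trim();
            if (!p.isEmpty()) paragraphs.add(p);
        }
        return paragraphs;
    }
}
```

Each returned string can then be indexed as a separate document. Line breaks inside a paragraph are kept as they are.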

Present your design and a sample run.
Do not include source code.

Presentation: Tue, May 24 (five minutes)
Be ready for a demo

Last update: May 2005