CSE 536 Text Processing
Prof M. Akif Eyler
Textbook: Witten, Moffat, Bell, Managing Gigabytes, 2nd ed., Morgan Kaufmann, 1999
Group page:
http://groups-beta.google.com/group/cse536
Programming work in this course will be done in Java. Prior knowledge of Java is not required, but previous experience with some programming language and a strong desire to improve your object-oriented programming skills are necessary. Weekly assignments and regular attendance are required. The course aims both to use existing tools in the Java class library and to develop new ones.
| Feb 22 | 1. Overview of the course |
| Mar 1 | 2. Java classes related to text processing |
| Mar 8 | 3. Text Compression: Huffman coding |
| Mar 15 | 4. Text Compression: Adaptive methods |
| Mar 22 | 5. Dictionary models, ZIP and JAR files |
| Mar 29 | 6. Indexing: inverted files |
| Apr 5 | 7. Comparison of indexing methods |
| Apr 12 | 8. (midterm) |
| Apr 19 | 9. Querying: accessing the lexicon |
| Apr 26 | 10. Ranking and information retrieval |
| May 3 | 11. Cosine measure |
| May 10 | 12. Index construction |
| May 17 | 13. Web pages, HTML files |
| May 24 | 14. (Term Project Presentation) |
Grading
Midterm 25%
Assignments 20%
Term Project 20%
Final 35%
Term Project (individual work)
Design and implement a simple but powerful search system.
Your design should include indexing and querying components.
Ranking is necessary. Compression is optional.
Your corpus may be plain text, HTML, source code, etc.
Supply a large number of documents.
If the corpus is a single file, its paragraphs may serve as documents.
For source code, the documents may be files or methods.
Present your design and sample run.
Do not include source code.
Presentation: Tue, May 24 (five minutes)
Be ready for a demo
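
As a rough illustration of the indexing, querying, and ranking components the term project asks for, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration, not a required design: the class name TinySearch, the tokenizer, and the weighting (1 + ln f) * ln(1 + N/df), which is only one of several common TF-IDF variants. Scores are cosine similarities up to the query's own length, which is the same for every document and so does not change the ranking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: an in-memory inverted index with cosine-style ranking. */
class TinySearch {
    // Inverted file: term -> (document id -> frequency of the term in that document).
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    /** Splits text into lowercase alphanumeric tokens (an assumed, very crude tokenizer). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Indexing component: stores a document and returns its id. */
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new HashMap<>()).merge(id, 1, Integer::sum);
        }
        return id;
    }

    String get(int id) {
        return docs.get(id);
    }

    /** Querying and ranking component: ids of matching documents, best score first. */
    List<Integer> search(String query) {
        int n = docs.size();
        double[] dot = new double[n];     // query-document dot products
        double[] normSq = new double[n];  // squared document vector lengths
        Set<String> queryTerms = new HashSet<>(tokenize(query));
        // Document lengths are recomputed on every query to keep the sketch short;
        // a real system would compute them once at indexing time.
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            double idf = Math.log(1.0 + (double) n / e.getValue().size());
            boolean inQuery = queryTerms.contains(e.getKey());
            for (Map.Entry<Integer, Integer> p : e.getValue().entrySet()) {
                double w = (1.0 + Math.log(p.getValue())) * idf;
                normSq[p.getKey()] += w * w;
                if (inQuery) {
                    dot[p.getKey()] += w * idf;
                }
            }
        }
        double[] score = new double[n];
        List<Integer> hits = new ArrayList<>();
        for (int d = 0; d < n; d++) {
            if (dot[d] > 0) {
                score[d] = dot[d] / Math.sqrt(normSq[d]);
                hits.add(d);
            }
        }
        hits.sort((a, b) -> Double.compare(score[b], score[a]));
        return hits;
    }

    public static void main(String[] args) {
        // A single file whose paragraphs (separated by blank lines) act as documents.
        String file = "Huffman coding compresses text.\n\n"
                + "Inverted files index text for querying.\n\n"
                + "The cosine measure ranks documents.";
        TinySearch index = new TinySearch();
        for (String paragraph : file.split("\\n\\s*\\n")) {
            index.add(paragraph);
        }
        for (int id : index.search("text")) {
            System.out.println(id + ": " + index.get(id));
        }
    }
}
```

The sketch keeps everything in memory and omits compression, stemming, and stop words. A project would normally store the index on disk and handle a much larger corpus.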