Welcome to Corpus C4, the joint corpus of the Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS), the Austrian Academy Corpus (AAC), the Korpus Südtirol and the Swiss Text Corpus (CHTK).
These pages provide general information about Corpus C4 as well as links to the corpus search interface.
Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)
DWDS aims to develop a dictionary system based on large, balanced electronic corpora. The project is planned in three consecutive phases: building the text corpora, processing the data with computational linguistic techniques, and lexicographical work. The publicly available data is linguistically pre-processed and can be accessed online (www.dwds.de). Since 2007, DWDS has been an Academy Project of the BBAW.
Austrian Academy Corpus (AAC)
The Austrian Academy Corpus (AAC) is being compiled in Austria as part of a national corpus programme that pays particular attention to "preservation of cultural heritage" and "text technology". On completion, the AAC will contain about 1 billion running words.
Korpus Südtirol
The 'Korpus Südtirol' initiative aims to collect, archive and process South Tyrolean texts with corpus linguistic methods, in order to make them available to the public and to document the use of written German in South Tyrol. It was started in September 2005 by researchers at the Free University of Bolzano, the European Academy Bozen/Bolzano and the University of Innsbruck.
The resulting text corpus now allows the authentic language data to be analysed according to various criteria, for use in studies of historical linguistics, sociolinguistics, contact linguistics and language varieties. It will also serve as a basis for language teaching and consulting. The texts collected in South Tyrol, as well as further corpora in other languages, can be explored through a user-friendly corpus search interface, which offers an opportunity to strengthen language awareness within the multilingual environment of South Tyrol.
Swiss Text Corpus (CHTK)
The Swiss sub-project, the Swiss Text Corpus (CHTK), contains German texts written by Swiss authors in the 20th century. This digital collection is structured analogously to the partner projects in Germany, Austria and Italy (using the same formal, time and content criteria). It is a balanced representation of standard German vocabulary in Switzerland and can serve as a base resource for specifically Swiss lexicographical needs.
The Swiss Text Corpus was built by a research group at the Deutsches Seminar of the University of Basel and was funded mainly by the Swiss National Science Foundation. Since 2014 it has been hosted by the Schweizerisches Idiotikon with financial support from the Swiss Academy of Humanities and Social Sciences.
Sub-corpus DWDS
| | documents 1900-1924 | words 1900-1924 | documents 1925-1949 | words 1925-1949 | documents 1950-1974 | words 1950-1974 | documents 1975-1999 | words 1975-1999 | total documents |
|---|---|---|---|---|---|---|---|---|---|
| functional texts | 51 | 1'003'921 | 14 | 1'016'655 | 22 | 1'002'770 | 13 | 1'002'308 | 100 |
| science | 26 | 1'018'092 | 11 | 1'013'052 | 15 | 1'013'032 | 14 | 1'030'364 | 66 |
| fiction | 58 | 1'507'331 | 49 | 1'500'076 | 14 | 1'501'462 | 22 | 1'500'048 | 143 |
| journalistic texts | NA | 1'508'012 | NA | 1'503'202 | NA | 1'500'002 | NA | 1'500'000 | NA |
| total | | 5'037'356 | | 5'032'985 | | 5'017'266 | | 5'032'720 | |
Sub-corpus AAC
| | documents 1900-1924 | words 1900-1924 | documents 1925-1949 | words 1925-1949 | documents 1950-1974 | words 1950-1974 | documents 1975-1999 | words 1975-1999 | total documents |
|---|---|---|---|---|---|---|---|---|---|
| functional texts | 2 | 106'275 | 1 | 27'137 | 0 | 0 | 0 | 0 | 3 |
| factual texts | 23 | 1'811'433 | 16 | 794'136 | 1 | 27'037 | 0 | 0 | 40 |
| fiction | 4 | 153'754 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| journalistic texts | 2 | 363'309 | 4 | 355'019 | 0 | 0 | 0 | 0 | 6 |
| | 5 | 320'344 | 2 | 92'399 | 1 | 65'275 | 0 | 0 | 8 |
| total | 36 | 2'755'115 | 23 | 1'268'691 | 2 | 92'312 | 0 | 0 | 61 |
Sub-corpus Korpus Südtirol
| | documents 1900-1924 | words 1900-1924 | documents 1925-1949 | words 1925-1949 | documents 1950-1974 | words 1950-1974 | documents 1975-1999 | words 1975-1999 | total documents |
|---|---|---|---|---|---|---|---|---|---|
| functional texts | 946 | 259'842 | 0 | 0 | 8 | 152'209 | 2 | 10'280 | 956 |
| factual texts | 0 | 0 | 0 | 0 | 3 | 70'955 | 3 | 126'489 | 6 |
| fiction | 0 | 0 | 1 | 52'975 | 0 | 0 | 21 | 222'546 | 22 |
| journalistic texts | 0 | 0 | 0 | 0 | 58 | 75'040 | 383 | 447'463 | 441 |
| total | 946 | 259'842 | 1 | 52'975 | 69 | 298'204 | 409 | 806'778 | 1'425 |
Sub-corpus CHTK
| | documents 1900-1924 | words 1900-1924 | documents 1925-1949 | words 1925-1949 | documents 1950-1974 | words 1950-1974 | documents 1975-1999 | words 1975-1999 | total documents |
|---|---|---|---|---|---|---|---|---|---|
| functional texts | 1'042 | 1'170'099 | 1'465 | 1'267'731 | 969 | 1'193'200 | 1'417 | 1'087'395 | 4'893 |
| factual texts | 167 | 1'450'562 | 433 | 2'052'909 | 804 | 1'954'529 | 276 | 1'891'373 | 1'680 |
| fiction | 188 | 1'116'820 | 50 | 1'248'911 | 159 | 1'122'447 | 59 | 1'149'111 | 456 |
| journalistic texts | 833 | 513'728 | 1'107 | 1'020'160 | 993 | 982'098 | 1'929 | 1'135'426 | 4'862 |
| total | 2'230 | 4'251'209 | 3'055 | 5'589'711 | 2'925 | 5'252'274 | 3'681 | 5'263'305 | 11'891 |
Project structure
The members of the Corpus C4 initiative are the Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS) from Berlin, the Austrian Academy Corpus (AAC) from Vienna, the Korpus Südtirol from Bolzano/Bozen and the Swiss Text Corpus (CHTK) from Basel/Zurich.
The project aims to provide access to a balanced corpus of 20th-century Standard German, with particular attention to regional variation. At the moment, the corpus contains 20 million running words from DWDS, 4.1 million words from AAC, 1.7 million words from Korpus Südtirol and 20 million words from CHTK.
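As a quick cross-check of these figures, the following sketch sums the per-period word counts from the "total" rows of the four sub-corpus tables. The helper names (`parse_count`, `corpus_sizes`) are illustrative only and not part of any Corpus C4 tooling; the one assumption is that the apostrophe in the tables is a thousands separator.

```python
# Per-period word totals, copied from the "total" row of each sub-corpus table.
# The apostrophe is assumed to be a thousands separator, as in the tables.
TOTAL_WORDS = {
    "DWDS": ["5'037'356", "5'032'985", "5'017'266", "5'032'720"],
    "AAC": ["2'755'115", "1'268'691", "92'312", "0"],
    "Korpus Südtirol": ["259'842", "52'975", "298'204", "806'778"],
    "CHTK": ["4'251'209", "5'589'711", "5'252'274", "5'263'305"],
}


def parse_count(text: str) -> int:
    """Convert a count such as "1'003'921" to an integer."""
    return int(text.replace("'", ""))


def corpus_sizes(table: dict) -> dict:
    """Sum the per-period word counts for every sub-corpus."""
    return {
        name: sum(parse_count(cell) for cell in cells)
        for name, cells in table.items()
    }


if __name__ == "__main__":
    for name, words in corpus_sizes(TOTAL_WORDS).items():
        print(f"{name}: {words:,} words (~{words / 1e6:.1f} million)")
```

With the table values as given, this yields roughly 20.1 million words for DWDS, 4.1 million for AAC, 1.4 million for Korpus Südtirol and 20.4 million for CHTK. The tables and the rounded figures in the text may reflect different states of the corpus.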
Corpus query system
A special technical feature of Corpus C4 is its distributed query system. Each sub-project hosts its data on its own servers; results are collected from the different servers only at query time and combined on a single page. Hence, Corpus C4 as a whole is a virtual corpus. Technically, this is mainly achieved using the functions of the linguistic search engine DDC, developed by the Berlin sub-project DWDS.
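The general idea of such a distributed query can be sketched as follows: one query is sent to every sub-corpus in parallel, and the results are merged only once they arrive. This is a minimal illustration and not the DDC implementation or its API. The `Hit` type, the stub backends and the example sentences are all hypothetical, and each "server" is simulated by an in-process function, where a real system would send a network request to the partner's server.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hit:
    corpus: str  # sub-corpus that returned the hit
    text: str    # matching text snippet


# Hypothetical stand-in for a remote sub-corpus server:
# a function from a query string to a list of matching snippets.
Backend = Callable[[str], List[str]]


def make_stub_backend(sentences: List[str]) -> Backend:
    """Build a fake backend doing case-insensitive substring search."""
    def search(query: str) -> List[str]:
        return [s for s in sentences if query.lower() in s.lower()]
    return search


def federated_search(query: str, backends: Dict[str, Backend]) -> List[Hit]:
    """Send the query to every backend in parallel and merge the results."""
    hits: List[Hit] = []
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {name: pool.submit(search, query)
                   for name, search in backends.items()}
        # Merge in the order the backends were registered.
        for name, future in futures.items():
            hits.extend(Hit(name, text) for text in future.result())
    return hits


# Invented example data; the real corpora live on the partners' own servers.
BACKENDS = {
    "DWDS": make_stub_backend(["Das Haus steht am See.", "Ein Baum im Garten."]),
    "AAC": make_stub_backend(["Das Haus ist alt."]),
    "Korpus Südtirol": make_stub_backend(["Der Hof liegt im Tal."]),
    "CHTK": make_stub_backend(["Das Haus hat ein Velo im Keller."]),
}

if __name__ == "__main__":
    for hit in federated_search("Haus", BACKENDS):
        print(f"[{hit.corpus}] {hit.text}")
```

Because no data is copied into a central store, the combined result list exists only for the duration of the query, which is what makes the corpus "virtual" in the sense described above.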